[gcc r15-628] tree-into-ssa: speed up sorting in prune_unused_phi_nodes [PR114480]

2024-05-17 Thread Alexander Monakov via Gcc-cvs
https://gcc.gnu.org/g:4b9e68a6f3b22800a7f12b58ef6b25e3b339bb3c

commit r15-628-g4b9e68a6f3b22800a7f12b58ef6b25e3b339bb3c
Author: Alexander Monakov 
Date:   Wed May 15 16:23:17 2024 +0300

tree-into-ssa: speed up sorting in prune_unused_phi_nodes [PR114480]

In PR 114480 we are hitting a case where tree-into-ssa scales
quadratically due to prune_unused_phi_nodes doing O(N log N)
work for N basic blocks, for each variable individually.
Sorting the 'defs' array is especially costly.

It is possible to assist gcc_qsort by laying out dfs_out entries
in the reverse order in the 'defs' array, starting from its tail.
This is not always a win (in fact it flips most of 7-element qsorts
in this testcase from 9 comparisons (best case) to 15 (worst case)),
but overall it helps on the testcase and on libstdc++ build.
On the testcase we go from 1.28e9 comparator invocations to 1.05e9,
on libstdc++ from 2.91e6 to 2.84e6.

gcc/ChangeLog:

PR c++/114480
* tree-into-ssa.cc (prune_unused_phi_nodes): Add dfs_out entries
to the 'defs' array in the reverse order.

Diff:
---
 gcc/tree-into-ssa.cc | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-into-ssa.cc b/gcc/tree-into-ssa.cc
index 3732c269ca3d..5b367c358125 100644
--- a/gcc/tree-into-ssa.cc
+++ b/gcc/tree-into-ssa.cc
@@ -805,21 +805,22 @@ prune_unused_phi_nodes (bitmap phis, bitmap kills, bitmap 
uses)
  locate the nearest dominating def in logarithmic time by binary search.*/
   bitmap_ior (to_remove, kills, phis);
   n_defs = bitmap_count_bits (to_remove);
-  defs = XNEWVEC (struct dom_dfsnum, 2 * n_defs + 1);
+  adef = 2 * n_defs + 1;
+  defs = XNEWVEC (struct dom_dfsnum, adef);
   defs[0].bb_index = 1;
   defs[0].dfs_num = 0;
-  adef = 1;
+  struct dom_dfsnum *head = defs + 1, *tail = defs + adef;
   EXECUTE_IF_SET_IN_BITMAP (to_remove, 0, i, bi)
 {
   def_bb = BASIC_BLOCK_FOR_FN (cfun, i);
-  defs[adef].bb_index = i;
-  defs[adef].dfs_num = bb_dom_dfs_in (CDI_DOMINATORS, def_bb);
-  defs[adef + 1].bb_index = i;
-  defs[adef + 1].dfs_num = bb_dom_dfs_out (CDI_DOMINATORS, def_bb);
-  adef += 2;
+  head->bb_index = i;
+  head->dfs_num = bb_dom_dfs_in (CDI_DOMINATORS, def_bb);
+  head++, tail--;
+  tail->bb_index = i;
+  tail->dfs_num = bb_dom_dfs_out (CDI_DOMINATORS, def_bb);
 }
+  gcc_checking_assert (head == tail);
   BITMAP_FREE (to_remove);
-  gcc_assert (adef == 2 * n_defs + 1);
   qsort (defs, adef, sizeof (struct dom_dfsnum), cmp_dfsnum);
   gcc_assert (defs[0].bb_index == 1);


Re: issue: unexpected results in optimizations

2023-12-12 Thread Alexander Monakov via Gcc


On Tue, 12 Dec 2023, Jonathan Wakely via Gcc wrote:

> On Mon, 11 Dec 2023, 17:08 Jingwen Wu via Gcc,  wrote:
> 
> > Hello, I'm sorry to bother you. And I have some gcc compiler optimization
> > questions to ask you.
> > First of all, I used csmith tools to generate c files randomly. Meanwhile,
> > the final running result was the checksum for global variables in a c file.
> > For the two c files in the attachment, I performed the equivalent
> > transformation of loop from *initial.**c* to *transformed.c*. And the two
> > files produced different results (i.e. different checksum values) when
> > using *-O2* optimization level, while the results of both were the same
> > when using other levels of optimization such as *-O0*, *-O1*, *-O3*, *-Os*,
> > *-Ofast*.
> > Please help me to explain why this is, thank you.
> >
> 
> Sometimes csmith can generate invalid code that gets miscompiled. It looks
> like you're compiling with no warnings, which is a terrible idea:
> 
> 
> > command line: *gcc file.c -O2 -lm -I $CSMITH_HOME/include && ./a.out*
> >
> 
> You should **at least** enable warnings and make sure gcc isn't pointing
> out any problems in the code.
> 
> You should also try the options suggested at http://gcc.gnu.org/bugs/ which
> help identify invalid code.

Let me also link the "Testing Compilers Using Csmith" page, which is
currently available via the Wayback Machine, but not its original URL:

https://web.archive.org/web/20230316072811/http://embed.cs.utah.edu/csmith/using.html

It was written by the developers of Csmith.

Alexander


NOP_EXPR vs. CONVERT_EXPR

2023-12-05 Thread Alexander Monakov via Gcc
Greetings,

the definitions for NOP_EXPR and CONVERT_EXPR in tree.def, having survived
all the way from 1992, currently say:

/* Represents a conversion of type of a value.
   All conversions, including implicit ones, must be
   represented by CONVERT_EXPR or NOP_EXPR nodes.  */
DEFTREECODE (CONVERT_EXPR, "convert_expr", tcc_unary, 1)

/* Represents a conversion expected to require no code to be generated.  */
DEFTREECODE (NOP_EXPR, "nop_expr", tcc_unary, 1)

Unfortunately, they are confusing, as in

float f(double d)
{
return d;
}

the narrowing conversion is represented with NOP_EXPR, and it is definitely
not a no-op.

Does some clear distinction remain, and is it possible to clarify the
definitions?

Thanks.
Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-11 Thread Alexander Monakov via Gcc-patches



On Tue, 11 Jul 2023, Michael Matz wrote:

> Hey,
> 
> On Tue, 11 Jul 2023, Alexander Monakov via Gcc-patches wrote:
> 
> > > > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
> > > > 
> > > > This is the weak/active form; I'd suggest "preserve_high_sse".
> > > 
> > > But it preserves only the low parts :-)  You swapped the two in your 
> > > mind when writing the reply?
> > 
> > Ahhh. By "high SSE" I mean the high-numbered SSE regs, i.e. xmm8-15, not
> > the higher halves of (unspecified subset of) SSE regs.
> 
> Ah, gotcha :-)  It just shows that all these names are confusing.  Maybe 
> I'll just go with "attribute1" and "attribute2" and rely on docu.  (SCNR)

Heh, that reminds me that decimal digits are allowed in attribute names.
Let me offer "preserve_xmm_8_15" and "only_xmm_0_7" then.

One more thing to keep in mind is interaction with SSE-AVX transition.
If the function with a new attribute is using classic non-VEX-encoded SSE,
but its caller is using 256-bit ymm0-15, it will incur a substantial penalty
on Intel CPUs. There's no penalty on AMD (afaik) and no penalty for zmm16-31,
since those are inaccessible in non-EVEX code.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-11 Thread Alexander Monakov via Gcc-patches


On Tue, 11 Jul 2023, Michael Matz wrote:

> > > To that end I introduce actually two related attributes (for naming
> > > see below):
> > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
> > 
> > This is the weak/active form; I'd suggest "preserve_high_sse".
> 
> But it preserves only the low parts :-)  You swapped the two in your 
> mind when writing the reply?

Ahhh. By "high SSE" I mean the high-numbered SSE regs, i.e. xmm8-15, not
the higher halves of (unspecified subset of) SSE regs.

If you look from AVX viewpoint, yes, it preserves lower 128 bits of the
high-numbered vector registers.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-11 Thread Alexander Monakov via Gcc-patches


On Tue, 11 Jul 2023, Richard Biener wrote:

> > > If a function contains calls then GCC can't know which
> > > parts of the XMM regset is clobbered by that, it may be parts
> > > which don't even exist yet (say until avx2048 comes out), so we must
> > > restrict ourself to only save/restore the SSE2 parts and then of course
> > > can only claim to not clobber those parts.
> >
> > Hm, I guess this is kinda the reason a "weak" form is needed. But this
> > highlights the difference between the two: the "weak" form will actively
> > preserve some state (so it cannot preserve future extensions), while
> > the "strong" form may just passively not touch any state, preserving
> > any state it doesn't know about.
> >
> > > To that end I introduce actually two related attributes (for naming
> > > see below):
> > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
> >
> > This is the weak/active form; I'd suggest "preserve_high_sse".
> 
> Isn't it the opposite?  "preserves_low_sse", unless you suggest
> the name applies to the caller which has to preserve high parts
> when calling nosseclobber.

This is the form where the function annnotated with this attribute
consumes 128 bytes on the stack to "blindly" save/restore xmm8-15
if it calls anything with a vanilla ABI.

(actually thinking about it more, I'd like to suggest shelving this part
and only implement the zero-cost variant, noanysseclobber)

> > > * noanysseclobber: claims (and ensures) that nothing of any of the
> > >   registers overlapping xmm8-15 is clobbered (not even future, as of
> > >   yet unknown, parts)
> >
> > This is the strong/passive form; I'd suggest "only_low_sse".
> 
> Likewise.

Sorry if I managed to sow confusion here. In my mind, this is the form where
only xmm0-xmm7 can be written in the function annotated with the attribute,
including its callees. I was thinking that writing to zmm16-31 would be
disallowed too. The initial example was memcpy, where eight vector registers
are sufficient for the job.

> As for mask registers I understand we'd have to split the 8 register
> set into two halves to make the same approach work, otherwise
> we'd have no registers left to allocate from.

I'd suggest to look how many mask registers OpenMP SIMD AVX-512 clones
can receive as implicit arguments, as one data point.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Alexander Monakov via Gcc-patches
On Mon, 10 Jul 2023, Alexander Monakov wrote:

> > I chose to make it possible to write function definitions with that
> > attribute with GCC adding the necessary callee save/restore code in
> > the xlogue itself.
> 
> But you can't trivially restore if the callee is sibcalling — what
> happens then (a testcase might be nice)?

Sorry, when the caller is doing the sibcall, not the callee.

Alexander


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Alexander Monakov via Gcc-patches


On Mon, 10 Jul 2023, Michael Matz via Gcc-patches wrote:

> Hello,
> 
> the ELF psABI for x86-64 doesn't have any callee-saved SSE
> registers (there were actual reasons for that, but those don't
> matter anymore).  This starts to hurt some uses, as it means that
> as soon as you have a call (say to memmove/memcpy, even if
> implicit as libcall) in a loop that manipulates floating point
> or vector data you get saves/restores around those calls.
> 
> But in reality many functions can be written such that they only need
> to clobber a subset of the 16 XMM registers (or do the save/restore
> themself in the codepaths that needs them, hello memcpy again).
> So we want to introduce a way to specify this, via an ABI attribute
> that basically says "doesn't clobber the high XMM regs".

I think the main question is why you're going with this (weak) form
instead of the (strong) form "may only clobber the low XMM regs":
as Richi noted, surely for libcalls we'd like to know they preserve
AVX-512 mask registers as well?

(I realize this is partially answered later)

Note this interacts with anything that interposes between the caller
and the callee, like the Glibc lazy binding stub (which used to
zero out high halves of 512-bit arguments in ZMM registers).
Not an immediate problem for the patch, just something to mind perhaps.

> I've opted to do only the obvious: do something special only for
> xmm8 to xmm15, without a way to specify the clobber set in more detail.
> I think such half/half split is reasonable, and as I don't want to
> change the argument passing anyway (whose regs are always clobbered)
> there isn't that much wiggle room anyway.
> 
> I chose to make it possible to write function definitions with that
> attribute with GCC adding the necessary callee save/restore code in
> the xlogue itself.

But you can't trivially restore if the callee is sibcalling — what
happens then (a testcase might be nice)?

> Carefully note that this is only possible for
> the SSE2 registers, as other parts of them would need instructions
> that are only optional.

What is supposed to happen on 32-bit x86 with -msse -mno-sse2?

> When a function doesn't contain calls to
> unknown functions we can be a bit more lenient: we can make it so that
> GCC simply doesn't touch xmm8-15 at all, then no save/restore is
> necessary.

What if the source code has a local register variable bound to xmm15,
i.e. register double x asm("xmm15"); asm("..." : "+x"(x)); ?
Probably "dont'd do that", i.e. disallow that in the documentation?

> If a function contains calls then GCC can't know which
> parts of the XMM regset is clobbered by that, it may be parts
> which don't even exist yet (say until avx2048 comes out), so we must
> restrict ourself to only save/restore the SSE2 parts and then of course
> can only claim to not clobber those parts.

Hm, I guess this is kinda the reason a "weak" form is needed. But this
highlights the difference between the two: the "weak" form will actively
preserve some state (so it cannot preserve future extensions), while
the "strong" form may just passively not touch any state, preserving
any state it doesn't know about.

> To that end I introduce actually two related attributes (for naming
> see below):
> * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered

This is the weak/active form; I'd suggest "preserve_high_sse".

> * noanysseclobber: claims (and ensures) that nothing of any of the
>   registers overlapping xmm8-15 is clobbered (not even future, as of
>   yet unknown, parts)

This is the strong/passive form; I'd suggest "only_low_sse".

> Ensuring the first is simple: potentially add saves/restore in xlogue
> (e.g. when xmm8 is either used explicitely or implicitely by a call).
> Ensuring the second comes with more: we must also ensure that no
> functions are called that don't guarantee the same thing (in addition
> to just removing all xmm8-15 parts alltogether from the available
> regsters).
> 
> See also the added testcases for what I intended to support.
> 
> I chose to use the new target independend function-abi facility for
> this.  I need some adjustments in generic code:
> * the "default_abi" is actually more like a "current" abi: it happily
>   changes its contents according to conditional_register_usage,
>   and other code assumes that such changes do propagate.
>   But if that conditonal_reg_usage is actually done because the current
>   function is of a different ABI, then we must not change default_abi.
> * in insn_callee_abi we do look at a potential fndecl for a call
>   insn (only set when -fipa-ra), but doesn't work for calls through
>   pointers and (as said) is optional.  So, also always look at the
>   called functions type (it's always recorded in the MEM_EXPR for
>   non-libcalls), before asking the target.
>   (The function-abi accessors working on trees were already doing that,
>   its just the RTL accessor that missed this)
> 
> Accordingly I also implement some 

Re: [PATCH] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-10 Thread Alexander Monakov via Gcc-patches


On Mon, 10 Jul 2023, liuhongt via Gcc-patches wrote:

> False dependency happens when destination is only updated by
> pternlog. There is no false dependency when destination is also used
> in source. So either a pxor should be inserted, or input operand
> should be set with constraint '0'.
> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ready to push to trunk.

Shouldn't this patch also remove uses of vpternlog in
standard_sse_constant_opcode?

A couple more questions below:

> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1382,6 +1382,29 @@ (define_insn "mov_internal"
> ]
> (symbol_ref "true")))])
>  
> +; False dependency happens on destination register which is not really
> +; used when moving all ones to vector register
> +(define_split
> +  [(set (match_operand:VMOVE 0 "register_operand")
> + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))]
> +  "TARGET_AVX512F && reload_completed
> +  && ( == 64 || EXT_REX_SSE_REG_P (operands[0]))
> +  && optimize_function_for_speed_p (cfun)"

Yan's patch used optimize_insn_for_speed_p (), which looks more appropriate.
Doesn't it work here as well?

> +  [(set (match_dup 0) (match_dup 2))
> +   (parallel
> + [(set (match_dup 0) (match_dup 1))
> +  (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> +  "operands[2] = CONST0_RTX (mode);")
> +
> +(define_insn "*vmov_constm1_pternlog_false_dep"
> +  [(set (match_operand:VMOVE 0 "register_operand" "=v")
> + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand" 
> ""))
> +   (unspec [(match_operand:VMOVE 2 "register_operand" "0")] 
> UNSPEC_INSN_FALSE_DEP)]
> +   "TARGET_AVX512VL ||  == 64"
> +   "vpternlogd\t{$0xFF, %0, %0, %0|%0, %0, %0, 0xFF}"
> +  [(set_attr "type" "sselog1")
> +   (set_attr "prefix" "evex")])
> +
>  ;; If mem_addr points to a memory region with less than whole vector size 
> bytes
>  ;; of accessible memory and k is a mask that would prevent reading the 
> inaccessible
>  ;; bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it to be transformed 
> to vpblendd
> @@ -9336,7 +9359,7 @@ (define_expand "_cvtmask2"
>  operands[3] = CONST0_RTX (mode);
>}")
>  
> -(define_insn "*_cvtmask2"
> +(define_insn_and_split "*_cvtmask2"
>[(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v,v")
>   (vec_merge:VI48_AVX512VL
> (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> @@ -9346,11 +9369,35 @@ (define_insn "*_cvtmask2"
>"@
> vpmovm2\t{%1, %0|%0, %1}
> vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> %0, %0, 0x81}"
> +  "&& !TARGET_AVX512DQ && reload_completed
> +   && optimize_function_for_speed_p (cfun)"
> +  [(set (match_dup 0) (match_dup 4))
> +   (parallel
> +[(set (match_dup 0)
> +   (vec_merge:VI48_AVX512VL
> + (match_dup 2)
> + (match_dup 3)
> + (match_dup 1)))
> + (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> +  "operands[4] = CONST0_RTX (mode);"
>[(set_attr "isa" "avx512dq,*")
> (set_attr "length_immediate" "0,1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>  
> +(define_insn "*_cvtmask2_pternlog_false_dep"
> +  [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
> + (vec_merge:VI48_AVX512VL
> +   (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> +   (match_operand:VI48_AVX512VL 3 "const0_operand")
> +   (match_operand: 1 "register_operand" "Yk")))
> +   (unspec [(match_operand:VI48_AVX512VL 4 "register_operand" "0")] 
> UNSPEC_INSN_FALSE_DEP)]
> +  "TARGET_AVX512F && !TARGET_AVX512DQ"
> +  "vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> %0, %0, 0x81}"
> +  [(set_attr "length_immediate" "1")
> +   (set_attr "prefix" "evex")
> +   (set_attr "mode" "")])
> +
>  (define_expand "extendv2sfv2df2"
>[(set (match_operand:V2DF 0 "register_operand")
>   (float_extend:V2DF
> @@ -17166,20 +17213,32 @@ (define_expand "one_cmpl2"
>  operands[2] = force_reg (mode, operands[2]);
>  })
>  
> -(define_insn "one_cmpl2"
> -  [(set (match_operand:VI 0 "register_operand" "=v,v")
> - (xor:VI (match_operand:VI 1 "bcst_vector_operand" "vBr,m")
> - (match_operand:VI 2 "vector_all_ones_operand" "BC,BC")))]
> +(define_insn_and_split "one_cmpl2"
> +  [(set (match_operand:VI 0 "register_operand" "=v,v,v")
> + (xor:VI (match_operand:VI 1 "bcst_vector_operand" " 0, m,Br")
> + (match_operand:VI 2 "vector_all_ones_operand" "BC,BC,BC")))]
>"TARGET_AVX512F
> && (!
> || mode == SImode
> || mode == DImode)"
>  {
> +  if (! && which_alternative
> +  && optimize_function_for_speed_p (cfun))
> +return "#";
> +
>if (TARGET_AVX512VL)
>  return "vpternlog\t{$0x55, %1, %0, 
> %0|%0, %0, %1, 0x55}";
>else
>  return "vpternlog\t{$0x55, %g1, %g0, 
> %g0|%g0, %g0, %g1, 0x55}";
>  }
> +  "&& reload_completed && !REG_P (operands[1]) && !
> +   && optimize_function_for_speed_p (cfun)"
> 

Re: [PATCH] c-family: implement -ffp-contract=on

2023-06-19 Thread Alexander Monakov via Gcc-patches


Ping. OK for trunk?

On Mon, 5 Jun 2023, Alexander Monakov wrote:

> Ping for the front-end maintainers' input.
> 
> On Mon, 22 May 2023, Richard Biener wrote:
> 
> > On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches
> >  wrote:
> > >
> > > Implement -ffp-contract=on for C and C++ without changing default
> > > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN).
> > 
> > The documentation changes mention the defaults are changed for
> > standard modes, I suppose you want to remove that hunk.
> > 
> > > gcc/c-family/ChangeLog:
> > >
> > > * c-gimplify.cc (fma_supported_p): New helper.
> > > (c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA
> > > contraction.
> > >
> > > gcc/ChangeLog:
> > >
> > > * common.opt (fp_contract_mode) [on]: Remove fallback.
> > > * config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test.
> > > * doc/invoke.texi (-ffp-contract): Update.
> > > * trans-mem.cc (diagnose_tm_1): Skip internal function calls.
> > > ---
> > >  gcc/c-family/c-gimplify.cc | 78 ++
> > >  gcc/common.opt |  3 +-
> > >  gcc/config/sh/sh.md|  2 +-
> > >  gcc/doc/invoke.texi|  8 ++--
> > >  gcc/trans-mem.cc   |  3 ++
> > >  5 files changed, 88 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
> > > index ef5c7d919f..f7635d3b0c 100644
> > > --- a/gcc/c-family/c-gimplify.cc
> > > +++ b/gcc/c-family/c-gimplify.cc
> > > @@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
> > >  #include "c-ubsan.h"
> > >  #include "tree-nested.h"
> > >  #include "context.h"
> > > +#include "tree-pass.h"
> > > +#include "internal-fn.h"
> > >
> > >  /*  The gimplification pass converts the language-dependent trees
> > >  (ld-trees) emitted by the parser into language-independent trees
> > > @@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree 
> > > body)
> > >return bind;
> > >  }
> > >
> > > +/* Helper for c_gimplify_expr: test if target supports fma-like FN.  */
> > > +
> > > +static bool
> > > +fma_supported_p (enum internal_fn fn, tree type)
> > > +{
> > > +  return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH);
> > > +}
> > > +
> > >  /* Gimplification of expression trees.  */
> > >
> > >  /* Do C-specific gimplification on *EXPR_P.  PRE_P and POST_P are as in
> > > @@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p 
> > > ATTRIBUTE_UNUSED,
> > > break;
> > >}
> > >
> > > +case PLUS_EXPR:
> > > +case MINUS_EXPR:
> > > +  {
> > > +   tree type = TREE_TYPE (*expr_p);
> > > +   /* For -ffp-contract=on we need to attempt FMA contraction only
> > > +  during initial gimplification.  Late contraction across 
> > > statement
> > > +  boundaries would violate language semantics.  */
> > > +   if (SCALAR_FLOAT_TYPE_P (type)
> > > +   && flag_fp_contract_mode == FP_CONTRACT_ON
> > > +   && cfun && !(cfun->curr_properties & PROP_gimple_any)
> > > +   && fma_supported_p (IFN_FMA, type))
> > > + {
> > > +   bool neg_mul = false, neg_add = code == MINUS_EXPR;
> > > +
> > > +   tree *op0_p = _OPERAND (*expr_p, 0);
> > > +   tree *op1_p = _OPERAND (*expr_p, 1);
> > > +
> > > +   /* Look for ±(x * y) ± z, swapping operands if necessary.  */
> > > +   if (TREE_CODE (*op0_p) == NEGATE_EXPR
> > > +   && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR)
> > > + /* '*EXPR_P' is '-(x * y) ± z'.  This is fine.  */;
> > > +   else if (TREE_CODE (*op0_p) != MULT_EXPR)
> > > + {
> > > +   std::swap (op0_p, op1_p);
> > > +   std::swap (neg_mul, neg_add);
> > > + }
> > > +   if (TREE_CODE (*op0_p) == NEGATE_EXPR)
> > > + {
> > > +   op0_p = _OPERAND (*op0_p, 0);
> > > +   neg_mul 

Re: gcc tricore porting

2023-06-19 Thread Alexander Monakov via Gcc


On Mon, 19 Jun 2023, Mikael Pettersson via Gcc wrote:

> (Note I'm reading the gcc mailing list via the Web archives, which
> doesn't let me create "proper" replies. Oh well.)

(there's a public-inbox instance at https://inbox.sourceware.org/gcc/
but some messages are not available there)

Alexander


Re: Applying extended assembly parsing to basic asm

2023-06-13 Thread Alexander Monakov via Gcc


On Wed, 14 Jun 2023, Julian Waters via Gcc wrote:

> Hi all,
> 
> Would it be acceptable for a changeset that applies the parsing of string
> literals in extended assembly blocks to the assembly templates inside basic
> asm to be mailed here? As I see it, there is no reason to keep this
> behaviour from basic asm statements and it can prove to be very useful, and
> I'd be happy to submit a patch for this new behaviour to be implemented.
> Basic asm would continue to behave the same as before otherwise

That would be an incompatible change. Can you expand on the expected usefulness?

Alexander


Re: [PATCH] c-family: implement -ffp-contract=on

2023-06-05 Thread Alexander Monakov via Gcc-patches
Ping for the front-end maintainers' input.

On Mon, 22 May 2023, Richard Biener wrote:

> On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches
>  wrote:
> >
> > Implement -ffp-contract=on for C and C++ without changing default
> > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN).
> 
> The documentation changes mention the defaults are changed for
> standard modes, I suppose you want to remove that hunk.
> 
> > gcc/c-family/ChangeLog:
> >
> > * c-gimplify.cc (fma_supported_p): New helper.
> > (c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA
> > contraction.
> >
> > gcc/ChangeLog:
> >
> > * common.opt (fp_contract_mode) [on]: Remove fallback.
> > * config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test.
> > * doc/invoke.texi (-ffp-contract): Update.
> > * trans-mem.cc (diagnose_tm_1): Skip internal function calls.
> > ---
> >  gcc/c-family/c-gimplify.cc | 78 ++
> >  gcc/common.opt |  3 +-
> >  gcc/config/sh/sh.md|  2 +-
> >  gcc/doc/invoke.texi|  8 ++--
> >  gcc/trans-mem.cc   |  3 ++
> >  5 files changed, 88 insertions(+), 6 deletions(-)
> >
> > diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
> > index ef5c7d919f..f7635d3b0c 100644
> > --- a/gcc/c-family/c-gimplify.cc
> > +++ b/gcc/c-family/c-gimplify.cc
> > @@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "c-ubsan.h"
> >  #include "tree-nested.h"
> >  #include "context.h"
> > +#include "tree-pass.h"
> > +#include "internal-fn.h"
> >
> >  /*  The gimplification pass converts the language-dependent trees
> >  (ld-trees) emitted by the parser into language-independent trees
> > @@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree 
> > body)
> >return bind;
> >  }
> >
> > +/* Helper for c_gimplify_expr: test if target supports fma-like FN.  */
> > +
> > +static bool
> > +fma_supported_p (enum internal_fn fn, tree type)
> > +{
> > +  return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH);
> > +}
> > +
> >  /* Gimplification of expression trees.  */
> >
> >  /* Do C-specific gimplification on *EXPR_P.  PRE_P and POST_P are as in
> > @@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p 
> > ATTRIBUTE_UNUSED,
> > break;
> >}
> >
> > +case PLUS_EXPR:
> > +case MINUS_EXPR:
> > +  {
> > +   tree type = TREE_TYPE (*expr_p);
> > +   /* For -ffp-contract=on we need to attempt FMA contraction only
> > +  during initial gimplification.  Late contraction across statement
> > +  boundaries would violate language semantics.  */
> > +   if (SCALAR_FLOAT_TYPE_P (type)
> > +   && flag_fp_contract_mode == FP_CONTRACT_ON
> > +   && cfun && !(cfun->curr_properties & PROP_gimple_any)
> > +   && fma_supported_p (IFN_FMA, type))
> > + {
> > +   bool neg_mul = false, neg_add = code == MINUS_EXPR;
> > +
> > +   tree *op0_p = _OPERAND (*expr_p, 0);
> > +   tree *op1_p = _OPERAND (*expr_p, 1);
> > +
> > +   /* Look for ±(x * y) ± z, swapping operands if necessary.  */
> > +   if (TREE_CODE (*op0_p) == NEGATE_EXPR
> > +   && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR)
> > + /* '*EXPR_P' is '-(x * y) ± z'.  This is fine.  */;
> > +   else if (TREE_CODE (*op0_p) != MULT_EXPR)
> > + {
> > +   std::swap (op0_p, op1_p);
> > +   std::swap (neg_mul, neg_add);
> > + }
> > +   if (TREE_CODE (*op0_p) == NEGATE_EXPR)
> > + {
> > +   op0_p = _OPERAND (*op0_p, 0);
> > +   neg_mul = !neg_mul;
> > + }
> > +   if (TREE_CODE (*op0_p) != MULT_EXPR)
> > + break;
> > +   auto_vec ops (3);
> > +   ops.quick_push (TREE_OPERAND (*op0_p, 0));
> > +   ops.quick_push (TREE_OPERAND (*op0_p, 1));
> > +   ops.quick_push (*op1_p);
> > +
> > +   enum internal_fn ifn = IFN_FMA;
> > +   if (neg_mul)
> > + {
> > +   if (fma_supported_p (IFN_FNMA, type))
> > + ifn = IFN_F

Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-06-02 Thread Alexander Monakov via Gcc-patches


On Fri, 2 Jun 2023, Matthias Kretz wrote:

> > Okay, I see opinions will vary here. I was thinking about our immintrin.h
> > which is partially implemented in terms of generic vectors. Imagine we
> > extend UBSan to trap on signed overflow for vector types. I expect that
> > will blow up on existing code that uses Intel intrinsics.
> 
> _mm_add_epi32 is already implemented via __v4su addition (i.e. unsigned). So 
> the intrinsic would continue to wrap on signed overflow.

Ah, if our intrinsics take care of it, that alleviates my concern.

> > I'm not sure what you consider a breaking change here. Is that the implied
> > threat to use undefinedness for range deduction and other optimizations?
> 
> Consider the stdx::simd implementation. It currently follows semantics of the 
> builtin types. So simd can be shifted by 30 without UB. The 
> implementation of the shift operator depends on the current behavior, even if 
> it is target-dependent. For PPC the simd implementation adds extra code to 
> avoid the "UB". With nailing down shifts > sizeof(T) as UB this extra code 
> now 
> needs to be added for all targets.

What does stdx::simd do on LLVM, where that has always been UB even on x86?

Alexander


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-06-02 Thread Alexander Monakov via Gcc-patches


On Fri, 2 Jun 2023, Matthias Kretz wrote:

> On Thursday, 1 June 2023 20:25:14 CEST Alexander Monakov wrote:
> > On Wed, 31 May 2023, Richard Biener wrote:
> > > So yes, we probably should clarify the semantics to match the
> > > implementation (since we have two targets doing things differently
> > > since forever we can only document it as UB) and also note the
> > > difference from OpenCL (in case OpenCL is still relevant these
> > > days we might want to offer a -fopencl-vectors to emit the required
> > > AND).
> > 
> > It doesn't have to be UB, in principle we could say that shift amount
> > is taken modulo some power of two depending on the target without UB.
> > But since LLVM already treats that as UB, we might as well follow.
> 
> I prefer UB (as your patch states ). If a user requires the AND, let them 
> state it explicitly. Don't let everybody pay in performance.

What I suggested does not imply a performance cost. All targets take some
lower bits of the shift amount anyway. It's only OpenCL's exact masking
that would imply a performance cost (and I agree it's inappropriate for
GCC's generic vectors).

> > I think for addition/multiplication of signed vectors everybody
> > expects them to have wrapping semantics without UB on overflow though?
> 
>   simd x = ...;
>   bool t = all_of(x < x + 1); // unconditionally true or not?
> 
> I'd expect t to be unconditionally true. Because simd simply is a data-
> parallel version of int.

Okay, I see opinions will vary here. I was thinking about our immintrin.h
which is partially implemented in terms of generic vectors. Imagine we
extend UBSan to trap on signed overflow for vector types. I expect that
will blow up on existing code that uses Intel intrinsics. But use of
generic vectors in immintrin.h is our implementation detail, and people
might have expected intrinsics to be overflow-safe, like for aliasing
(where we use __attribute__((may_alias)) in immintrin.h). Although, we
can solve that by inventing overflow-wraps attribute for types, maybe?

> > Revised patch below.
> 
> This can be considered a breaking change. Does it need a mention in the 
> release notes?

I'm not sure what you consider a breaking change here. Is that the implied
threat to use undefinedness for range deduction and other optimizations?

Thanks.
Alexander


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-06-01 Thread Alexander Monakov via Gcc-patches


On Wed, 31 May 2023, Richard Biener wrote:

> On Tue, May 30, 2023 at 4:49 PM Alexander Monakov  wrote:
> >
> >
> > On Thu, 25 May 2023, Richard Biener wrote:
> >
> > > On Wed, May 24, 2023 at 8:36 PM Alexander Monakov  
> > > wrote:
> > > >
> > > >
> > > > On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:
> > > >
> > > > > I’d have to check the ISAs what they actually do here - it of course 
> > > > > depends
> > > > > on RTL semantics as well but as you say those are not strictly 
> > > > > defined here
> > > > > either.
> > > >
> > > > Plus, we can add the following executable test to the testsuite:
> > >
> > > Yeah, that's probably a good idea.  I think your documentation change
> > > with the added sentence about the truncation is OK.
> >
> > I am no longer confident in my patch, sorry.
> >
> > My claim about vector shift semantics in OpenCL was wrong. In fact it 
> > specifies
> > that RHS of a vector shift is masked to the exact bitwidth of the element 
> > type.
> >
> > So, to collect various angles:
> >
> > 1. OpenCL semantics would need an 'AND' before a shift (except VSX/Altivec).
> >
> > 2. From user side we had a request to follow C integer promotion semantics
> >in https://gcc.gnu.org/PR91838 but I now doubt we can do that.
> >
> > 3. LLVM makes oversized vector shifts UB both for 'vector_size' and
> >'ext_vector_type'.
> 
> I had the impression GCC desired to do 3. as well, matching what we do
> for scalar shifts.
> 
> > 4. Vector lowering does not emit promotions, and starting from gcc-12
> >ranger treats oversized shifts according to the documentation you
> >cite below, and optimizes (e.g. with '-O2 -mno-sse')
> >
> > typedef short v8hi __attribute__((vector_size(16)));
> >
> > void f(v8hi *p)
> > {
> > *p >>= 16;
> > }
> >
> >to zeroing '*p'. If this looks unintended, I can file a bug.
> >
> > I still think we need to clarify semantics of vector shifts, but probably
> > not in the way I proposed initially. What do you think?
> 
> I think the intent at some point was to adhere to the OpenCL spec
> for the GCC vector extension (because that's a written spec while
> GCCs vector extension docs are lacking).  Originally the powerpc
> altivec 'vector' keyword spurred most of the development IIRC
> so it might be useful to see how they specify shifts.

It doesn't look like they document the semantics of '<<' and '>>'
operators for vector types.

> So yes, we probably should clarify the semantics to match the
> implementation (since we have two targets doing things differently
> since forever we can only document it as UB) and also note the
> difference from OpenCL (in case OpenCL is still relevant these
> days we might want to offer a -fopencl-vectors to emit the required
> AND).

It doesn't have to be UB, in principle we could say that shift amount
is taken modulo some power of two depending on the target without UB.
But since LLVM already treats that as UB, we might as well follow.

I think for addition/multiplication of signed vectors everybody
expects them to have wrapping semantics without UB on overflow though?

Revised patch below.

> It would be also good to amend the RTL documentation.
> 
> It would be very nice to start an internals documentation section
> around collecting what the middle-end considers undefined
> or implementation defined (aka target defined) behavior in the
> GENERIC, GIMPLE and RTL ILs and what predicates eventually
> control that (like TYPE_OVERFLOW_UNDEFINED).  Maybe spread it over
> {gimple,generic,rtl}.texi, though gimple.texi is only about the representation
> and all semantics are shared and documented in generic.texi.

Hm, noted. Thanks.

---8<---

>From e4e8d9e262f2f8dbc91a94291cf7accb74d27e7c Mon Sep 17 00:00:00 2001
From: Alexander Monakov 
Date: Wed, 24 May 2023 15:48:29 +0300
Subject: [PATCH] doc: clarify semantics of vector bitwise shifts

Explicitly say that attempted shift past element bit width is UB for
vector types.  Mention that integer promotions do not happen.

gcc/ChangeLog:

* doc/extend.texi (Vector Extensions): Clarify bitwise shift
semantics.
---
 gcc/doc/extend.texi | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index e426a2eb7d..3723cfe467 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -12026,7 +12026,14 @@ elements in the operand.
 It is possible to use shifting operators @code{<<}, @code{>>} on
 integer-type vectors. The operation is defined as following: @code{@{a0,
 a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
-@dots{}, an >> bn@}}@. Vector operands must have the same number of
+@dots{}, an >> bn@}}@.  Unlike OpenCL, values of @code{b} are not
+implicitly taken modulo bit width of the base type @code{B}, and the behavior
+is undefined if any @code{bi} is greater than or equal to @code{B}.
+
+In contrast to scalar operations in C and 

Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-30 Thread Alexander Monakov via Gcc-patches


On Thu, 25 May 2023, Richard Biener wrote:

> On Wed, May 24, 2023 at 8:36 PM Alexander Monakov  wrote:
> >
> >
> > On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:
> >
> > > I’d have to check the ISAs what they actually do here - it of course 
> > > depends
> > > on RTL semantics as well but as you say those are not strictly defined 
> > > here
> > > either.
> >
> > Plus, we can add the following executable test to the testsuite:
> 
> Yeah, that's probably a good idea.  I think your documentation change
> with the added sentence about the truncation is OK.

I am no longer confident in my patch, sorry.

My claim about vector shift semantics in OpenCL was wrong. In fact it specifies
that RHS of a vector shift is masked to the exact bitwidth of the element type.

So, to collect various angles:

1. OpenCL semantics would need an 'AND' before a shift (except VSX/Altivec).

2. From user side we had a request to follow C integer promotion semantics
   in https://gcc.gnu.org/PR91838 but I now doubt we can do that.

3. LLVM makes oversized vector shifts UB both for 'vector_size' and
   'ext_vector_type'.

4. Vector lowering does not emit promotions, and starting from gcc-12
   ranger treats oversized shifts according to the documentation you
   cite below, and optimizes (e.g. with '-O2 -mno-sse')

typedef short v8hi __attribute__((vector_size(16)));

void f(v8hi *p)
{
*p >>= 16;
}

   to zeroing '*p'. If this looks unintended, I can file a bug.

I still think we need to clarify semantics of vector shifts, but probably
not in the way I proposed initially. What do you think?

Thanks.
Alexander

> Note we have
> 
> /* Shift operations for shift and rotate.
>Shift means logical shift if done on an
>unsigned type, arithmetic shift if done on a signed type.
>The second operand is the number of bits to
>shift by; it need not be the same type as the first operand and result.
>Note that the result is undefined if the second operand is larger
>than or equal to the first operand's type size.
> 
>The first operand of a shift can have either an integer or a
>(non-integer) fixed-point type.  We follow the ISO/IEC TR 18037:2004
>semantics for the latter.
> 
>Rotates are defined for integer types only.  */
> DEFTREECODE (LSHIFT_EXPR, "lshift_expr", tcc_binary, 2)
> 
> in tree.def which implies short << 24 is undefined behavior (similar
> wording in generic.texi).  The rtl docs say nothing about behavior
> but I think the semantics should carry over.  That works for x86
> even for scalar instructions working on GPRs (masking is applied
> but fixed to 5 or 6 bits even for QImode or HImode shifts).
> 
> Note that when we make these shifts well-defined there's
> also arithmetic on signed types smaller than int (which again
> doesn't exist in C) where overflow invokes undefined behavior
> in the middle-end.  Unless we want to change that as well
> this is somewhat inconsistent then.
> 
> There's also the issue that C 'int' is defined by INT_TYPE_SIZE
> and thus target dependent which makes what is undefined and
> what not target dependent.
> 
> Richard.
> 
> > #include 
> >
> > #define CHECK(TYPE, WIDTH, OP, COUNT, INVERT) \
> > { \
> > typedef TYPE vec __attribute__((vector_size(WIDTH))); \
> >   \
> > static volatile vec zero; \
> > vec tmp = (zero-2) OP (COUNT);\
> > vec ref = INVERT zero;\
> > if (__builtin_memcmp(, , sizeof tmp)) \
> > __builtin_abort();\
> > }
> >
> > int main(void)
> > {
> > CHECK( uint8_t, 16, <<, 8,  )
> > CHECK( uint8_t, 16, <<, 31, )
> > CHECK( uint8_t, 16, >>, 8,  )
> > CHECK( uint8_t, 16, >>, 31, )
> > CHECK(  int8_t, 16, <<, 8,  )
> > CHECK(  int8_t, 16, <<, 31, )
> > CHECK(  int8_t, 16, >>, 8,  ~)
> > CHECK(  int8_t, 16, >>, 31, ~)
> > CHECK(uint16_t, 16, <<, 16, )
> > CHECK(uint16_t, 16, <<, 31, )
> > CHECK(uint16_t, 16, >>, 16, )
> > CHECK(uint16_t, 16, >>, 31, )
> > CHECK( int16_t, 16, <<, 16, )
> > CHECK( int16_t, 16, <<, 31, )
> > CHECK( int16_t, 16, >>, 16, ~)
> > CHECK( int16_t, 16, >>, 31, ~)
> > // Per-lane-variable shifts:
> > CHECK( uint8_t, 16, <<, zero+8,  )
> > CHECK( uint8_t, 16, <<, zero+31, )
> > CHECK( uint8_t, 16, >>, zero+8,  )
> > CHECK( uint8_t, 16, >>, zero+31, )
> > CHECK(  int8_t, 16, <<, zero+8,  )
> > CHECK(  int8_t, 16, <<, zero+31, )
> > CHECK(  int8_t, 16, >>, zero+8,  ~)
> > CHECK(  int8_t, 16, >>, zero+31, ~)
> > CHECK(uint16_t, 16, <<, zero+16, )
> > CHECK(uint16_t, 16, <<, zero+31, )
> > CHECK(uint16_t, 16, >>, zero+16, 

Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches


On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:

> I’d have to check the ISAs what they actually do here - it of course depends
> on RTL semantics as well but as you say those are not strictly defined here
> either.

Plus, we can add the following executable test to the testsuite:

#include 

#define CHECK(TYPE, WIDTH, OP, COUNT, INVERT) \
{ \
typedef TYPE vec __attribute__((vector_size(WIDTH))); \
  \
static volatile vec zero; \
vec tmp = (zero-2) OP (COUNT);\
vec ref = INVERT zero;\
if (__builtin_memcmp(, , sizeof tmp)) \
__builtin_abort();\
}

int main(void)
{
CHECK( uint8_t, 16, <<, 8,  )
CHECK( uint8_t, 16, <<, 31, )
CHECK( uint8_t, 16, >>, 8,  )
CHECK( uint8_t, 16, >>, 31, )
CHECK(  int8_t, 16, <<, 8,  )
CHECK(  int8_t, 16, <<, 31, )
CHECK(  int8_t, 16, >>, 8,  ~)
CHECK(  int8_t, 16, >>, 31, ~)
CHECK(uint16_t, 16, <<, 16, )
CHECK(uint16_t, 16, <<, 31, )
CHECK(uint16_t, 16, >>, 16, )
CHECK(uint16_t, 16, >>, 31, )
CHECK( int16_t, 16, <<, 16, )
CHECK( int16_t, 16, <<, 31, )
CHECK( int16_t, 16, >>, 16, ~)
CHECK( int16_t, 16, >>, 31, ~)
// Per-lane-variable shifts:
CHECK( uint8_t, 16, <<, zero+8,  )
CHECK( uint8_t, 16, <<, zero+31, )
CHECK( uint8_t, 16, >>, zero+8,  )
CHECK( uint8_t, 16, >>, zero+31, )
CHECK(  int8_t, 16, <<, zero+8,  )
CHECK(  int8_t, 16, <<, zero+31, )
CHECK(  int8_t, 16, >>, zero+8,  ~)
CHECK(  int8_t, 16, >>, zero+31, ~)
CHECK(uint16_t, 16, <<, zero+16, )
CHECK(uint16_t, 16, <<, zero+31, )
CHECK(uint16_t, 16, >>, zero+16, )
CHECK(uint16_t, 16, >>, zero+31, )
CHECK( int16_t, 16, <<, zero+16, )
CHECK( int16_t, 16, <<, zero+31, )
CHECK( int16_t, 16, >>, zero+16, ~)
CHECK( int16_t, 16, >>, zero+31, ~)

// Repeat for WIDTH=32 and WIDTH=64
}

Alexander


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches


On Wed, 24 May 2023, Richard Biener wrote:

> On Wed, May 24, 2023 at 2:54 PM Alexander Monakov via Gcc-patches
>  wrote:
> >
> > Explicitly say that bitwise shifts for narrow types work similar to
> > element-wise C shifts with integer promotions, which coincides with
> > OpenCL semantics.
> 
> Do we need to clarify that v << w with v being a vector of shorts
> still yields a vector of shorts and not a vector of ints?

I don't think so, but if necessary we could add "and the result was
truncated back to the base type":

When the base type is narrower than @code{int}, element-wise shifts
are performed as if operands underwent C integer promotions, and
the result was truncated back to the base type, like in OpenCL. 

> Btw, I don't see this promotion reflected in the IL.  For
> 
> typedef short v8hi __attribute__((vector_size(16)));
> 
> v8hi foo (v8hi a, v8hi b)
> {
>   return a << b;
> }
> 
> I get no masking of 'b' and vector lowering if the target doens't handle it
> yields
> 
>   short int _5;
>   short int _6;
> 
>   _5 = BIT_FIELD_REF ;
>   _6 = BIT_FIELD_REF ;
>   _7 = _5 << _6;
> 
> which we could derive ranges from for _6 (apparantly we don't yet).

Here it depends on how we define the GIMPLE-level semantics of bit-shift
operators for narrow types. To avoid changing lowering we could say that
shifting by up to 31 bits is well-defined for narrow types.

RTL-level semantics are also undocumented, unfortunately.

> Even
> 
> typedef int v8hi __attribute__((vector_size(16)));
> 
> v8hi x;
> int foo (v8hi a, v8hi b)
> {
>   x = a << b;
>   return (b[0] > 33);
> }
> 
> isn't optimized currently (but could - note I've used 'int' elements here).

Yeah. But let's constrain the optimizations first.

> So, I don't see us making sure the hardware does the right thing for
> out-of bound values.

I think in practice it worked out even if GCC did not pay attention to it,
because SIMD instructions had to facilitate autovectorization for C with
corresponding shift semantics.

Alexander

> 
> Richard.
> 
> > gcc/ChangeLog:
> >
> > * doc/extend.texi (Vector Extensions): Clarify bitwise shift
> > semantics.
> > ---
> >  gcc/doc/extend.texi | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> > index e426a2eb7d..6b4e94b6a1 100644
> > --- a/gcc/doc/extend.texi
> > +++ b/gcc/doc/extend.texi
> > @@ -12026,7 +12026,12 @@ elements in the operand.
> >  It is possible to use shifting operators @code{<<}, @code{>>} on
> >  integer-type vectors. The operation is defined as following: @code{@{a0,
> >  a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
> > -@dots{}, an >> bn@}}@. Vector operands must have the same number of
> > +@dots{}, an >> bn@}}@.  When the base type is narrower than @code{int},
> > +element-wise shifts are performed as if operands underwent C integer
> > +promotions, like in OpenCL.  This makes vector shifts by up to 31 bits
> > +well-defined for vectors with @code{char} and @code{short} base types.
> > +
> > +Operands of binary vector operations must have the same number of
> >  elements.
> >
> >  For convenience, it is allowed to use a binary vector operation
> > --
> > 2.39.2
> >
> 


[PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches
Explicitly say that bitwise shifts for narrow types work similar to
element-wise C shifts with integer promotions, which coincides with
OpenCL semantics.

gcc/ChangeLog:

* doc/extend.texi (Vector Extensions): Clarify bitwise shift
semantics.
---
 gcc/doc/extend.texi | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index e426a2eb7d..6b4e94b6a1 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -12026,7 +12026,12 @@ elements in the operand.
 It is possible to use shifting operators @code{<<}, @code{>>} on
 integer-type vectors. The operation is defined as following: @code{@{a0,
 a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
-@dots{}, an >> bn@}}@. Vector operands must have the same number of
+@dots{}, an >> bn@}}@.  When the base type is narrower than @code{int},
+element-wise shifts are performed as if operands underwent C integer
+promotions, like in OpenCL.  This makes vector shifts by up to 31 bits
+well-defined for vectors with @code{char} and @code{short} base types.
+
+Operands of binary vector operations must have the same number of
 elements. 
 
 For convenience, it is allowed to use a binary vector operation
-- 
2.39.2



Re: [PATCH] c-family: implement -ffp-contract=on

2023-05-23 Thread Alexander Monakov via Gcc-patches


On Tue, 23 May 2023, Richard Biener wrote:
> > Ah, no, I deliberately decided against that, because that way we would go
> > via gimplify_arg, which would emit all side effects in *pre_p. That seems
> > wrong if arguments had side-effects that should go in *post_p.
> 
> Ah, true - that warrants a comment though.

Incrementally fixed up in my tree like this:

diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
index f7635d3b0c..17b0610a89 100644
--- a/gcc/c-family/c-gimplify.cc
+++ b/gcc/c-family/c-gimplify.cc
@@ -803,6 +803,7 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p 
ATTRIBUTE_UNUSED,
else
  ops[2] = build1 (NEGATE_EXPR, type, ops[2]);
  }
+   /* Avoid gimplify_arg: it emits all side effects into *PRE_P.  */
for (auto & : ops)
  if (gimplify_expr (, pre_p, post_p, is_gimple_val, fb_rvalue)
  == GS_ERROR)

Alexander


Re: [PATCH] c-family: implement -ffp-contract=on

2023-05-22 Thread Alexander Monakov via Gcc-patches


On Mon, 22 May 2023, Richard Biener wrote:

> On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches
>  wrote:
> >
> > Implement -ffp-contract=on for C and C++ without changing default
> > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN).
> 
> The documentation changes mention the defaults are changed for
> standard modes, I suppose you want to remove that hunk.

No, the current documentation is incomplete, and that hunk extends it
to match the current GCC behavior. Should I break it out to a separate
patch? I see this drive-by fix could look confusing — sorry about that.

> it would be possible to do
> 
>   *expr_p = build_call_expr_internal (ifn, type, ops[0], ops[1]. ops[2]);
>   return GS_OK;
> 
> and not worry about temporary creation and gimplifying of the operands.
> That would in theory also leave the possibility to do this during
> genericization instead (and avoid the guard against late invocation of
> the hook).

Ah, no, I deliberately decided against that, because that way we would go
via gimplify_arg, which would emit all side effects in *pre_p. That seems
wrong if arguments had side-effects that should go in *post_p.

Thanks.
Alexander

> Otherwise it looks OK, but I'll let frontend maintainers have a chance to look
> as well.
> 
> Thanks for tackling this long-standing issue.
> Richard.


[PATCH] c-family: implement -ffp-contract=on

2023-05-18 Thread Alexander Monakov via Gcc-patches
Implement -ffp-contract=on for C and C++ without changing default
behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN).

gcc/c-family/ChangeLog:

* c-gimplify.cc (fma_supported_p): New helper.
(c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA
contraction.

gcc/ChangeLog:

* common.opt (fp_contract_mode) [on]: Remove fallback.
* config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test.
* doc/invoke.texi (-ffp-contract): Update.
* trans-mem.cc (diagnose_tm_1): Skip internal function calls.
---
 gcc/c-family/c-gimplify.cc | 78 ++
 gcc/common.opt |  3 +-
 gcc/config/sh/sh.md|  2 +-
 gcc/doc/invoke.texi|  8 ++--
 gcc/trans-mem.cc   |  3 ++
 5 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc
index ef5c7d919f..f7635d3b0c 100644
--- a/gcc/c-family/c-gimplify.cc
+++ b/gcc/c-family/c-gimplify.cc
@@ -41,6 +41,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "c-ubsan.h"
 #include "tree-nested.h"
 #include "context.h"
+#include "tree-pass.h"
+#include "internal-fn.h"
 
 /*  The gimplification pass converts the language-dependent trees
 (ld-trees) emitted by the parser into language-independent trees
@@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree body)
   return bind;
 }
 
+/* Helper for c_gimplify_expr: test if target supports fma-like FN.  */
+
+static bool
+fma_supported_p (enum internal_fn fn, tree type)
+{
+  return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH);
+}
+
 /* Gimplification of expression trees.  */
 
 /* Do C-specific gimplification on *EXPR_P.  PRE_P and POST_P are as in
@@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p 
ATTRIBUTE_UNUSED,
break;
   }
 
+case PLUS_EXPR:
+case MINUS_EXPR:
+  {
+   tree type = TREE_TYPE (*expr_p);
+   /* For -ffp-contract=on we need to attempt FMA contraction only
+  during initial gimplification.  Late contraction across statement
+  boundaries would violate language semantics.  */
+   if (SCALAR_FLOAT_TYPE_P (type)
+   && flag_fp_contract_mode == FP_CONTRACT_ON
+   && cfun && !(cfun->curr_properties & PROP_gimple_any)
+   && fma_supported_p (IFN_FMA, type))
+ {
+   bool neg_mul = false, neg_add = code == MINUS_EXPR;
+
+   tree *op0_p = _OPERAND (*expr_p, 0);
+   tree *op1_p = _OPERAND (*expr_p, 1);
+
+   /* Look for ±(x * y) ± z, swapping operands if necessary.  */
+   if (TREE_CODE (*op0_p) == NEGATE_EXPR
+   && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR)
+ /* '*EXPR_P' is '-(x * y) ± z'.  This is fine.  */;
+   else if (TREE_CODE (*op0_p) != MULT_EXPR)
+ {
+   std::swap (op0_p, op1_p);
+   std::swap (neg_mul, neg_add);
+ }
+   if (TREE_CODE (*op0_p) == NEGATE_EXPR)
+ {
+   op0_p = _OPERAND (*op0_p, 0);
+   neg_mul = !neg_mul;
+ }
+   if (TREE_CODE (*op0_p) != MULT_EXPR)
+ break;
+   auto_vec ops (3);
+   ops.quick_push (TREE_OPERAND (*op0_p, 0));
+   ops.quick_push (TREE_OPERAND (*op0_p, 1));
+   ops.quick_push (*op1_p);
+
+   enum internal_fn ifn = IFN_FMA;
+   if (neg_mul)
+ {
+   if (fma_supported_p (IFN_FNMA, type))
+ ifn = IFN_FNMA;
+   else
+ ops[0] = build1 (NEGATE_EXPR, type, ops[0]);
+ }
+   if (neg_add)
+ {
+   enum internal_fn ifn2 = ifn == IFN_FMA ? IFN_FMS : IFN_FNMS;
+   if (fma_supported_p (ifn2, type))
+ ifn = ifn2;
+   else
+ ops[2] = build1 (NEGATE_EXPR, type, ops[2]);
+ }
+   for (auto & : ops)
+ if (gimplify_expr (, pre_p, post_p, is_gimple_val, fb_rvalue)
+ == GS_ERROR)
+   return GS_ERROR;
+
+   gcall *call = gimple_build_call_internal_vec (ifn, ops);
+   gimple_seq_add_stmt_without_update (pre_p, call);
+   *expr_p = create_tmp_var (type);
+   gimple_call_set_lhs (call, *expr_p);
+   return GS_ALL_DONE;
+ }
+   break;
+  }
+
 default:;
 }
 
diff --git a/gcc/common.opt b/gcc/common.opt
index a28ca13385..3daec85aef 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1662,9 +1662,8 @@ Name(fp_contract_mode) Type(enum fp_contract_mode) 
UnknownError(unknown floating
 EnumValue
 Enum(fp_contract_mode) String(off) Value(FP_CONTRACT_OFF)
 
-; Not implemented, fall back to conservative FP_CONTRACT_OFF.
 EnumValue
-Enum(fp_contract_mode) String(on) Value(FP_CONTRACT_OFF)
+Enum(fp_contract_mode) String(on) Value(FP_CONTRACT_ON)
 
 EnumValue
 

[committed] tree-ssa-math-opts: correct -ffp-contract= check

2023-05-17 Thread Alexander Monakov via Gcc-patches
Since tree-ssa-math-opts may freely contract across statement boundaries
we should enable it only for -ffp-contract=fast instead of disabling it
for -ffp-contract=off.

No functional change, since -ffp-contract=on is not exposed yet.

gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma): Enable only for
FP_CONTRACT_FAST (no functional change).
---

Preapproved in PR 106092, pushed to trunk.

 gcc/tree-ssa-math-opts.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index b58a2ac9e6..d71c51dc0e 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -3320,7 +3320,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2,
   imm_use_iterator imm_iter;
 
   if (FLOAT_TYPE_P (type)
-  && flag_fp_contract_mode == FP_CONTRACT_OFF)
+  && flag_fp_contract_mode != FP_CONTRACT_FAST)
 return false;
 
   /* We don't want to do bitfield reduction ops.  */
-- 
2.39.2



Re: More C type errors by default for GCC 14

2023-05-16 Thread Alexander Monakov via Gcc


On Tue, 16 May 2023, Florian Weimer wrote:

> > (FWIW: no, this should not be an error, a warning is fine, and I actually 
> > think the current state of it not being in Wall is the right thing as 
> > well)

(this is mixed up, -Wpointer-sign is in fact enabled by -Wall)

> I don't understand why we do not warn by default and warn with -Wall.  I
> would expect this to be either a documented extension (no warning with
> -Wall), or a warning by default (because it's a conformance issue).  Is
> there any conformance issue that is treated in the same way?

Another one is -Wpointer-arith (pointer arithmetic on 'void *').

Alexander


Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)

2023-05-14 Thread Alexander Monakov via Gcc-patches


On Sun, 14 May 2023, Andrew Pinski wrote:

> It is NOT a dummy iterator. SIGNBIT is a operator list that expands to
> "BUILT_IN_SIGNBITF  BUILT_IN_SIGNBIT BUILT_IN_SIGNBITL IFN_SIGNBIT".

Ah, it's in cfn-operators.pd in the build tree, not the source tree.

> > On the other hand, the following clauses both use SIGNBIT directly, and
> > it would be nice to be consistent.
> 
> You cannot use the operator list directly if you have a for loop
> expansion too. So it is internally consistent already.

I see. Wasn't aware of the limitation.

Thanks.
Alexander


Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)

2023-05-14 Thread Alexander Monakov via Gcc-patches


On Sun, 14 May 2023, Alexander Monakov wrote:

> On Sat, 13 May 2023, Andrew Pinski via Gcc-patches wrote:
> 
> > +/* signbit(x) != 0 ? -x : x -> abs(x)
> > +   signbit(x) == 0 ? -x : x -> -abs(x) */
> > +(for sign (SIGNBIT)
> 
> Surprised to see a dummy iterator here. Was this meant to include
> float and long double versions of the builtin too (SIGNBITF and SIGNBITL)?

On the other hand, the following clauses both use SIGNBIT directly, and
it would be nice to be consistent.

> > + (for neeq (ne eq)
> > +  (simplify
> > +   (cond (neeq (sign @0) integer_zerop) (negate @0) @0)
> > +(if (neeq == NE_EXPR)
> > + (abs @0)
> > + (negate (abs @0))
> > +
> >  (simplify
> >   /* signbit(x) -> 0 if x is nonnegative.  */
> >   (SIGNBIT tree_expr_nonnegative_p@0)
> 
> Thanks.
> Alexander
> 


Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)

2023-05-14 Thread Alexander Monakov via Gcc-patches


On Sat, 13 May 2023, Andrew Pinski via Gcc-patches wrote:

> +/* signbit(x) != 0 ? -x : x -> abs(x)
> +   signbit(x) == 0 ? -x : x -> -abs(x) */
> +(for sign (SIGNBIT)

Surprised to see a dummy iterator here. Was this meant to include
float and long double versions of the builtin too (SIGNBITF and SIGNBITL)?

> + (for neeq (ne eq)
> +  (simplify
> +   (cond (neeq (sign @0) integer_zerop) (negate @0) @0)
> +(if (neeq == NE_EXPR)
> + (abs @0)
> + (negate (abs @0))
> +
>  (simplify
>   /* signbit(x) -> 0 if x is nonnegative.  */
>   (SIGNBIT tree_expr_nonnegative_p@0)

Thanks.
Alexander


Re: More C type errors by default for GCC 14

2023-05-12 Thread Alexander Monakov via Gcc


On Fri, 12 May 2023, Gabriel Ravier via Gcc wrote:

> On 5/12/23 19:52, Florian Weimer wrote:
> > I think we have another problem.
> >
> > We do not warn by default for:
> >
> >int x;
> >unsigned *p;
> >
> >p = 
> >
> > Isn't that a conformance issue because the pointers are incompatible,
> > requiring a diagnostic?
> >
> > Furthermore, Unlike the char case, this tends to introduce
> > strict-aliasing violations, so there is a good reason to treat this
> > variant as an error (even if we would only warn for char * and
> > unsigned char *).
> 
> Isn't this allowed by the standard ? 6.5.7. Expressions states:
> > An object shall have its stored value accessed only by an lvalue 
> expression that has one of thefollowing types:[...] - a type that is the
> signed or unsigned type corresponding to the effective type of the object

The standard allows aliasing, but not assigning pointers without a cast.

This is valid:

  unsigned x;

  int *p = (void *)

  *p = 0;

This is not valid (constraint violation):

  unsigned x;

  int *p = 

In GCC this is diagnosed under -Wpointer-sign:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25892

Alexander


[PATCH 1/3] genmatch: clean up emit_func

2023-05-08 Thread Alexander Monakov via Gcc-patches
Eliminate boolean parameters of emit_func. The first ('open') just
prints 'extern' to generated header, which is unnecessary. Introduce a
separate function to use when finishing a declaration in place of the
second ('close').

Rename emit_func to 'fp_decl' (matching 'fprintf' in length) to unbreak
indentation in several places.

Reshuffle emitted line breaks in a few places to make generated
declarations less ugly.

gcc/ChangeLog:

* genmatch.cc (header_file): Make static.
(emit_func): Rename to...
(fp_decl): ... this.  Adjust all uses.
(fp_decl_done): New function.  Use it...
(decision_tree::gen): ... here and...
(write_predicate): ... here.
(main): Adjust.
---
 gcc/genmatch.cc | 97 ++---
 1 file changed, 52 insertions(+), 45 deletions(-)

diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
index c593814871..d5e56e2d68 100644
--- a/gcc/genmatch.cc
+++ b/gcc/genmatch.cc
@@ -183,31 +183,37 @@ fprintf_indent (FILE *f, unsigned int indent, const char 
*format, ...)
   va_end (ap);
 }
 
-/* Like fprintf, but print to two files, one header one C implementation.  */
-FILE *header_file = NULL;
+/* Secondary stream for fp_decl.  */
+static FILE *header_file;
 
+/* Start or continue emitting a declaration in fprintf-like manner,
+   printing both to F and global header_file, if non-null.  */
 static void
 #if GCC_VERSION >= 4001
-__attribute__((format (printf, 4, 5)))
+__attribute__((format (printf, 2, 3)))
 #endif
-emit_func (FILE *f, bool open, bool close, const char *format, ...)
+fp_decl (FILE *f, const char *format, ...)
 {
-  va_list ap1, ap2;
-  if (header_file != NULL)
-{
-  if (open)
-   fprintf (header_file, "extern ");
-  va_start (ap2, format);
-  vfprintf (header_file, format, ap2);
-  va_end (ap2);
-  if (close)
-   fprintf (header_file, ";\n");
-}
+  va_list ap;
+  va_start (ap, format);
+  vfprintf (f, format, ap);
+  va_end (ap);
 
-  va_start (ap1, format);
-  vfprintf (f, format, ap1);
-  va_end (ap1);
-  fputc ('\n', f);
+  if (!header_file)
+return;
+
+  va_start (ap, format);
+  vfprintf (header_file, format, ap);
+  va_end (ap);
+}
+
+/* Finish a declaration being emitted by fp_decl.  */
+static void
+fp_decl_done (FILE *f, const char *trailer)
+{
+  fprintf (f, "%s\n", trailer);
+  if (header_file)
+fprintf (header_file, "%s;", trailer);
 }
 
 static void
@@ -3924,35 +3930,35 @@ decision_tree::gen (vec  , bool gimple)
   s->fname = xasprintf ("%s_simplify_%u", gimple ? "gimple" : "generic",
fcnt++);
   if (gimple)
-   emit_func (f, true, false, "\nbool\n"
+   fp_decl (f, "\nbool\n"
 "%s (gimple_match_op *res_op, gimple_seq *seq,\n"
 " tree (*valueize)(tree) ATTRIBUTE_UNUSED,\n"
 " const tree ARG_UNUSED (type), tree 
*ARG_UNUSED "
-"(captures)\n",
+"(captures)",
 s->fname);
   else
{
- emit_func (f, true, false, "\ntree\n"
+ fp_decl (f, "\ntree\n"
   "%s (location_t ARG_UNUSED (loc), const tree ARG_UNUSED 
(type),\n",
   (*iter).second->fname);
  for (unsigned i = 0;
   i < as_a (s->s->s->match)->ops.length (); ++i)
-   emit_func (f, false, false, " tree ARG_UNUSED (_p%d),", i);
- emit_func (f, false, false, " tree *captures\n");
+   fp_decl (f, " tree ARG_UNUSED (_p%d),", i);
+ fp_decl (f, " tree *captures");
}
   for (unsigned i = 0; i < s->s->s->for_subst_vec.length (); ++i)
{
  if (! s->s->s->for_subst_vec[i].first->used)
continue;
  if (is_a  (s->s->s->for_subst_vec[i].second))
-   emit_func (f, false, false, ", const enum tree_code ARG_UNUSED 
(%s)",
+   fp_decl (f, ",\n const enum tree_code ARG_UNUSED (%s)",
 s->s->s->for_subst_vec[i].first->id);
  else if (is_a  (s->s->s->for_subst_vec[i].second))
-   emit_func (f, false, false, ", const combined_fn ARG_UNUSED (%s)",
+   fp_decl (f, ",\n const combined_fn ARG_UNUSED (%s)",
 s->s->s->for_subst_vec[i].first->id);
}
 
-  emit_func (f, false, true, ")");
+  fp_decl_done (f, ")");
   fprintf (f, "{\n");
   fprintf_indent (f, 2, "const bool debug_dump = "
"dump_file && (dump_flags & TDF_FOLDING);\n");
@@ -3988,22 +3994,22 @@ decision_tree::gen (vec  , bool gimple)
  FILE *f = get_out_file (files);
 
  if (gimple)
-   emit_func (f, true, false,"\nbool\n"
+   fp_decl (f, "\nbool\n"
 "gimple_simplify_%s (gimple_match_op *res_op,"
 " gimple_seq *seq,\n"
 " tree (*valueize)(tree) "
 "ATTRIBUTE_UNUSED,\n"
 "  

[PATCH 3/3] genmatch: fixup get_out_file

2023-05-08 Thread Alexander Monakov via Gcc-patches
get_out_file did not follow the coding conventions (mixing three-space
and two-space indentation, missing linebreak before function name).

Take that as an excuse to reimplement it in a more terse manner and
rename as 'choose_output', which is hopefully more descriptive.

gcc/ChangeLog:

* genmatch.cc (get_out_file): Make static and rename to ...
(choose_output): ... this. Reimplement. Update all uses ...
(decision_tree::gen): ... here and ...
(main): ... here.
---
 gcc/genmatch.cc | 41 +
 1 file changed, 17 insertions(+), 24 deletions(-)

diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
index baf93855a6..177c13d87c 100644
--- a/gcc/genmatch.cc
+++ b/gcc/genmatch.cc
@@ -255,28 +255,21 @@ output_line_directive (FILE *f, location_t location,
 
 #define SIZED_BASED_CHUNKS 1
 
-int current_file = 0;
-FILE *get_out_file (vec  )
+static FILE *
+choose_output (const vec )
 {
 #ifdef SIZED_BASED_CHUNKS
-   if (parts.length () == 1)
- return parts[0];
-
-   FILE *f = NULL;
-   long min = 0;
-   /* We've started writing all the files at pos 0, so ftell is equivalent
-  to the size and should be much faster.  */
-   for (unsigned i = 0; i < parts.length (); i++)
- {
-   long res = ftell (parts[i]);
-   if (!f || res < min)
- {
-   min = res;
-   f = parts[i];
- }
- }
-  return f;
+  FILE *shortest = NULL;
+  long min = 0;
+  for (FILE *part : parts)
+{
+  long len = ftell (part);
+  if (!shortest || min > len)
+   shortest = part, min = len;
+}
+  return shortest;
 #else
+  static int current_file;
   return parts[current_file++ % parts.length ()];
 #endif
 }
@@ -3924,7 +3917,7 @@ decision_tree::gen (vec  , bool gimple)
}
 
   /* Cycle the file buffers.  */
-  FILE *f = get_out_file (files);
+  FILE *f = choose_output (files);
 
   /* Generate a split out function with the leaf transform code.  */
   s->fname = xasprintf ("%s_simplify_%u", gimple ? "gimple" : "generic",
@@ -3991,7 +3984,7 @@ decision_tree::gen (vec  , bool gimple)
 
 
  /* Cycle the file buffers.  */
- FILE *f = get_out_file (files);
+ FILE *f = choose_output (files);
 
  if (gimple)
fp_decl (f, "\nbool\n"
@@ -4028,7 +4021,7 @@ decision_tree::gen (vec  , bool gimple)
{
 
  /* Cycle the file buffers.  */
- FILE *f = get_out_file (files);
+ FILE *f = choose_output (files);
 
  if (gimple)
fp_decl (f, "\nbool\n"
@@ -4053,7 +4046,7 @@ decision_tree::gen (vec  , bool gimple)
 
 
   /* Cycle the file buffers.  */
-  FILE *f = get_out_file (files);
+  FILE *f = choose_output (files);
 
   /* Then generate the main entry with the outermost switch and
  tail-calls to the split-out functions.  */
@@ -5461,7 +5454,7 @@ main (int argc, char **argv)
dt.print (stderr);
 
   /* Cycle the file buffers.  */
-  FILE *f = get_out_file (parts);
+  FILE *f = choose_output (parts);
 
   write_predicate (f, pred, dt, gimple);
 }
-- 
2.39.2



[PATCH 2/3] genmatch: clean up showUsage

2023-05-08 Thread Alexander Monakov via Gcc-patches
Display usage more consistently and get rid of camelCase.

gcc/ChangeLog:

* genmatch.cc (showUsage): Reimplement as ...
(usage): ...this.  Adjust all uses.
(main): Print usage when no arguments.  Add missing 'return 1'.
---
 gcc/genmatch.cc | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
index d5e56e2d68..baf93855a6 100644
--- a/gcc/genmatch.cc
+++ b/gcc/genmatch.cc
@@ -5301,13 +5301,12 @@ round_alloc_size (size_t s)
 /* Construct and display the help menu.  */
 
 static void
-showUsage ()
+usage ()
 {
-  fprintf (stderr, "Usage: genmatch [--gimple] [--generic] "
-  "[--header=] [--include=] [-v[v]] input "
-  "[...]\n");
-  fprintf (stderr, "\nWhen more then one outputfile is specified --header "
-  "is required.\n");
+  const char *usage = "Usage:\n"
+" %s [--gimple|--generic] [-v[v]] \n"
+" %s [options] [--include=FILE] --header=FILE  ...\n";
+  fprintf (stderr, usage, progname, progname);
 }
 
 /* Write out the correct include to the match-head fle containing the helper
@@ -5332,9 +5331,6 @@ main (int argc, char **argv)
 
   progname = "genmatch";
 
-  if (argc < 2)
-return 1;
-
   bool gimple = true;
   char *s_header_file = NULL;
   char *s_include_file = NULL;
@@ -5359,14 +5355,17 @@ main (int argc, char **argv)
files.safe_push (argv[i]);
   else
{
- showUsage ();
+ usage ();
  return 1;
}
 }
 
   /* Validate if the combinations are valid.  */
   if ((files.length () > 1 && !s_header_file) || files.is_empty ())
-showUsage ();
+{
+  usage ();
+  return 1;
+}
 
   if (!s_include_file)
 s_include_file = s_header_file;
-- 
2.39.2



[PATCH 0/3] Trivial cleanups for genmatch

2023-05-08 Thread Alexander Monakov via Gcc-patches
I'm trying to study match.pd/genmatch with the eventual goal of
improving match-and-simplify code generation. Here's some trivial
cleanups for the recent refactoring in the meantime.

Alexander Monakov (3):
  genmatch: clean up emit_func
  genmatch: clean up showUsage
  genmatch: fixup get_out_file

 gcc/genmatch.cc | 159 
 1 file changed, 79 insertions(+), 80 deletions(-)

-- 
2.39.2



RE: [PATCH] Makefile.in: clean up match.pd-related dependencies

2023-05-08 Thread Alexander Monakov via Gcc-patches
On Fri, 5 May 2023, Alexander Monakov wrote:

> > > gimple-head-export.cc does not exist.
> > > 
> > > gimple-match-exports.cc is not a generated file. It's under source 
> > > control and
> > > edited independently from genmatch.cc. It is compiled separately, 
> > > producing
> > > gimple-match-exports.o.
> > > 
> > > gimple-match-head.cc is also not a generated file, also under source 
> > > control.
> > > It is transitively included into gimple-match-N.o files. If it changes, 
> > > they will be
> > > rebuilt. This is not changed by my patch.
> > > 
> > > gimple-match-auto.h is a generated file. It depends on s-gimple-match 
> > > stamp
> > > file, which in turn depends on genmatch and match.pd. If either changes, 
> > > the
> > > rule for the stamp file triggers. gimple-match-N.o files also depend on 
> > > the
> > > stamp file, so they will be rebuilt as well.
> > 
> > s-gimple-match does not depend on gimple-match-head.cc. if it changes the 
> > stamp
> > is not invalidated. 
> 
> Right, this is correct: there's no need to rerun the recipe for the stamp,
> because contents of gimple-match-head.cc do not affect it.
> 
> > This happens to work because gimple-match-N.cc does depend on 
> > gimple-match-head.cc,
> > but if the gimple-match-N.cc already exists then nothing changes.
> 
> No, if gimple-match-N.cc already exist, make notices they are out-of-date via
> 
> $(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc; @true
> 
> and this triggers rebuilding gimple-match-N.o.
> 
> I tested this. After 'touch gimple-match-head.cc' all ten gimple-match-N.o 
> files
> are rebuilt.

My explanation was incomplete here. The gcc/Makefile.in rule quoted above
applies to .cc files and does not trigger rebuilds of .o files on its own.
The reason .o files get rebuilt is implicit dependency tracking: initial
build records header dependencies in gcc/.deps/*.Po files, and incremental
rebuild sees that gimple-match-1.o depends on gimple-match-head.cc.

Alexander


RE: [PATCH] Makefile.in: clean up match.pd-related dependencies

2023-05-05 Thread Alexander Monakov via Gcc-patches


On Fri, 5 May 2023, Tamar Christina wrote:

> > -Original Message-
> > From: Alexander Monakov 
> > Sent: Friday, May 5, 2023 6:59 PM
> > To: Tamar Christina 
> > Cc: Richard Biener ; gcc-patches@gcc.gnu.org
> > Subject: RE: [PATCH] Makefile.in: clean up match.pd-related dependencies
> > 
> > 
> > On Fri, 5 May 2023, Tamar Christina wrote:
> > 
> > > > > Am 05.05.2023 um 19:03 schrieb Alexander Monakov via Gcc-patches
> > > > >  > > > patc...@gcc.gnu.org>:
> > > > >
> > > > > Clean up confusing changes from the recent refactoring for
> > > > > parallel match.pd build.
> > > > >
> > > > > gimple-match-head.o is not built. Remove related flags adjustment.
> > > > >
> > > > > Autogenerated gimple-match-N.o files do not depend on
> > > > > gimple-match-exports.cc.
> > > > >
> > > > > {gimple,generic)-match-auto.h only depend on the prerequisites of
> > > > > the corresponding s-{gimple,generic}-match stamp file, not any .cc 
> > > > > file.
> > > >
> > > > LGTM
> > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >* Makefile.in: (gimple-match-head.o-warn): Remove.
> > > > >(GIMPLE_MATCH_PD_SEQ_SRC): Do not depend on
> > > > >gimple-match-exports.cc.
> > > > >(gimple-match-auto.h): Only depend on s-gimple-match.
> > > > >(generic-match-auto.h): Likewise.
> > > > > ---
> > > > >
> > > > > Tamar, do I understand correctly that you do not have more plans
> > > > > for match.pd and I won't collide with you if I attempt more
> > > > > cleanups in this
> > > > area? Thanks!
> > >
> > > No, but I'm also not sure why this change.
> > > The idea here was that if gimple-head-export.cc changes you must have
> > > changed genmatch.cc and so you need to regenerate the gimple-match-*
> > which could change the header.
> > 
> > gimple-head-export.cc does not exist.
> > 
> > gimple-match-exports.cc is not a generated file. It's under source control 
> > and
> > edited independently from genmatch.cc. It is compiled separately, producing
> > gimple-match-exports.o.
> > 
> > gimple-match-head.cc is also not a generated file, also under source 
> > control.
> > It is transitively included into gimple-match-N.o files. If it changes, 
> > they will be
> > rebuilt. This is not changed by my patch.
> > 
> > gimple-match-auto.h is a generated file. It depends on s-gimple-match stamp
> > file, which in turn depends on genmatch and match.pd. If either changes, the
> > rule for the stamp file triggers. gimple-match-N.o files also depend on the
> > stamp file, so they will be rebuilt as well.
> 
> s-gimple-match does not depend on gimple-match-head.cc. if it changes the 
> stamp
> is not invalidated. 

Right, this is correct: there's no need to rerun the recipe for the stamp,
because contents of gimple-match-head.cc do not affect it.

> This happens to work because gimple-match-N.cc does depend on 
> gimple-match-head.cc,
> but if the gimple-match-N.cc already exists then nothing changes.

No, if gimple-match-N.cc already exist, make notices they are out-of-date via

$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc; @true

and this triggers rebuilding gimple-match-N.o.

I tested this. After 'touch gimple-match-head.cc' all ten gimple-match-N.o files
are rebuilt.

> So I don't think this changes anything. If anything I would say the stamp 
> file needs to
> depend on gimple-match-head.cc. 

Is my explanation above satisfactory?

Thanks.
Alexander

> 
> Thanks,
> Tamar
> 
> > 
> > Is there some problem I'm not seeing?
> > 
> > Thanks.
> > Alexander
> > 
> > > So not sure I agree with this.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > > >
> > > > > gcc/Makefile.in | 9 +++--
> > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in index
> > > > > 7e7ac078c5..0cc13c37d0 100644
> > > > > --- a/gcc/Makefile.in
> > > > > +++ b/gcc/Makefile.in
> > > > > @@ -230,7 +230,6 @@ gengtype-lex.o-warn = -Wno-error
> > > > > libgcov-util.o-warn = -Wno-error libgcov-driver-tool.o-warn =
> > > > > -Wno-error libgco

RE: [PATCH] Makefile.in: clean up match.pd-related dependencies

2023-05-05 Thread Alexander Monakov via Gcc-patches


On Fri, 5 May 2023, Tamar Christina wrote:

> > > Am 05.05.2023 um 19:03 schrieb Alexander Monakov via Gcc-patches  > patc...@gcc.gnu.org>:
> > >
> > > Clean up confusing changes from the recent refactoring for parallel
> > > match.pd build.
> > >
> > > gimple-match-head.o is not built. Remove related flags adjustment.
> > >
> > > Autogenerated gimple-match-N.o files do not depend on
> > > gimple-match-exports.cc.
> > >
> > > {gimple,generic)-match-auto.h only depend on the prerequisites of the
> > > corresponding s-{gimple,generic}-match stamp file, not any .cc file.
> > 
> > LGTM
> > 
> > > gcc/ChangeLog:
> > >
> > >* Makefile.in: (gimple-match-head.o-warn): Remove.
> > >(GIMPLE_MATCH_PD_SEQ_SRC): Do not depend on
> > >gimple-match-exports.cc.
> > >(gimple-match-auto.h): Only depend on s-gimple-match.
> > >(generic-match-auto.h): Likewise.
> > > ---
> > >
> > > Tamar, do I understand correctly that you do not have more plans for
> > > match.pd and I won't collide with you if I attempt more cleanups in this
> > area? Thanks!
> 
> No, but I'm also not sure why this change.
> The idea here was that if gimple-head-export.cc changes you must have changed
> genmatch.cc and so you need to regenerate the gimple-match-* which could 
> change the header.

gimple-head-export.cc does not exist.

gimple-match-exports.cc is not a generated file. It's under source control and
edited independently from genmatch.cc. It is compiled separately, producing
gimple-match-exports.o.

gimple-match-head.cc is also not a generated file, also under source control.
It is transitively included into gimple-match-N.o files. If it changes, they
will be rebuilt. This is not changed by my patch.

gimple-match-auto.h is a generated file. It depends on s-gimple-match stamp
file, which in turn depends on genmatch and match.pd. If either changes, the
rule for the stamp file triggers. gimple-match-N.o files also depend on the
stamp file, so they will be rebuilt as well.

Is there some problem I'm not seeing?

Thanks.
Alexander

> So not sure I agree with this.
> 
> Thanks,
> Tamar
> 
> > >
> > > gcc/Makefile.in | 9 +++--
> > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in index
> > > 7e7ac078c5..0cc13c37d0 100644
> > > --- a/gcc/Makefile.in
> > > +++ b/gcc/Makefile.in
> > > @@ -230,7 +230,6 @@ gengtype-lex.o-warn = -Wno-error
> > > libgcov-util.o-warn = -Wno-error libgcov-driver-tool.o-warn =
> > > -Wno-error libgcov-merge-tool.o-warn = -Wno-error
> > > -gimple-match-head.o-warn = -Wno-unused gimple-match-exports.o-warn
> > =
> > > -Wno-unused dfp.o-warn = -Wno-strict-aliasing
> > >
> > > @@ -2674,12 +2673,10 @@ s-tm-texi: build/genhooks$(build_exeext)
> > $(srcdir)/doc/tm.texi.in
> > >  false; \
> > >fi
> > >
> > > -$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc \
> > > -gimple-match-exports.cc; @true
> > > -gimple-match-auto.h: s-gimple-match gimple-match-head.cc \
> > > -gimple-match-exports.cc; @true
> > > +$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc;
> > > +@true
> > > +gimple-match-auto.h: s-gimple-match; @true
> > > $(GENERIC_MATCH_PD_SEQ_SRC): s-generic-match generic-match-head.cc;
> > > @true
> > > -generic-match-auto.h: s-generic-match generic-match-head.cc; @true
> > > +generic-match-auto.h: s-generic-match; @true
> > >
> > > s-gimple-match: build/genmatch$(build_exeext) \
> > >$(srcdir)/match.pd cfn-operators.pd
> > > --
> > > 2.39.2
> > >
> 


[PATCH] Makefile.in: clean up match.pd-related dependencies

2023-05-05 Thread Alexander Monakov via Gcc-patches
Clean up confusing changes from the recent refactoring for
parallel match.pd build.

gimple-match-head.o is not built. Remove related flags adjustment.

Autogenerated gimple-match-N.o files do not depend on
gimple-match-exports.cc.

{gimple,generic)-match-auto.h only depend on the prerequisites of the
corresponding s-{gimple,generic}-match stamp file, not any .cc file.

gcc/ChangeLog:

* Makefile.in: (gimple-match-head.o-warn): Remove.
(GIMPLE_MATCH_PD_SEQ_SRC): Do not depend on
gimple-match-exports.cc.
(gimple-match-auto.h): Only depend on s-gimple-match.
(generic-match-auto.h): Likewise.
---

Tamar, do I understand correctly that you do not have more plans for match.pd
and I won't collide with you if I attempt more cleanups in this area? Thanks!

 gcc/Makefile.in | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 7e7ac078c5..0cc13c37d0 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -230,7 +230,6 @@ gengtype-lex.o-warn = -Wno-error
 libgcov-util.o-warn = -Wno-error
 libgcov-driver-tool.o-warn = -Wno-error
 libgcov-merge-tool.o-warn = -Wno-error
-gimple-match-head.o-warn = -Wno-unused
 gimple-match-exports.o-warn = -Wno-unused
 dfp.o-warn = -Wno-strict-aliasing
 
@@ -2674,12 +2673,10 @@ s-tm-texi: build/genhooks$(build_exeext) 
$(srcdir)/doc/tm.texi.in
  false; \
fi
 
-$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc \
-   gimple-match-exports.cc; @true
-gimple-match-auto.h: s-gimple-match gimple-match-head.cc \
-   gimple-match-exports.cc; @true
+$(GIMPLE_MATCH_PD_SEQ_SRC): s-gimple-match gimple-match-head.cc; @true
+gimple-match-auto.h: s-gimple-match; @true
 $(GENERIC_MATCH_PD_SEQ_SRC): s-generic-match generic-match-head.cc; @true
-generic-match-auto.h: s-generic-match generic-match-head.cc; @true
+generic-match-auto.h: s-generic-match; @true
 
 s-gimple-match: build/genmatch$(build_exeext) \
$(srcdir)/match.pd cfn-operators.pd
-- 
2.39.2



[PATCH] do not tailcall __sanitizer_cov_trace_pc [PR90746]

2023-05-02 Thread Alexander Monakov via Gcc-patches
When instrumentation is requested via -fsanitize-coverage=trace-pc, GCC
emits calls to __sanitizer_cov_trace_pc callback into each basic block.
This callback is supposed to be implemented by the user, and should be
able to identify the containing basic block by inspecting its return
address. Tailcalling the callback prevents that, so disallow it.

gcc/ChangeLog:

PR sanitizer/90746
* calls.cc (can_implement_as_sibling_call_p): Reject calls
to __sanitizer_cov_trace_pc.

gcc/testsuite/ChangeLog:

PR sanitizer/90746
* gcc.dg/sancov/basic0.c: Verify absence of tailcall.
---
 gcc/calls.cc | 10 ++
 gcc/testsuite/gcc.dg/sancov/basic0.c |  4 +++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/gcc/calls.cc b/gcc/calls.cc
index 4d7f6c3d2..c6ed2f189 100644
--- a/gcc/calls.cc
+++ b/gcc/calls.cc
@@ -2541,6 +2541,16 @@ can_implement_as_sibling_call_p (tree exp,
   return false;
 }
 
+  /* __sanitizer_cov_trace_pc is supposed to inspect its return address
+ to identify the caller, and therefore should not be tailcalled.  */
+  if (fndecl && DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL
+  && DECL_FUNCTION_CODE (fndecl) == BUILT_IN_SANITIZER_COV_TRACE_PC)
+{
+  /* No need for maybe_complain_about_tail_call here: the call
+ is synthesized by the compiler.  */
+  return false;
+}
+
   /* If the called function is nested in the current one, it might access
  some of the caller's arguments, but could clobber them beforehand if
  the argument areas are shared.  */
diff --git a/gcc/testsuite/gcc.dg/sancov/basic0.c 
b/gcc/testsuite/gcc.dg/sancov/basic0.c
index af69b2d12..dfdaea848 100644
--- a/gcc/testsuite/gcc.dg/sancov/basic0.c
+++ b/gcc/testsuite/gcc.dg/sancov/basic0.c
@@ -1,9 +1,11 @@
 /* Basic test on number of inserted callbacks.  */
 /* { dg-do compile } */
-/* { dg-options "-fsanitize-coverage=trace-pc -fdump-tree-optimized" } */
+/* { dg-options "-fsanitize-coverage=trace-pc -fdump-tree-optimized 
-fdump-rtl-expand" } */
 
 void foo(void)
 {
 }
 
 /* { dg-final { scan-tree-dump-times "__builtin___sanitizer_cov_trace_pc 
\\(\\)" 1 "optimized" } } */
+/* The built-in should not be tail-called: */
+/* { dg-final { scan-rtl-dump-not "call_insn/j" "expand" } } */
-- 
2.39.2



[PATCH] haifa-sched: fix autopref_rank_for_schedule comparator [PR109187]

2023-03-28 Thread Alexander Monakov via Gcc-patches
Do not attempt to use a plain subtraction for generating a three-way
comparison result in autopref_rank_for_schedule qsort comparator, as
offsets are not restricted and subtraction may overflow.  Open-code
a safe three-way comparison instead.

gcc/ChangeLog:

PR rtl-optimization/109187
* haifa-sched.cc (autopref_rank_for_schedule): Avoid use of overflowing
subtraction in three-way comparison.

gcc/testsuite/ChangeLog:

PR rtl-optimization/109187
* gcc.dg/pr109187.c: New test.
---

I think I can commit this as obvious if no comment in a day, but explicit ack
is always appreciated.

Alexander

 gcc/haifa-sched.cc  | 2 +-
 gcc/testsuite/gcc.dg/pr109187.c | 8 
 2 files changed, 9 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr109187.c

diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
index 4efaa9445..e11cc5c35 100644
--- a/gcc/haifa-sched.cc
+++ b/gcc/haifa-sched.cc
@@ -5686,7 +5686,7 @@ autopref_rank_for_schedule (const rtx_insn *insn1, const 
rtx_insn *insn2)
 
   if (!irrel1 && !irrel2)
/* Sort memory references from lowest offset to the largest.  */
-   r = data1->offset - data2->offset;
+   r = (data1->offset > data2->offset) - (data1->offset < data2->offset);
   else if (write)
/* Schedule "irrelevant" insns before memory stores to resolve
   as many producer dependencies of stores as possible.  */
diff --git a/gcc/testsuite/gcc.dg/pr109187.c b/gcc/testsuite/gcc.dg/pr109187.c
new file mode 100644
index 0..1ef14a73d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr109187.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 --param sched-autopref-queue-depth=1" } */
+
+void f(int *a)
+{
+  for (;;)
+asm("" :: "r"(a[-0x1000]), "r"(a[0x1000]), "r"(a[0]) : "memory");
+}
-- 
2.39.1



Re: Should -ffp-contract=off the default on GCC?

2023-03-22 Thread Alexander Monakov via Gcc-patches


On Mon, 20 Mar 2023, Jakub Jelinek via Gcc-patches wrote:

> On Mon, Mar 20, 2023 at 10:05:57PM +, Qing Zhao via Gcc-patches wrote:
> > My question: is the above section the place in C standard “explicitly 
> > allows contractions”? If not, where it is in C standard?
> 
> http://port70.net/%7Ensz/c/c99/n1256.html#6.5p8
> http://port70.net/%7Ensz/c/c99/n1256.html#note78
> http://port70.net/%7Ensz/c/c99/n1256.html#F.6

C only allows contractions within expressions, not across statements (i.e.
either -ffp-contract=on or -ffp-contract=off would be compliant, but not
our default -ffp-contract=fast).

Unrestricted contraction across statements together with other optimizations
gives rise to difficult-to-debug issues such as PR 106902.

Alexander


Re: Should -ffp-contract=off the default on GCC?

2023-03-22 Thread Alexander Monakov via Gcc-patches


On Wed, 22 Mar 2023, Richard Biener wrote:

> I think it's even less realistic to expect users to know the details of
> floating-point math.  So I doubt any such sentence will be helpful
> besides spreading some FUD?

I think it's closer to "fundamental notions" rather than "details". For
users who bother to read the GCC manual there's a decent chance it wouldn't
be for naught.

For documentation, I was thinking

  Together with -fexcess-precision=standard, -ffp-contract=off
  is necessary to ensure that rounding of intermediate results to precision
  implied by the source code and the FLT_EVAL_METHOD macro is not
  omitted by the compiler.

Alexander


Re: Should -ffp-contract=off the default on GCC?

2023-03-21 Thread Alexander Monakov via Gcc-patches


On Tue, 21 Mar 2023, Jeff Law via Gcc-patches wrote:

> On 3/21/23 11:00, Qing Zhao via Gcc-patches wrote:
> > 
> >> On Mar 21, 2023, at 12:56 PM, Paul Koning  wrote:
> >>
> >>> On Mar 21, 2023, at 11:01 AM, Qing Zhao via Gcc-patches
> >>>  wrote:
> >>>
> >>> ...
> >>> Most of the compiler users are not familiar with language standards, or no
> >>> access to language standards. Without clearly documenting such warnings
> >>> along with the option explicitly, the users have not way to know such
> >>> potential impact.
> >>
> >> With modern highly optimized languages, not knowing the standard is going
> >> to get you in trouble.  There was a wonderful paper from MIT a few years
> >> ago describing all the many ways C can bite you if you don't know the
> >> rules.
> > 
> > Yes, it’s better to know the details of languages standard. -:)
> > However, I don’t think that this is a realistic expectation to the compiler
> > users:  to know all the details of a language standard.
> Umm, they really do need to know that stuff.
> 
> If the developer fails to understand the language standard, then they're
> likely going to write code that is ultimately undefined or doesn't behave in
> they expect.  How is the compiler supposed to guess what the developer
> originally intended?  How should the compiler handle the case when two
> developers have different understandings of how a particular piece of code
> should work?  In the end it's the language standard that defines how all this
> stuff should work.
> 
> Failure to understand the language is a common problem and we do try to emit
> various diagnostics to help developers avoid writing non-conformant code.  But
> ultimately if a developer fails to understand the language standard, then
> they're going to be surprised by the behavior of their code.

W h a t.

This subthread concerns documenting the option better ("Without clearly
documenting such warnings ...").

Are you arguing against adding a brief notice to the documentation blurb for
the -ffp-contract= option?

Perplexed,
Alexander


Re: [RFC/PATCH] sched: Consider debug insn in no_real_insns_p [PR108273]

2023-03-20 Thread Alexander Monakov via Gcc-patches


On Mon, 20 Mar 2023, Kewen.Lin wrote:

> Hi,

Hi. Thank you for the thorough analysis. Since I analyzed
PR108519, I'd like to offer my comments.

> As PR108273 shows, when there is one block which only has
> NOTE_P and LABEL_P insns at non-debug mode while has some
> extra DEBUG_INSN_P insns at debug mode, after scheduling
> it, the DFA states would be different between debug mode
> and non-debug mode.  Since at non-debug mode, the block
> meets no_real_insns_p, it gets skipped; while at debug
> mode, it gets scheduled, even it only has NOTE_P, LABEL_P
> and DEBUG_INSN_P, the call of function advance_one_cycle
> will change the DFA state.  PR108519 also shows this issue
> issue can be exposed by some scheduler changes.

(yes, so an alternative is to avoid extraneous advance_one_cycle
calls, but I think adjusting no_real_insns_p is preferable)

> This patch is to take debug insn into account in function
> no_real_insns_p, which make us not try to schedule for the
> block having only NOTE_P, LABEL_P and DEBUG_INSN_P insns,
> resulting in consistent DFA states between non-debug and
> debug mode.  Changing no_real_insns_p caused ICE when doing
> free_block_dependencies, the root cause is that we create
> dependencies for debug insns, those dependencies are
> expected to be resolved during scheduling insns which gets
> skipped after the change in no_real_insns_p.  By checking
> the code, it looks it's reasonable to skip to compute block
> dependencies for no_real_insns_p blocks.  It can be
> bootstrapped and regtested but it hit one ICE when built
> SPEC2017 bmks at option -O2 -g.  The root cause is that
> initially there are no no_real_insns_p blocks in a region,
> but in the later scheduling one block has one insn scheduled
> speculatively then becomes no_real_insns_p, so we compute
> dependencies and rgn_n_insns for this special block before
> scheduling, later it gets skipped so not scheduled, the
> following counts would mismatch:
> 
> /* Sanity check: verify that all region insns were scheduled.  */
>   gcc_assert (sched_rgn_n_insns == rgn_n_insns);
> 
> , and we miss to release the allocated dependencies.

Hm, but it is quite normal for BBs to become empty via speculative
scheduling in non-debug mode as well. So I don't think it's the
right way to frame the problem.

I think the main issue here is that debug_insns are "not real insns"
except we add them together with normal insns in the dependency graph,
and then we verify that the graph was exhausted by the scheduler.

We already handle a situation when dbg_cnt is telling the scheduler
to skip blocks. I guess the dbg_cnt handling is broken for a similar
reason?

Can we fix this issue together with the debug_cnt issue by adjusting
dbg_cnt handling in schedule_region, i.e. if no_real_insns_p || !dbg_cnt
then adjust sched_rgn_n_insns and manually resolve+free dependencies?

> To avoid the unexpected mis-matchings, this patch adds one
> bitmap to track this kind of special block which isn't
> no_real_insns_p but becomes no_real_insns_p later, then we
> can adjust the count and free deps for it.

Per above, I hope a simpler solution is possible.

(some comments on the patch below)

> This patch can be bootstrapped and regress-tested on
> x86_64-redhat-linux, aarch64-linux-gnu and
> powerpc64{,le}-linux-gnu.
> 
> I also verified this patch can pass SPEC2017 both intrate
> and fprate bmks building at -g -O2/-O3.
> 
> This is for next stage 1, but since I know little on the
> scheduler, I'd like to post it early for more comments.
> 
> Is it on the right track?  Any thoughts?
> 
> BR,
> Kewen
> -
>   PR rtl-optimization/108273
> 
> gcc/ChangeLog:
> 
>   * haifa-sched.cc (no_real_insns_p): Consider DEBUG_INSN_P insn.
>   * sched-rgn.cc (no_real_insns): New static bitmap variable.
>   (compute_block_dependences): Skip for no_real_insns_p.
>   (free_deps_for_bb_no_real_insns_p): New function.
>   (free_block_dependencies): Call free_deps_for_bb_no_real_insns_p for
>   no_real_insns_p bb.
>   (schedule_region): Fix up sched_rgn_n_insns for some block for which
>   rgn_n_insns is computed before, and move sched_rgn_local_finish after
>   free_block_dependencies loop.
>   (sched_rgn_local_init): Allocate and compute no_real_insns.
>   (sched_rgn_local_free): Free no_real_insns.
> ---
>  gcc/haifa-sched.cc |  8 -
>  gcc/sched-rgn.cc   | 84 +++---
>  2 files changed, 87 insertions(+), 5 deletions(-)
> 
> diff --git a/gcc/haifa-sched.cc b/gcc/haifa-sched.cc
> index 48b53776fa9..378f3b34cc0 100644
> --- a/gcc/haifa-sched.cc
> +++ b/gcc/haifa-sched.cc
> @@ -5040,7 +5040,13 @@ no_real_insns_p (const rtx_insn *head, const rtx_insn 
> *tail)
>  {
>while (head != NEXT_INSN (tail))
>  {
> -  if (!NOTE_P (head) && !LABEL_P (head))
> +  /* Take debug insn into account here, otherwise we can have different
> +  DFA states after scheduling a block which 

Re: [PATCH] [RFC] RAII auto_mpfr and autp_mpz

2023-03-07 Thread Alexander Monakov via Gcc-patches


On Tue, 7 Mar 2023, Jonathan Wakely wrote:

> > Shouldn't this use the idiom suggested in ansidecl.h, i.e.
> >
> >   private:
> > DISABLE_COPY_AND_ASSIGN (auto_mpfr);
> 
> 
> Why? A macro like that (or a base class like boost::noncopyable) has
> some value in a code base that wants to work for both C++03 and C++11
> (or later). But in GCC we know we have C++11 now, so we can just
> delete members. I don't see what the macro adds.

Evidently it's possible to forget to delete one of the members, as
showcased in this very thread.

The idiom is also slightly easier to read.

Alexander


Re: [PATCH] [RFC] RAII auto_mpfr and autp_mpz

2023-03-07 Thread Alexander Monakov via Gcc-patches
Hi,

On Mon, 6 Mar 2023, Richard Biener via Gcc-patches wrote:

> --- a/gcc/realmpfr.h
> +++ b/gcc/realmpfr.h
> @@ -24,6 +24,26 @@
>  #include 
>  #include 
>  
> +class auto_mpfr
> +{
> +public:
> +  auto_mpfr () { mpfr_init (m_mpfr); }
> +  explicit auto_mpfr (mpfr_prec_t prec) { mpfr_init2 (m_mpfr, prec); }
> +  ~auto_mpfr () { mpfr_clear (m_mpfr); }
> +
> +  operator mpfr_t& () { return m_mpfr; }
> +
> +  auto_mpfr (const auto_mpfr &) = delete;
> +  auto_mpfr =(const auto_mpfr &) = delete;

Shouldn't this use the idiom suggested in ansidecl.h, i.e.

  private:
DISABLE_COPY_AND_ASSIGN (auto_mpfr);

Alexander


Re: RISC-V: Add divmod instruction support

2023-02-20 Thread Alexander Monakov via Gcc-patches


On Mon, 20 Feb 2023, Richard Biener via Gcc-patches wrote:

> On Sun, Feb 19, 2023 at 2:15 AM Maciej W. Rozycki  wrote:
> >
> > > The problem is you don't see it as a divmod in expand_divmod unless you 
> > > expose
> > > a divmod optab.  See tree-ssa-mathopts.cc's divmod handling.
> >
> >  That's the kind of stuff I'd expect to happen at the tree level though,
> > before expand.
> 
> The GIMPLE pass forming divmod could indeed choose to emit the
> div + mul/sub sequence instead if an actual divmod pattern isn't available.
> It could even generate some fake mul/sub/mod RTXen to cost the two
> variants against each other but I seriously doubt any uarch that implements
> division/modulo has a slower mul/sub.

Making a correct decision requires knowing to which degree the divider is
pipelined, and costs won't properly reflect that. If the divider accepts
a new div/mod instruction every couple of cycles, it's faster to just issue
a div followed by a mod with the same operands.

Therefore I think in this case it's fair for GIMPLE level to just check if
the divmod pattern is available, and let the target do the fine tuning via
the divmod expander.

It would make sense for tree-ssa-mathopts to emit div + mul/sub when neither
'divmod' nor 'mod' patterns are available, because RTL expansion will do the
same, just later, and we'll rely on RTL CSE to clean up the redundant div.
But RISC-V has both 'div' and 'mod', so as I tried to explain in the first
paragraph we should let the target decide.

Alexander


Re: [PATCH][X86_64] Separate znver4 insn reservations from older znvers

2023-01-03 Thread Alexander Monakov via Gcc-patches


On Tue, 3 Jan 2023, Jan Hubicka wrote:

> > * gcc/common/config/i386/i386-common.cc (processor_alias_table):
> > Use CPU_ZNVER4 for znver4.
> > * config/i386/i386.md: Add znver4.md.
> > * config/i386/znver4.md: New.
> OK,
> thanks!

Honza, I'm curious what are your further plans for this, you mentioned
merging znver4.md back in znver.md if I recall correctly?

Alexander


Re: [PATCH V2] Disable sched1 in functions that call setjmp

2022-12-24 Thread Alexander Monakov via Gcc-patches


On Sat, 24 Dec 2022, Jose E. Marchesi wrote:

> However, there is something I don't understand: wouldn't sched2
> introduce the same problem when -fsched2-use-superblocks is specified?

Superblocks are irrelevant, a call instruction does not end a basic block
and the problematic motion happens within a BB on your testcase. Didn't you
ask about this already?

> In that case, the option a) would need to be expanded to disable sched2
> as well, and b) wouldn't have effect (!after_reload)?

See my response to Qing Zhao, I think due to special-casing of pseudos
that are live at setjmp during register allocation, sched2 will not move
them in such manner (they should be assigned to memory and I don't expect
sched2 will move such MEMs across calls). But of course there may be holes
in this theory.

On some targets disabling sched2 is not so easy because it's responsible
for VLIW packing (bundling on ia64).

Alexander


Re: [PATCH V2] Disable sched1 in functions that call setjmp

2022-12-24 Thread Alexander Monakov via Gcc-patches


On Fri, 23 Dec 2022, Qing Zhao wrote:

> BTW, Why sched1 is not enabled on x86 by default?

Register allocation is tricky on x86 due to small number of general-purpose
registers, and sched1 can make it even more difficult. I think before register
pressure modeling was added, sched1 could not be enabled because then allocation
would sometimes fail, and now there's no incentive to enable it, as it is not so
important for modern x86 CPUs. Perhaps someone else has a more comprehensive
answer.

> Another question is:  As discussed in the original bug PR57067:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067 The root cause of this
> issue related to the abnormal control flow edges (from setjmp/longjmp) cannot
> be represented correctly at RTL stage, shall we fix this root cause instead? 

You'd need an experienced reviewer to work with you, especially on high-level
design decisions such as "How ABNORMAL_DISPATCHER should be represented on RTL".
I'm afraid it's not just a matter of applying a small patch in one place.

Alexander


Re: [PATCH V2] Disable sched1 in functions that call setjmp

2022-12-23 Thread Alexander Monakov via Gcc-patches



On Fri, 23 Dec 2022, Qing Zhao wrote:

> Then, sched2 still can move insn across calls? 
> So does sched2 have the same issue of incorrectly moving  the insn across a 
> call which has unknown control flow?

I think problems are unlikely because register allocator assigns pseudos that
cross setjmp to memory.

I think you hit the problem with sched1 because most testing is done on x86 and
sched1 is not enabled there, otherwise the problem would have been noticed much
earlier.

Alexander


Re: [PATCH V2] Disable sched1 in functions that call setjmp

2022-12-23 Thread Alexander Monakov via Gcc-patches


On Fri, 23 Dec 2022, Jose E. Marchesi wrote:

> > (scheduling across calls in sched2 is somewhat dubious as well, but
> > it doesn't risk register pressure issues, and on VLIW CPUs it at least
> > can result in better VLIW packing)
> 
> Does sched2 actually schedule across calls?  All the comments in the
> source code stress the fact that the second scheduler pass (after
> register allocation) works in regions that correspond to basic blocks:
> "(after reload, each region is of one block)".

A call instruction does not end a basic block.

(also, with -fsched2-use-superblocks sched2 works on regions like sched1)

Alexander


Re: [PATCH V2] Disable sched1 in functions that call setjmp

2022-12-23 Thread Alexander Monakov via Gcc-patches
On Fri, 23 Dec 2022, Qing Zhao wrote:
> >> I am a little confused, you mean pre-RA scheduler does not look at the 
> >> data flow
> >> information at all when scheduling insns across calls currently?
> > 
> > I think it does not inspect liveness info, and may extend lifetime of a 
> > pseudo
> > across a call, transforming
> > 
> >  call foo
> >  reg = 1
> >  ...
> >  use reg
> > 
> > to
> > 
> >  reg = 1
> >  call foo
> >  ...
> >  use reg
> > 
> > but this is undesirable, because now register allocation cannot select a
> > call-clobbered register for 'reg’.
> Okay, thanks for the explanation.
> 
> Then, why not just check the liveness info instead of inhibiting all 
> scheduling across calls?

Because there's almost nothing to gain from pre-RA scheduling across calls in
the first place. Remember that the call transfers control flow elsewhere and
therefore the scheduler has no idea about the pipeline state after the call
and after the return, so modeling-wise it's a gamble.

For instructions that lie on a critical path such scheduling can be useful when
it substantially reduces the difference between the priority of the call and
nearby instructions of the critical path. But we don't track which instructions
are on critical path(s) and which are not.

(scheduling across calls in sched2 is somewhat dubious as well, but
it doesn't risk register pressure issues, and on VLIW CPUs it at least
can result in better VLIW packing)

Alexander


Re: [PATCH V2] Disable sched1 in functions that call setjmp

2022-12-22 Thread Alexander Monakov via Gcc-patches


On Thu, 22 Dec 2022, Qing Zhao wrote:

> > I think scheduling across calls in the pre-RA scheduler is simply an 
> > oversight,
> > we do not look at dataflow information and with 50% chance risk extending
> > lifetime of a pseudoregister across a call, causing higher register 
> > pressure at
> > the point of the call, and potentially an extra spill.
> 
> I am a little confused, you mean pre-RA scheduler does not look at the data 
> flow
>  information at all when scheduling insns across calls currently?

I think it does not inspect liveness info, and may extend lifetime of a pseudo
across a call, transforming
 
  call foo
  reg = 1
  ...
  use reg

to

  reg = 1
  call foo
  ...
  use reg

but this is undesirable, because now register allocation cannot select a
call-clobbered register for 'reg'.

Alexander


Re: [PATCH V2] Disable sched1 in functions that call setjmp

2022-12-22 Thread Alexander Monakov via Gcc-patches


On Thu, 22 Dec 2022, Jose E. Marchesi via Gcc-patches wrote:

> The first instruction scheduler pass reorders instructions in the TRY
> block in a way `b=true' gets executed before the call to the function
> `f'.  This optimization is wrong, because `main' calls setjmp and `f'
> is known to call longjmp.
> 
> As discussed in BZ 57067, the root cause for this is the fact that
> setjmp is not properly modeled in RTL, and therefore the backend
> passes have no normalized way to handle this situation.
> 
> As Alexander Monakov noted in the BZ, many RTL passes refuse to touch
> functions that call setjmp.  This includes for example gcse,
> store_motion and cprop.  This patch adds the sched1 pass to that list.
> 
> Note that the other instruction scheduling passes are still allowed to
> run on these functions, since they reorder instructions within basic
> blocks, and therefore they cannot cross function calls.
> 
> This doesn't fix the fundamental issue, but at least assures that
> sched1 wont perform invalid transformation in correct C programs.

I think scheduling across calls in the pre-RA scheduler is simply an oversight,
we do not look at dataflow information and with 50% chance risk extending
lifetime of a pseudoregister across a call, causing higher register pressure at
the point of the call, and potentially an extra spill.

Therefore I would suggest to indeed solve the root cause, with (untested):

diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
index 948aa0c3b..343fe2bfa 100644
--- a/gcc/sched-deps.cc
+++ b/gcc/sched-deps.cc
@@ -3688,7 +3688,13 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn)

   CANT_MOVE (insn) = 1;

-  if (find_reg_note (insn, REG_SETJMP, NULL))
+  if (!reload_completed)
+   {
+ /* Do not schedule across calls, this is prone to extending lifetime
+of a pseudo and causing extra spill later on.  */
+ reg_pending_barrier = MOVE_BARRIER;
+   }
+  else if (find_reg_note (insn, REG_SETJMP, NULL))
 {
   /* This is setjmp.  Assume that all registers, not just
  hard registers, may be clobbered by this call.  */

Alexander


Re: -minstd: Require a minimum std version, without being specific

2022-12-21 Thread Alexander Monakov via Gcc


On Wed, 21 Dec 2022, Alejandro Colomar via Gcc wrote:

> > There's already a standard, portable way to check:
> > 
> > #if __STDC_VERSION__ < 201710
> > #error C17 required
> > #endif
> 
> Hmm, not my favourite to stick that in every public header file, but yes, it's
> portable.

I don't see why you'd need that in "every public header". Public headers
should be so simple they would have no need to check C version at all, no?

Alexander


Re: -minstd: Require a minimum std version, without being specific

2022-12-21 Thread Alexander Monakov via Gcc



On Wed, 21 Dec 2022, Alejandro Colomar via Gcc wrote:

> Hi,
> 
> I've long had this wish: an option similar to -std, but which would not
> specify the standard.  Rather, mark a requirement that the standard be at
> least a version.
> 
> This would be especially useful for libraries, which might for example require
> C99 or C11 to work.  They would be able to specify -minstd=c11 in their pc(5)
> file (for use with pkgconf(1)).

There's already a standard, portable way to check:

#if __STDC_VERSION__ < 201710
#error C17 required
#endif

Alexander


Re: [PATCH] i386: correct division modeling in lujiazui.md

2022-12-19 Thread Alexander Monakov via Gcc-patches
Ping. If there are any questions or concerns about the patch, please let me
know: I'm interested in continuing this cleanup at least for older AMD models.

I noticed I had an extra line in my Changelog:

>   (lua_sseicvt_si): Ditto.

It got there accidentally and I will drop it.

Alexander

On Fri, 9 Dec 2022, Alexander Monakov wrote:

> Model the divider in Lujiazui processors as a separate automaton to
> significantly reduce the overall model size. This should also result
> in improved accuracy, as pipe 0 should be able to accept new
> instructions while the divider is occupied.
> 
> It is unclear why integer divisions are modeled as if pipes 0-3 are all
> occupied. I've opted to keep a single-cycle reservation of all four
> pipes together, so GCC should continue trying to pack instructions
> around a division accordingly.
> 
> Currently top three symbols in insn-automata.o are:
> 
> 106102 r lujiazui_core_check
> 106102 r lujiazui_core_transitions
> 196123 r lujiazui_core_min_issue_delay
> 
> This patch shrinks all lujiazui tables to:
> 
> 3 r lujiazui_decoder_min_issue_delay
> 20 r lujiazui_decoder_transitions
> 32 r lujiazui_agu_min_issue_delay
> 126 r lujiazui_agu_transitions
> 304 r lujiazui_div_base
> 352 r lujiazui_div_check
> 352 r lujiazui_div_transitions
> 1152 r lujiazui_core_min_issue_delay
> 1592 r lujiazui_agu_translate
> 1592 r lujiazui_core_translate
> 1592 r lujiazui_decoder_translate
> 1592 r lujiazui_div_translate
> 3952 r lujiazui_div_min_issue_delay
> 9216 r lujiazui_core_transitions
> 
> This continues the work on reducing i386 insn-automata.o size started
> with similar fixes for division and multiplication instructions in
> znver.md [1][2]. I plan to submit corresponding fixes for
> b[td]ver[123].md as well.
> 
> [1] 
> https://inbox.sourceware.org/gcc-patches/23c795d6-403c-5927-e610-f0f1215f5...@ispras.ru/T/#m36e069d43d07d768d4842a779e26b4a0915cc543
> [2] 
> https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amona...@ispras.ru/
> 
> gcc/ChangeLog:
> 
>   PR target/87832
>   * config/i386/lujiazui.md (lujiazui_div): New automaton.
>   (lua_div): New unit.
>   (lua_idiv_qi): Correct unit in the reservation.
>   (lua_idiv_qi_load): Ditto.
>   (lua_idiv_hi): Ditto.
>   (lua_idiv_hi_load): Ditto.
>   (lua_idiv_si): Ditto.
>   (lua_idiv_si_load): Ditto.
>   (lua_idiv_di): Ditto.
>   (lua_idiv_di_load): Ditto.
>   (lua_fdiv_SF): Ditto.
>   (lua_fdiv_SF_load): Ditto.
>   (lua_fdiv_DF): Ditto.
>   (lua_fdiv_DF_load): Ditto.
>   (lua_fdiv_XF): Ditto.
>   (lua_fdiv_XF_load): Ditto.
>   (lua_ssediv_SF): Ditto.
>   (lua_ssediv_load_SF): Ditto.
>   (lua_ssediv_V4SF): Ditto.
>   (lua_ssediv_load_V4SF): Ditto.
>   (lua_ssediv_V8SF): Ditto.
>   (lua_ssediv_load_V8SF): Ditto.
>   (lua_ssediv_SD): Ditto.
>   (lua_ssediv_load_SD): Ditto.
>   (lua_ssediv_V2DF): Ditto.
>   (lua_ssediv_load_V2DF): Ditto.
>   (lua_ssediv_V4DF): Ditto.
>   (lua_ssediv_load_V4DF): Ditto.
>   (lua_sseicvt_si): Ditto.
> ---
>  gcc/config/i386/lujiazui.md | 58 +++--
>  1 file changed, 30 insertions(+), 28 deletions(-)
> 
> diff --git a/gcc/config/i386/lujiazui.md b/gcc/config/i386/lujiazui.md
> index 9046c09f2..58a230c70 100644
> --- a/gcc/config/i386/lujiazui.md
> +++ b/gcc/config/i386/lujiazui.md
> @@ -19,8 +19,8 @@
>  
>  ;; Scheduling for ZHAOXIN lujiazui processor.
>  
> -;; Modeling automatons for decoders, execution pipes and AGU pipes.
> -(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu")
> +;; Modeling automatons for decoders, execution pipes, AGU pipes, and divider.
> +(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu,lujiazui_div")
>  
>  ;; The rules for the decoder are simple:
>  ;;  - an instruction with 1 uop can be decoded by any of the three
> @@ -55,6 +55,8 @@ (define_reservation "lua_decoder01" 
> "lua_decoder0|lua_decoder1")
>  (define_cpu_unit "lua_p0,lua_p1,lua_p2,lua_p3" "lujiazui_core")
>  (define_cpu_unit "lua_p4,lua_p5" "lujiazui_agu")
>  
> +(define_cpu_unit "lua_div" "lujiazui_div")
> +
>  (define_reservation "lua_p03" "lua_p0|lua_p3")
>  (define_reservation "lua_p12" "lua_p1|lua_p2")
>  (define_reservation "lua_p1p2" "lua_p1+lua_p2")
> @@ -229,56 +231,56 @@ (define_insn_reservation "lua_idiv_qi" 21
> (and (eq_attr "memory" "none")
>  (and (eq_attr "mode" "QI")
>   (eq_attr "type" "idiv"
> -  "lua_decoder0,lua_p0p1p2p3*21")
> +  "lua_decoder0,lua_p0p1p2p3,lua_div*21")
>  
>  (define_insn_reservation "lua_idiv_qi_load" 25
>(and (eq_attr "cpu" "lujiazui")
> (and (eq_attr "memory" "load")
>  (and (eq_attr "mode" "QI")
>   (eq_attr "type" "idiv"
> - 

[PATCH] i386: correct division modeling in lujiazui.md

2022-12-09 Thread Alexander Monakov via Gcc-patches
Model the divider in Lujiazui processors as a separate automaton to
significantly reduce the overall model size. This should also result
in improved accuracy, as pipe 0 should be able to accept new
instructions while the divider is occupied.

It is unclear why integer divisions are modeled as if pipes 0-3 are all
occupied. I've opted to keep a single-cycle reservation of all four
pipes together, so GCC should continue trying to pack instructions
around a division accordingly.

Currently top three symbols in insn-automata.o are:

106102 r lujiazui_core_check
106102 r lujiazui_core_transitions
196123 r lujiazui_core_min_issue_delay

This patch shrinks all lujiazui tables to:

3 r lujiazui_decoder_min_issue_delay
20 r lujiazui_decoder_transitions
32 r lujiazui_agu_min_issue_delay
126 r lujiazui_agu_transitions
304 r lujiazui_div_base
352 r lujiazui_div_check
352 r lujiazui_div_transitions
1152 r lujiazui_core_min_issue_delay
1592 r lujiazui_agu_translate
1592 r lujiazui_core_translate
1592 r lujiazui_decoder_translate
1592 r lujiazui_div_translate
3952 r lujiazui_div_min_issue_delay
9216 r lujiazui_core_transitions

This continues the work on reducing i386 insn-automata.o size started
with similar fixes for division and multiplication instructions in
znver.md [1][2]. I plan to submit corresponding fixes for
b[td]ver[123].md as well.

[1] 
https://inbox.sourceware.org/gcc-patches/23c795d6-403c-5927-e610-f0f1215f5...@ispras.ru/T/#m36e069d43d07d768d4842a779e26b4a0915cc543
[2] 
https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amona...@ispras.ru/

gcc/ChangeLog:

PR target/87832
* config/i386/lujiazui.md (lujiazui_div): New automaton.
(lua_div): New unit.
(lua_idiv_qi): Correct unit in the reservation.
(lua_idiv_qi_load): Ditto.
(lua_idiv_hi): Ditto.
(lua_idiv_hi_load): Ditto.
(lua_idiv_si): Ditto.
(lua_idiv_si_load): Ditto.
(lua_idiv_di): Ditto.
(lua_idiv_di_load): Ditto.
(lua_fdiv_SF): Ditto.
(lua_fdiv_SF_load): Ditto.
(lua_fdiv_DF): Ditto.
(lua_fdiv_DF_load): Ditto.
(lua_fdiv_XF): Ditto.
(lua_fdiv_XF_load): Ditto.
(lua_ssediv_SF): Ditto.
(lua_ssediv_load_SF): Ditto.
(lua_ssediv_V4SF): Ditto.
(lua_ssediv_load_V4SF): Ditto.
(lua_ssediv_V8SF): Ditto.
(lua_ssediv_load_V8SF): Ditto.
(lua_ssediv_SD): Ditto.
(lua_ssediv_load_SD): Ditto.
(lua_ssediv_V2DF): Ditto.
(lua_ssediv_load_V2DF): Ditto.
(lua_ssediv_V4DF): Ditto.
(lua_ssediv_load_V4DF): Ditto.
(lua_sseicvt_si): Ditto.
---
 gcc/config/i386/lujiazui.md | 58 +++--
 1 file changed, 30 insertions(+), 28 deletions(-)

diff --git a/gcc/config/i386/lujiazui.md b/gcc/config/i386/lujiazui.md
index 9046c09f2..58a230c70 100644
--- a/gcc/config/i386/lujiazui.md
+++ b/gcc/config/i386/lujiazui.md
@@ -19,8 +19,8 @@
 
 ;; Scheduling for ZHAOXIN lujiazui processor.
 
-;; Modeling automatons for decoders, execution pipes and AGU pipes.
-(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu")
+;; Modeling automatons for decoders, execution pipes, AGU pipes, and divider.
+(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu,lujiazui_div")
 
 ;; The rules for the decoder are simple:
 ;;  - an instruction with 1 uop can be decoded by any of the three
@@ -55,6 +55,8 @@ (define_reservation "lua_decoder01" 
"lua_decoder0|lua_decoder1")
 (define_cpu_unit "lua_p0,lua_p1,lua_p2,lua_p3" "lujiazui_core")
 (define_cpu_unit "lua_p4,lua_p5" "lujiazui_agu")
 
+(define_cpu_unit "lua_div" "lujiazui_div")
+
 (define_reservation "lua_p03" "lua_p0|lua_p3")
 (define_reservation "lua_p12" "lua_p1|lua_p2")
 (define_reservation "lua_p1p2" "lua_p1+lua_p2")
@@ -229,56 +231,56 @@ (define_insn_reservation "lua_idiv_qi" 21
  (and (eq_attr "memory" "none")
   (and (eq_attr "mode" "QI")
(eq_attr "type" "idiv"
-"lua_decoder0,lua_p0p1p2p3*21")
+"lua_decoder0,lua_p0p1p2p3,lua_div*21")
 
 (define_insn_reservation "lua_idiv_qi_load" 25
 (and (eq_attr "cpu" "lujiazui")
  (and (eq_attr "memory" "load")
   (and (eq_attr "mode" "QI")
(eq_attr "type" "idiv"
-"lua_decoder0,lua_p45,lua_p0p1p2p3*21")
+"lua_decoder0,lua_p45,lua_p0p1p2p3,lua_div*21")
 
 (define_insn_reservation "lua_idiv_hi" 22
 (and (eq_attr "cpu" "lujiazui")
  (and (eq_attr "memory" "none")
   (and (eq_attr "mode" "HI")
(eq_attr "type" "idiv"
-"lua_decoder0,lua_p0p1p2p3*22")
+

Re: Separate warning/error thresholds for -Wfoo=

2022-12-06 Thread Alexander Monakov via Gcc


On Tue, 6 Dec 2022, Richard Biener via Gcc wrote:

> Implementation-wise one could do a similar trick as we have for
> global_options vs. global_options_set - add a global_options_error copy (ick!)
> (and global_options_error_set!?) and have the logic that decides whether
> a warning is an error pick that set.  Of course it's the actual pass that 
> looks
> at flag_xyz which holds the magic number so that pass would need to be able
> to see whether it needs to look at both numbers.

I'm hoping to come up with a reasonably elegant way to implement this.

> Btw, does '-Werror=array-bounds=2' imply that =1 kinds are diagnosed as
> warning?

No, =1 kinds are diagnosed as error as well.

> Does it enable -Warray-bounds=2?

Yes, -Werror=array-bounds=2 is simply decomposed into -Warray-bounds=2
-Werror=array-bounds= (since PR 48088 was fixed).

> I think the DWIM behavior isn't all that clear and probably depends on the
> actual option we want to make it work.

Yeah, I guess there will be differences of opinion here, and I'm interested
to reach community consensus. Let's say I'm mostly interested in supporting

  -Warray-bounds=2 -Werror=array-bounds=1

(errors for most certain OOB accesses, warnings otherwise).

Alexander


Separate warning/error thresholds for -Wfoo=

2022-12-06 Thread Alexander Monakov via Gcc
Greetings, David, community,

I'd like to get your input on how GCC command line interface should support
making a "tiered" warning like -Warray-bounds={1,2} an error for "tier 1"
where fewer false positives are expected, and a plain warning otherwise.

There was a recent thread mentioning the current limitation [1]:

> This also shows nicely why I don't like warnings with levels, what if I want
> -Werror=array-bounds=2 + -Warray-bounds=1?

Also in PR 48088 [2] there was a request to make it work for stack size usage:

> Stumbled on this bug today. I tried to use it in more intricate way:
> 
> -Wframe-larger-than=4096 -Werror=frame-larger-than=32768
> 
> which would only warn about any stack more than 4096+, but would fail on
> 32768+.
> 
> Does it make sense to implement desired behaviour?
> I guess it's not many '>=number' style options in gcc.

A problem with implementing DWIM semantics like above for -Wfoo=k -Werror=foo=n
combination is that technically it changes its current meaning.

If we don't want to risk that, an alternative is to introduce a new option
for selecting error threshold for a tiered warning, for example:

  -Warray-bounds=2 -Werror-level=array-bounds=1

Implementation-wise, we would need to extend common.opt language to annotate
which tier is more inclusive (generally smaller 'n' means fewer warnings, but
for -Wstack-usage and -Wstrict-aliasing it's the other way around).

Opinions? Does anybody envision problems with going the DWIM way?

Thanks.
Alexander

[1] 
https://inbox.sourceware.org/gcc-patches/2552ab22-916f-d0fe-2c78-d482f6ad8...@lauterbach.com/
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48088#c5


RE: [PATCH][X86_64] Separate znver4 insn reservations from older znvers

2022-12-01 Thread Alexander Monakov via Gcc-patches


On Thu, 1 Dec 2022, Joshi, Tejas Sanjay wrote:

> I have addressed all your comments in this revised patch, PFA and inlined 
> below.

Thank you. Honza, please let me know if any further input is needed
from my side. For reference, here's how insn-automata.o table sizes
look with this patch (top 17, in bytes):

20068 r bdver1_fp_check
20068 r bdver1_fp_transitions
26208 r slm_min_issue_delay
27244 r bdver1_fp_min_issue_delay
28518 r glm_check
28518 r glm_transitions
33345 r znver4_fpu_min_issue_delay
33690 r geode_min_issue_delay
46980 r bdver3_fp_min_issue_delay
49428 r glm_min_issue_delay
53730 r btver2_fp_min_issue_delay
53760 r znver1_fp_transitions
93960 r bdver3_fp_transitions
106102 r lujiazui_core_check
106102 r lujiazui_core_transitions
133380 r znver4_fpu_transitions
196123 r lujiazui_core_min_issue_delay

There is a plan to further reduce Lujiazui and b[td]verX table sizes
by properly modeling division units like we did for znver.md (PR 87832).

Alexander


Re: [PATCH] riscv: implement TARGET_MODE_REP_EXTENDED

2022-11-21 Thread Alexander Monakov via Gcc-patches


On Mon, 21 Nov 2022, Jeff Law wrote:

> They're writing assembly code -- in my book that means they'd better have a
> pretty good understanding of the architecture, its limitations and quirks.

That GCC ties together optimization and inline asm interface via its internal
TARGET_MODE_REP_EXTENDED macro is not a quirk of the RISC-V architecture.
It's a quirk of GCC.

Alexander


RE: [PATCH][X86_64] Separate znver4 insn reservations from older znvers

2022-11-21 Thread Alexander Monakov via Gcc-patches


On Mon, 21 Nov 2022, Joshi, Tejas Sanjay wrote:

> I have addressed all your comments in the patch attached here. I have also
> used znver4-direct for avx512 insns.

Thanks.

> * This patch increased the insn-automata.cc size from 201502 to 214902.

Assuming it's the number of lines of code, I have 102847, perhaps you're
measuring without my patches? You can use 'size -A gcc/insn-automata.o'
to measure binary size growth.

> * Compile time and binary size on my machine remains same.
> * Make check and bootstrap build have no issues.
> * Spec cpu2017 also don't have any issues with this patch.
> 
> Is this ok for trunk?

I cannot approve or reject your patch, this is up to Honza who I believe
was investigating if combining this with older Zen models makes sense.
In the meantime, I see a few more issues that can be easily corrected,
please see below.

> --- /dev/null
> +++ b/gcc/config/i386/znver4.md
> +;; FSQRT
> +(define_insn_reservation "znver4_fsqrt" 22
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "fpspc")
> +(and (eq_attr "mode" "XF")
> + (eq_attr "memory" "none"
> +  "znver4-direct,znver4-fdiv*20")

This should be znver4-fdiv*10 (not *20) according to Agner Fog's measurements.

> +;; FDIV
> +(define_insn_reservation "znver4_fp_div" 15
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "fdiv")
> +(eq_attr "memory" "none")))
> +  "znver4-direct,znver4-fdiv*15")

znver4-fdiv*6 instead of *15 here and in two patterns following this one.

> +(define_insn_reservation "znver4_sse_div_pd" 13
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssediv")
> +(and (eq_attr "mode" "V4DF,V2DF,V1DF")
> + (eq_attr "memory" "none"
> +  "znver4-direct,znver4-fdiv*7")

Agner Fog's measurements indicate fdiv*5 here.

> +
> +(define_insn_reservation "znver4_sse_div_ps" 10
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssediv")
> +(and (eq_attr "mode" "V8SF,V4SF,V2SF,SF")
> + (eq_attr "memory" "none"
> +  "znver4-direct,znver4-fdiv*5")

Agner Fog's measurements indicate fdiv*3 here.

> +
> +(define_insn_reservation "znver4_sse_div_pd_load" 20
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssediv")
> +(and (eq_attr "mode" "V4DF,V2DF,V1DF")
> + (eq_attr "memory" "load"
> +  "znver4-direct,znver4-load,znver4-fdiv*7")

fdiv*5?

> +
> +(define_insn_reservation "znver4_sse_div_ps_load" 17
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssediv")
> +(and (eq_attr "mode" "V8SF,V4SF,V2SF,SF")
> + (eq_attr "memory" "load"
> +  "znver4-direct,znver4-load,znver4-fdiv*5")

fdiv*3?

> +(define_insn_reservation "znver4_sse_div_pd_evex" 13
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssediv")
> +(and (eq_attr "mode" "V8DF")
> + (eq_attr "memory" "none"
> +  "znver4-direct,znver4-fdiv*7")

This should be twice as much as the corresponding SSE/AVX instruction
(fdiv*14 or fdiv*10; Agner Fog measured 9 cycles as reciprocal throughput).

> +
> +(define_insn_reservation "znver4_sse_div_ps_evex" 10
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssediv")
> +(and (eq_attr "mode" "V16SF")
> + (eq_attr "memory" "none"
> +  "znver4-direct,znver4-fdiv*5")

Likewise (fdiv*6).

> +(define_insn_reservation "znver4_sse_div_pd_evex_load" 20
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssediv")
> +(and (eq_attr "mode" "V8DF")
> + (eq_attr "memory" "load"
> +  "znver4-direct,znver4-load,znver4-fdiv*7")

Likewise.

> +(define_insn_reservation "znver4_sse_div_ps_evex_load" 17
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssediv")
> +(and (eq_attr "mode" "V16SF")
> + (eq_attr "memory" "load"
> +  "znver4-direct,znver4-load,znver4-fdiv*5")


Re: [PATCH] riscv: implement TARGET_MODE_REP_EXTENDED

2022-11-21 Thread Alexander Monakov via Gcc-patches


On Sun, 20 Nov 2022, Jeff Law wrote:

> > The concern, as far as I understand would be the case where the
> > assembly-sequence leaves an incompatible extension in the register.
> 
> Right.  The question in my mind is whether or not the responsibility should be
> on the compiler or on the developer to ensure the ASM output is properly
> extended.  If someone's writing ASMs, then to a large degree, I consider it
> their responsibility to make sure things are properly extended

Right. They should also find out the hard way, with zero documentation telling
them they had to (let alone *why* they had to), and without a hypothetical
-fsanitize=abi that would catch exactly the point of the missing extension
instead of letting the program crash mysteriously at a much later point.
Uphill both ways, in a world where LLVM does not place such burden on the
programmer, and even among GCC targets only mips64 made a precedent (also
without documenting the new requirement).

> -- even more so
> if having the compiler do it results in slower code independent of ASMs.

I think LLVM demonstrates well enough that a compiler can do a better job
than GCC at eliminating redundant extensions without upgrading requirements
for inline asm (in the usual C code, not for sub-word outputs of an asm).

Alexander


Re: How can Autoconf help with the transition to stricter compilation defaults?

2022-11-16 Thread Alexander Monakov via Gcc


On Wed, 16 Nov 2022, Michael Matz via Gcc wrote:

> I sympathize, and I would think a compiler emitting an error (not a 
> warning) in the situation at hand (in absence of -Werror) is overly 
> pedantic.  But, could autoconf perhaps avoid the problem?  AFAICS the 
> ac_fn_c_check_func really does only a link test to check for symbol 
> existence, and the perceived problem is that the call statement in main() 
> invokes UB.  So, let's avoid the call then while retaining the access to 
> the symbol?  Like:
> 
> -
> char foobar(void);
> int main(void) {
>   return  != 0;
> }
> -
> 
> No call involved: no reason for compiler to complain.  The prototype decl 
> itself will still be "wrong", but compilers complaining about that (in 
> absence of a pre-existing different prototype, which is avoided by 
> autoconf) seem unlikely.
> 
> Obviously this program will also say "foobar exists" if it's a data 
> symbol, but that's the same with the variant using the call on most 
> platforms (after all it's not run).
> 
> The idea is so obvious that I'm probably missing something, why autoconf 
> can't use that idiom instead.  But perhaps the (historic?) reasons why it 
> couldn't be used are gone now?

Ironically, modern GCC and LLVM optimize ' != 0' to '1' even at -O0,
and thus no symbol reference remains in the resulting assembly.

Alexander


Re: [PATCH 2/2] i386: correct x87 multiplication modeling in znver.md

2022-11-16 Thread Alexander Monakov via Gcc-patches


On Wed, 16 Nov 2022, Jan Hubička wrote:

> This looks really promising.  I will experiment with the patch for separate
> znver3 model, but I think we should be able to keep
> them unified and hopefully get both less code duplicatoin and table sizes.

Do you mean separate znver4 (not '3') model (i.e. the recent patch by AMD)?

> > >  (define_insn_reservation "znver1_ssemul_avx256_pd" 5 @@ -1220,14
> > > +1220,14 @@ (define_insn_reservation "znver1_ssemul_avx256_pd" 5
> > >   (and (eq_attr "mode" "V4DF")
> > >(and (eq_attr "type" "ssemul")
> > > (eq_attr "memory" "none"
> > > -"znver1-double,(znver1-fp0|znver1-fp1)*4")
> > > +"znver1-double,znver1-fp0*2|znver1-fp1*2")
> >
> > Do we need to include "znver1" check here?
> >
> 
> If people use nonsential combinations like -mtune=znver1 -march=znver2 this
> may help a bit.
> I do it from time to time to see differences between pipelilne models, but
> it is not too important.

Actually no change is needed, the reservation already includes a check for
znver1, it's just cut off in the patch context. Here's the full context:

(define_insn_reservation "znver1_ssemul_avx256_pd" 5
 (and (eq_attr "cpu" "znver1")
  (and (eq_attr "mode" "V4DF")
   (and (eq_attr "type" "ssemul")
(eq_attr "memory" "none"
 "znver1-double,znver1-fp0*2|znver1-fp1*2")

> > >  (define_insn_reservation "znver1_sseimul_avx256" 4
> > >  (and (eq_attr "cpu" "znver1,znver2,znver3")
> > >   (and (eq_attr "mode" "OI")
> > >(and (eq_attr "type" "sseimul")
> > > (eq_attr "memory" "none"
> > > -"znver1-double,znver1-fp0*4")
> > > +"znver1-double,znver1-fp0*2")
> >
> > znver1 native path is 128  and znver2/3 has 256 bit paths.
> > We need to split this into two reservations. One for znver1 and the other
> > for znver2/3.
> >
> 
> isn't it znver2 for 128 and znver3 for 256?

No, Zen 1 splits AVX256 instructions into pairs of 128-bit uops, Zen 2 and
Zen 3 have native 256-bit units. Zen 4 again executes AVX512 instructions
on 256-bit units.

I think a split is not needed because the preceding reservation already handles
znver2 and znver3, we just need to remove them here, like this:

diff --git a/gcc/config/i386/znver.md b/gcc/config/i386/znver.md
index 882f250f1..16b5afa5d 100644
--- a/gcc/config/i386/znver.md
+++ b/gcc/config/i386/znver.md
@@ -1242,7 +1242,7 @@ (define_insn_reservation "znver1_sseimul" 3
 "znver1-direct,znver1-fp0")

 (define_insn_reservation "znver1_sseimul_avx256" 4
-(and (eq_attr "cpu" "znver1,znver2,znver3")
+(and (eq_attr "cpu" "znver1")
  (and (eq_attr "mode" "OI")
   (and (eq_attr "type" "sseimul")
(eq_attr "memory" "none"
@@ -1260,7 +1260,7 @@ (define_insn_reservation "znver1_sseimul_load" 10
 "znver1-direct,znver1-load,znver1-fp0")

 (define_insn_reservation "znver1_sseimul_avx256_load" 11
-(and (eq_attr "cpu" "znver1,znver2,znver3")
+(and (eq_attr "cpu" "znver1")
  (and (eq_attr "mode" "OI")
   (and (eq_attr "type" "sseimul")
(eq_attr "memory" "load"

> The patch looks good.
> >
> Patch is OK then :)

For *both* patches in the series?

Alexander


Re: [PATCH] doc: Reword the description of -mrelax-cmpxchg-loop [PR 107676]

2022-11-15 Thread Alexander Monakov via Gcc-patches


On Wed, 16 Nov 2022, Hongyu Wang wrote:

> > When emitting a compare-and-swap loop for @ref{__sync Builtins}
> > and @ref{__atomic Builtins} lacking a native instruction, optimize
> > for the highly contended case by issuing an atomic load before the
> > @code{CMPXCHG} instruction, and using the @code{PAUSE} instruction
> > to save CPU power when restarting the loop.
> 
> Thanks for the correction, it looks quite clear now! Here is the
> updated patch, ok for trunk?

Please use 'git commit --author' to indicate authorship of the patch
(or simply let me push it once approved).

Jonathan, please let us know if the above wording looks fine to you?
Mainly I'm asking if '... and using' or '... and use' is easier to read.

Alexander


Re: [PATCH] doc: Reword the description of -mrelax-cmpxchg-loop [PR 107676]

2022-11-15 Thread Alexander Monakov via Gcc-patches


On Tue, 15 Nov 2022, Jonathan Wakely wrote:

> > How about the following:
> >
> > When emitting a compare-and-swap loop for @ref{__sync Builtins}
> > and @ref{__atomic Builtins} lacking a native instruction, optimize
> > for the highly contended case by issuing an atomic load before the
> > @code{CMPXCHG} instruction, and invoke the @code{PAUSE} instruction
> > when restarting the loop.
> 
> That's much better, thanks. My only remaining quibble would be that
> "invoking" an instruction seems only marginally better than running
> one. Emitting? Issuing? Using? Adding?

Right, it should be 'using'; let me also add 'to save CPU power':

When emitting a compare-and-swap loop for @ref{__sync Builtins}
and @ref{__atomic Builtins} lacking a native instruction, optimize
for the highly contended case by issuing an atomic load before the
@code{CMPXCHG} instruction, and using the @code{PAUSE} instruction
to save CPU power when restarting the loop.

Alexander


Re: [PATCH] doc: Reword the description of -mrelax-cmpxchg-loop [PR 107676]

2022-11-15 Thread Alexander Monakov via Gcc-patches
On Tue, 15 Nov 2022, Jonathan Wakely via Gcc-patches wrote:

> > @item -mrelax-cmpxchg-loop
> > @opindex mrelax-cmpxchg-loop
> >-Relax cmpxchg loop by emitting an early load and compare before cmpxchg,
> >-execute pause if load value is not expected. This reduces excessive
> >-cachline bouncing when and works for all atomic logic fetch builtins
> >-that generates compare and swap loop.
> >+For compare and swap loops that emitted by some __atomic_* builtins
> 
> s/that emitted/that are emitted/
> 
> >+(e.g. __atomic_fetch_(or|and|xor|nand) and their __atomic_*_fetch
> >+counterparts), emit an atomic load before cmpxchg instruction. If the
> 
> s/before cmpxchg/before the cmpxchg/
> 
> >+loaded value is not equal to expected, execute a pause instead of
> 
> s/not equal to expected/not equal to the expected/
> 
> >+directly run the cmpxchg instruction. This might reduce excessive
> 
> s/directly run/directly running/

This results in "... execute a pause instead of directly running the
cmpxchg instruction", which needs further TLC because:

* 'a pause' should be 'the PAUSE instruction';
* 'directly running [an instruction]' does not seem correct in context.

The option also applies to __sync builtins, not just __atomic.


How about the following:

When emitting a compare-and-swap loop for @ref{__sync Builtins}
and @ref{__atomic Builtins} lacking a native instruction, optimize
for the highly contended case by issuing an atomic load before the
@code{CMPXCHG} instruction, and invoke the @code{PAUSE} instruction
when restarting the loop.

Alexander


RE: [PATCH][X86_64] Separate znver4 insn reservations from older znvers

2022-11-15 Thread Alexander Monakov via Gcc-patches


On Tue, 15 Nov 2022, Joshi, Tejas Sanjay wrote:

> > > +;; AVX instructions
> > > +(define_insn_reservation "znver4_sse_log" 1
> > > +  (and (eq_attr "cpu" "znver4")
> > > +   (and (eq_attr "type" "sselog,sselog1")
> > > +(and (eq_attr "mode" 
> > > "V4SF,V8SF,V2DF,V4DF")
> > > + (eq_attr "memory" "none"
> > > +  "znver4-direct,znver4-fpu")
> > > +
> > > +(define_insn_reservation "znver4_sse_log_evex" 1
> > > +  (and (eq_attr "cpu" "znver4")
> > > +   (and (eq_attr "type" "sselog,sselog1")
> > > +(and (eq_attr "mode" "V16SF,V8DF")
> > > + (eq_attr "memory" "none"
> > > +
> > > +"znver4-direct,znver4-fpu0+znver4-fpu1|znver4-fpu2+znver4-fpu3")
> > > +
> > 
> > This is an AVX512 instruction, and you're modeling that it occupies two 
> > ports
> > at once and thus has half throughput, but later in the AVX512 section:
> > 
> > > +;; AVX512 instructions
> > > +(define_insn_reservation "znver4_sse_mul_evex" 3
> > > +  (and (eq_attr "cpu" "znver4")
> > > +   (and (eq_attr "type" "ssemul")
> > > +(and (eq_attr "mode" "V16SF,V8DF")
> > > + (eq_attr "memory" "none"
> > > +  "znver4-double,znver4-fpu0|znver4-fpu3")
> > 
> > none of the instructions are modeled this way. If that's on purpose, can you
> > add a comment? It's surprising, since generally AVX512 has half throughput
> > compared to AVX256 on Zen 4, but the model doesn't seem to reflect that.
> 
> > > +"znver4-direct,znver4-fpu0+znver4-fpu1|znver4-fpu2+znver4-fpu3")
> 
> AVX512 instructions (512-bitwide) occupy 2 consecutive cycles in the pipes
> they execute. So, it should be modelled as shown below:
> 
> (define_insn_reservation "znver4_sse_log_evex" 1
>(and (eq_attr "cpu" "znver4")
> (and (eq_attr "type" "sselog")
>  (and (eq_attr "mode" "V16SF,V8DF,XI")
>   (eq_attr "memory" "none"
>"znver4-double,(znver4-fpu)*2")

I think instead of (znver4-fpu)*2 there should be

  znver4-fpu0*2|znver4-fpu1*2|znver4-fpu2*2|znver4-fpu3*2

assuming the instruction occupies the same pipe on both cycles (your
variant models as if it can move from one pipe to another).

> (define_insn_reservation "znver4_sse_mul_evex" 3
>(and (eq_attr "cpu" "znver4")
> (and (eq_attr "type" "ssemul")
>  (and (eq_attr "mode" "V16SF,V8DF")
>   (eq_attr "memory" "none"
>"znver4-double,(znver4-fpu0|znver4-fpu1)*2")

Likewise here, znver4-fpu0*2|znver4-fpu1*2.

> Doing this way increased the insn-automata.cc size from 201402 lines to 
> 212189.

Please reevaluate on top of my patches, the impact will be different.

> Hope it is a tolerable increase or do you have any suggestions?

Please take the corrections above into account.

Also I think it's better to use znver4-direct rather than znver4-double for
AVX512 instructions, because they are decoded as one uop, not two (it won't
make a practical difference due to a "Fix me", but it's a simple improvement).

Thanks.

Alexander


Re: [PATCH][X86_64] Separate znver4 insn reservations from older znvers

2022-11-14 Thread Alexander Monakov via Gcc-patches


On Mon, 14 Nov 2022, Joshi, Tejas Sanjay wrote:

> [Public]
> 
> Hi,

Hi. I'm still waiting for feedback on fixes for existing models:
https://inbox.sourceware.org/gcc-patches/5ae6fc21-edc6-133-aee2-a41e16eb...@ispras.ru/T/#t
did you have a chance to look at those?

> PFA the patch which adds znver4 instruction reservations separately from older
> znver versions:
> * This also models separate div, fdiv and ssediv units accordingly.

Why are you modeling 'fdiv' and 'ssediv' separately? When preparing the above
patches, I checked that x87 and SSE divisions use the same hardware unit, and
I don't see a strong reason to artificially clone it in the model.

(integer divider is a separate unit from the floating-point divider)

> * Does not blow-up the insn-automata.cc size (it grew from 201502 to 206141 
> for me.)
> * The patch successfully builds, bootstraps, and passes make check.
> * I have also run spec, showing no regressions for 1-copy 3-iteration runs. 
> However, I observe 1.5% gain for 507.cactuBSSN_r.

I have a question on AVX512 modeling in your patch:

> +;; AVX instructions
> +(define_insn_reservation "znver4_sse_log" 1
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "sselog,sselog1")
> +(and (eq_attr "mode" "V4SF,V8SF,V2DF,V4DF")
> + (eq_attr "memory" "none"
> +  "znver4-direct,znver4-fpu")
> +
> +(define_insn_reservation "znver4_sse_log_evex" 1
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "sselog,sselog1")
> +(and (eq_attr "mode" "V16SF,V8DF")
> + (eq_attr "memory" "none"
> +  
> "znver4-direct,znver4-fpu0+znver4-fpu1|znver4-fpu2+znver4-fpu3")
> +

This is an AVX512 instruction, and you're modeling that it occupies two ports at
once and thus has half throughput, but later in the AVX512 section:

> +;; AVX512 instructions
> +(define_insn_reservation "znver4_sse_mul_evex" 3
> +  (and (eq_attr "cpu" "znver4")
> +   (and (eq_attr "type" "ssemul")
> +(and (eq_attr "mode" "V16SF,V8DF")
> + (eq_attr "memory" "none"
> +  "znver4-double,znver4-fpu0|znver4-fpu3")

none of the instructions are modeled this way. If that's on purpose, can you
add a comment? It's surprising, since generally AVX512 has half throughput
compared to AVX256 on Zen 4, but the model doesn't seem to reflect that.

Alexander


Re: [PATCH 0/2] i386: slim down insn-automata [PR 87832]

2022-11-14 Thread Alexander Monakov via Gcc-patches


On Mon, 7 Nov 2022, Alexander Monakov wrote:

> 
> On Tue, 1 Nov 2022, Alexander Monakov wrote:
> 
> > Hi,
> > 
> > I'm sending followup fixes for combinatorial explosion of znver scheduling
> > automaton tables as described in the earlier thread:
> > 
> > https://inbox.sourceware.org/gcc-patches/23c795d6-403c-5927-e610-f0f1215f5...@ispras.ru/T/#m36e069d43d07d768d4842a779e26b4a0915cc543
> 
> AMD folks, do you have any feedback?
> 
> What is the way forward for this patchset?

Ping?

> Alexander
> 
> > 
> > I think lujiazui.md and b[dt]ver[123].md have similar issues.
> > 
> > Alexander Monakov (2):
> >   i386: correct x87 division modeling in znver.md
> >   i386: correct x87 multiplication modeling in znver.md
> > 
> >  gcc/config/i386/znver.md | 67 
> >  1 file changed, 34 insertions(+), 33 deletions(-)
> > 
> > 
> 


Re: [RFC] docs: remove documentation for unsupported releases

2022-11-10 Thread Alexander Monakov via Gcc-patches


On Thu, 10 Nov 2022, Martin Liška wrote:

> On 11/10/22 08:29, Gerald Pfeifer wrote:
> > On Wed, 9 Nov 2022, Alexander Monakov wrote:
> >> For this I would suggest using the  tag to neatly fold links
> >> for old releases. Please see the attached patch.
> > 
> > Loving it, Alexander!
> > 
> > What do you guys think about unfolding all releases we, the GCC project,
> > currently support (per https://gcc.gnu.org that'd be 12.x, 11.x, and 10.x
> > at this point)?
> 
> Works for me!
> 
> > 
> > Either way: yes, please (aka approved). :-)
> 
> Alexander, can you please install such change?

Yes, pushed: https://gcc.gnu.org/onlinedocs/

Alexander


Re: [RFC] docs: remove documentation for unsupported releases

2022-11-09 Thread Alexander Monakov via Gcc-patches

On Wed, 9 Nov 2022, Martin Liška wrote:

> Hi.
> 
> I think we should remove documentation for unsupported GCC releases
> as it's indexed by Google engine.

I'd agree with previous responses that outright removing the links is
undesirable, and pointing Google to recent documentation should be done
by annotating links, e.g. via rel=canonical as indicated by Joseph.

I would add that adding rel=canonical links seems doable without modifying
old files, by configuring the web server to add HTTP Link: header.

> Second reason is that the page is long
> one one can't easily jump to Current development documentation.

For this I would suggest using the  tag to neatly fold links for
old releases. Please see the attached patch.

AlexanderFrom ab6ce8c24aa17dba8ed79f3c3f7a5e8038dd3205 Mon Sep 17 00:00:00 2001
From: Alexander Monakov 
Date: Wed, 9 Nov 2022 22:17:16 +0300
Subject: [PATCH] Fold doc links for old releases using  tag

---
 htdocs/onlinedocs/index.html | 151 ++-
 1 file changed, 96 insertions(+), 55 deletions(-)

diff --git a/htdocs/onlinedocs/index.html b/htdocs/onlinedocs/index.html
index 3410f731..03cbdbeb 100644
--- a/htdocs/onlinedocs/index.html
+++ b/htdocs/onlinedocs/index.html
@@ -18,8 +18,8 @@
  caring about internals should really be using the mainline
  versions.  -->
 
-
-  GCC 12.2 manuals:
+
+ GCC 12.2 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/;>GCC
  12.2 Manual (
 https://gcc.gnu.org/onlinedocs/gcc-12.2.0/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 12.2 manuals
-  
+  
+
 
-  GCC 11.3 manuals:
+
+  GCC 11.3 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-11.3.0/gcc/;>GCC
  11.3 Manual (
 https://gcc.gnu.org/onlinedocs/gcc-11.3.0/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 11.3 manuals
-  
+  
+
 
-  GCC 10.4 manuals:
+
+  GCC 10.4 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-10.4.0/gcc/;>GCC
  10.4 Manual (
 https://gcc.gnu.org/onlinedocs/gcc-10.4.0/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 10.4 manuals
-  
+  
+
 
-  GCC 9.5 manuals:
+
+  GCC 9.5 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-9.5.0/gcc/;>GCC
  9.5 Manual (
 https://gcc.gnu.org/onlinedocs/gcc-9.5.0/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 9.5 manuals
-  
+  
+
 
-  GCC 8.5 manuals:
+
+  GCC 8.5 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-8.5.0/gcc/;>GCC
  8.5 Manual (
 https://gcc.gnu.org/onlinedocs/gcc-8.5.0/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 8.5 manuals
-  
+  
+
 
-  GCC 7.5 manuals:
+
+  GCC 7.5 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-7.5.0/gcc/;>GCC
  7.5 Manual (
 https://gcc.gnu.org/onlinedocs/gcc-7.5.0/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 7.5 manuals
-  
+  
+
 
-  GCC 6.5 manuals:
+
+  GCC 6.5 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-6.5.0/gcc/;>GCC
  6.5 Manual (
 https://gcc.gnu.org/onlinedocs/gcc-6.5.0/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 6.5 manuals
-  
+  
+
 
-  GCC 5.5 manuals:
+
+  GCC 5.5 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-5.5.0/gcc/;>GCC
  5.5 Manual (
 https://gcc.gnu.org/onlinedocs/gcc-5.5.0/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 5.5 manuals
-  
+  
+
 
-  GCC 4.9.4 manuals:
+
+  GCC 4.9.4 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-4.9.4/gcc/;>GCC
  4.9.4 Manual ()
 https://gcc.gnu.org/onlinedocs/gcc-4.9.4/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 4.9.4 manuals
-  
+  
+
 
-  GCC 4.8.5 manuals:
+
+  GCC 4.8.5 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/;>GCC
  4.8.5 Manual ()
 https://gcc.gnu.org/onlinedocs/gcc-4.8.5/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 4.8.5 manuals
-  
-
+  
+
 
-  GCC 4.7.4 manuals:
+
+  GCC 4.7.4 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-4.7.4/gcc/;>GCC
  4.7.4 Manual ()
 https://gcc.gnu.org/onlinedocs/gcc-4.7.4/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 4.7.4 manuals
-  
-
+  
+
 
-  GCC 4.6.4 manuals:
+
+  GCC 4.6.4 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/;>GCC
  4.6.4 Manual ()
 https://gcc.gnu.org/onlinedocs/gcc-4.6.4/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 4.6.4 manuals
-  
-
+  
+
 
-  GCC 4.5.4 manuals:
+
+  GCC 4.5.4 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-4.5.4/gcc/;>GCC
  4.5.4 Manual ()
 https://gcc.gnu.org/onlinedocs/gcc-4.5.4/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 4.5.4 manuals
-  
-
+  
+
 
-  GCC 4.4.7 manuals:
+
+  GCC 4.4.7 manuals:
   
 https://gcc.gnu.org/onlinedocs/gcc-4.4.7/gcc/;>GCC
  4.4.7 Manual ()
 https://gcc.gnu.org/onlinedocs/gcc-4.4.7/docs-sources.tar.gz;>Texinfo
  sources of all the GCC 

Re: [PATCH] riscv: implement TARGET_MODE_REP_EXTENDED

2022-11-09 Thread Alexander Monakov via Gcc-patches
On Wed, 9 Nov 2022, Philipp Tomsich wrote:

> > To give a specific example that will be problematic if you go far enough 
> > down
> > the road of matching MIPS64 behavior:
> >
> > long f(void)
> > {
> > int x;
> > asm("" : "=r"(x));
> > return x;
> > }
> >
> > here GCC (unlike LLVM) omits sign extension of 'x', assuming that asm output
> > must have been sign-extended to 64 bits by the programmer.
> 
> In fact, with the proposed patch (but also without it), GCC will sign-extend:
> f:
>   sext.w a0,a0
>   ret
>   .size f, .-f

I'm aware. I said "will be problematic if ...", meaning that GCC omits sign
extension when targeting MIPS64, and if you match MIPS64 behavior on RISC-V, 
you'll get in that situation as well.

> To make sure that this is not just an extension to promote the int to
> long for the function return, I next added another empty asm to
> consume 'x'.
> This clearly shows that the extension is performed to postprocess the
> output of the asm-statement:
> 
> f:
>   # ./asm2.c:4: asm("" : "=r"(x));
>   sext.w a0,a0 # x, x
>   # ./asm2.c:5: asm("" : : "r"(x));
>   # ./asm2.c:7: }
>   ret

No, you cannot distinguish post-processing the output of the first asm vs.
pre-processing the input of the second asm. Try

  asm("" : "+r"(x));

as the second asm instead, and you'll get

f:
# t.c:17: asm("" : "=r"(x));
# t.c:18: asm("" : "+r"(x));
# t.c:20: }
sext.w  a0,a0   #, x
ret

If it helps, here's a Compiler Explorer link comparing with MIPS64:
https://godbolt.org/z/7eobvdKdK

Alexander


Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling

2022-09-28 Thread Alexander Monakov via Gcc-patches


On Tue, 27 Sep 2022, Tobias Burnus wrote:

> Ignoring (1), does the overall patch and this part otherwise look okay(ish)?
> 
> 
> Caveat: The .sys scope works well with >= sm_60 but not does not handle
> older versions. For those, the __atomic_{load/store}_n are used.  I do not
> see a good solution beyond documentation. In the way it is used (one
> thread only setting only on/off flag, no atomic increments etc.), I think
> it is unlikely to cause races without .sys scope, but as always is
> difficult to rule out some special unfortunate case where it does. At
> lease we do have now some documentation (in general) - which still needs
> to be expanded and improved.  For this feature, I did not add any wording
> in this patch: until the feature is actually enabled, it would be more
> confusing than helpful.

If the implication is that distros will ship a racy-by-default implementation,
unless they know about the problem and configure for sm_60, then no, that
doesn't look fine to me. A possible solution is not enabling a feature that
has a known correctness issue.

Alexander


Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling

2022-09-26 Thread Alexander Monakov via Gcc-patches


Hi.

My main concerns remain not addressed:

1) what I said in the opening paragraphs of my previous email;

2) device-issued atomics are not guaranteed to appear atomic to the host
unless using atom.sys and translating for CUDA compute capability 6.0+.

Item 2 is a correctness issue. Item 1 I think is a matter of policy that
is up to you to hash out with Jakub.

On Mon, 26 Sep 2022, Tobias Burnus wrote:

> In theory, compiling with "-m32 -foffload-options=-m64" or "-m32
> -foffload-options=-m32" or "-m64 -foffload-options=-m32" is supported.

I have no words.

Alexander


Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling

2022-09-21 Thread Alexander Monakov via Gcc-patches


Hi.

On the high level, I'd be highly uncomfortable with this. I guess we are in
vague agreement that it cannot be efficiently implemented. It also goes
against the good practice of accelerator programming, which requires queueing
work on the accelerator and letting it run asynchronously with the CPU with high
occupancy.

(I know libgomp still waits for the GPU to finish in each GOMP_offload_run,
but maybe it's better to improve *that* instead of piling on new slowness)

What I said above also applies to MPI+GPU scenarios: a well-designed algorithm
should arrange for MPI communications to happen in parallel with some useful
offloaded calculations. I don't see the value in implementing the ability to
invoke an MPI call from the accelerator in such inefficient fashion.

(so yes, I disagree with "it is better to provide a feature even if it is slow –
than not providing it at all", when it is advertised as a general-purpose
feature, not a purely debugging helper)


On to the patch itself. IIRC one of the questions was use of CUDA managed
memory. I think it is unsafe because device-issued atomics are not guaranteed
to appear atomic to the host, unless compiling for compute capability 6.0 or
above, and using system-scope atomics ("atom.sys").

And for non-USM code path you're relying on cudaMemcpy observing device-side
atomics in the right order.

Atomics aside, CUDA pinned memory would be a natural choice for such a tiny
structure. Did you rule it out for some reason?

Some remarks on the diff below, not intended to be a complete review.

Alexander


> --- a/libgomp/config/nvptx/target.c
> +++ b/libgomp/config/nvptx/target.c
> @@ -26,7 +26,29 @@
>  #include "libgomp.h"
>  #include 
>  
> +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var

Shouldn't this be in a header (needs to be in sync with the plugin).

> +
> +/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */
> +struct rev_offload {
> +  uint64_t fn;
> +  uint64_t mapnum;
> +  uint64_t addrs;
> +  uint64_t sizes;
> +  uint64_t kinds;
> +  int32_t dev_num;
> +  uint32_t lock;
> +};

Likewise.

> +
> +#if (__SIZEOF_SHORT__ != 2 \
> + || __SIZEOF_SIZE_T__ != 8 \
> + || __SIZEOF_POINTER__ != 8)
> +#error "Data-type conversion required for rev_offload"
> +#endif

Huh? This is not a requirement that is new for reverse offload, it has always
been like that for offloading (all ABI rules regarding type sizes, struct
layout, bitfield layout, endianness must match).

> +
> +
>  extern int __gomp_team_num __attribute__((shared));
> +extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
> +volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
>  
>  bool
>  GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
> @@ -88,16 +110,32 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t 
> mapnum,
>void **hostaddrs, size_t *sizes, unsigned short *kinds,
>unsigned int flags, void **depend, void **args)
>  {
> -  (void) device;
> -  (void) fn;
> -  (void) mapnum;
> -  (void) hostaddrs;
> -  (void) sizes;
> -  (void) kinds;
>(void) flags;
>(void) depend;
>(void) args;
> -  __builtin_unreachable ();
> +
> +  if (device != GOMP_DEVICE_HOST_FALLBACK
> +  || fn == NULL
> +  || GOMP_REV_OFFLOAD_VAR == NULL)
> +return;

Shouldn't this be an 'assert' instead?

> +
> +  while (__sync_lock_test_and_set (_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
> +;  /* spin  */
> +
> +  __atomic_store_n (_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (_REV_OFFLOAD_VAR->addrs, hostaddrs, 
> __ATOMIC_SEQ_CST);
> +  __atomic_store_n (_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (_REV_OFFLOAD_VAR->dev_num,
> + GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);

Looks like all these can be plain stores, you only need ...

> +
> +  /* 'fn' must be last.  */
> +  __atomic_store_n (_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);

... this to be atomic with 'release' semantics in the usual producer-consumer
pattern.

> +
> +  /* Processed on the host - when done, fn is set to NULL.  */
> +  while (__atomic_load_n (_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
> +;  /* spin  */
> +  __sync_lock_release (_REV_OFFLOAD_VAR->lock);
>  }
>  
>  void
> diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
> index 9d4cc62..316de74 100644
> --- a/libgomp/libgomp-plugin.c
> +++ b/libgomp/libgomp-plugin.c
> @@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
>gomp_vfatal (msg, ap);
>va_end (ap);
>  }
> +
> +void
> +GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t 
> devaddrs_ptr,
> + uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
> + void (*dev_to_host_cpy) (void *, const void *, size_t,
> +  void *),
> +   

Re: [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg

2022-09-16 Thread Alexander Monakov via Gcc-patches
On Fri, 16 Sep 2022, Uros Bizjak via Gcc-patches wrote:

> On Fri, Sep 16, 2022 at 3:32 AM Jeff Law via Gcc-patches
>  wrote:
> >
> >
> > On 9/15/22 19:06, liuhongt via Gcc-patches wrote:
> > > There's peephole2 submit in 1990s which split cmp mem, 0 to load mem,
> > > reg + test reg, reg. I don't know exact reason why gcc do this.
> > >
> > > For latest x86 processors, ciscization should help processor frontend
> > > also codesize, for processor backend, they should be the same(has same
> > > uops).
> > >
> > > So the patch deleted the peephole2, and also modify another splitter to
> > > generate more cmp mem, 0 for 32-bit target.
> > >
> > > It will help instruction fetch.
> > >
> > > for minmax-1.c minmax-2.c minmax-10, pr96891.c, it's supposed to scan 
> > > there's no
> > > comparison to 1 or -1, so adjust the testcase since under 32-bit
> > > target, we now generate cmp mem, 0 instead of load + test.
> > >
> > > Similar for pr78035.c.
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> > > No performance impact for SPEC2017 on ICX/Znver3.
> > >
> > It was almost certainly for PPro/P2 given it was rth's work from
> > 1999.Probably should have been conditionalized on PPro/P2 at the
> > time.   No worries losing it now...
> 
> Please add a tune flag in x86-tune.def under "Historical relics" and
> use it in the relevant peephole2 instead of deleting it.

When the next instruction after 'load mem; test reg, reg' is a conditional
branch, this disables macro-op fusion because Intel CPUs do not macro-fuse
'cmp mem, imm; jcc'.

It would be nice to rephrase the commit message to acknowledge this (the
statement 'has same uops' is not always true with this considered).

AMD CPUs can fuse some 'cmp mem, imm; jcc' under some conditions, so this
should be beneficial for AMD.

Alexander


Re: [PATCH] riscv: implement TARGET_MODE_REP_EXTENDED

2022-09-06 Thread Alexander Monakov via Gcc-patches
On Mon, 5 Sep 2022, Philipp Tomsich wrote:

> +riscv_mode_rep_extended (scalar_int_mode mode, scalar_int_mode mode_rep)
> +{
> +  /* On 64-bit targets, SImode register values are sign-extended to DImode.  
> */
> +  if (TARGET_64BIT && mode == SImode && mode_rep == DImode)
> +return SIGN_EXTEND;

I think this leads to a counter-intuitive requirement that a hand-written
inline asm must sign-extend its output operands that are bound to either
signed or unsigned 32-bit lvalues. Will compiler users be aware of that?

Moreover, without adjusting TARGET_TRULY_NOOP_TRUNCATION this should cause
miscompilation when a 64-bit variable is truncated to 32 bits: the pre-existing
hook says that nothing needs to be done to truncate, but the new hook says
that the result of the truncation is properly sign-extended.

The documentation for TARGET_MODE_REP_EXTENDED warns about that:

In order to enforce the representation of mode, TARGET_TRULY_NOOP_TRUNCATION
should return false when truncating to mode. 

Alexander


Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]

2022-08-30 Thread Alexander Monakov via Gcc-patches
> I see, thank you for explaining the issue, and sorry if I was a bit stubborn.
> 
> Does the attached patch (incremental change below) look better? It no longer
> has the 'shortcut' where iterating over referrers is avoided for the common
> case of plain 'gcc -O2' and no 'optimize' attributes, but fortunately TLS
> variables are not so numerous to make chasing that worthwhile.

... and of course I forgot to add the attachment.  Revised patch below.

---8<---

>From b245015ec465604799aef60b224b1e1e264d4cb8 Mon Sep 17 00:00:00 2001
From: Artem Klimov 
Date: Wed, 6 Jul 2022 17:02:01 +0300
Subject: [PATCH] ipa-visibility: Optimize TLS access [PR99619]

Fix PR99619, which asks to optimize TLS model based on visibility.
The fix is implemented as an IPA optimization: this allows to take
optimized visibility status into account (as well as avoid modifying
all language frontends).

2022-04-17  Artem Klimov  

gcc/ChangeLog:

* ipa-visibility.cc (function_and_variable_visibility): Promote
TLS access model afer visibility optimizations.
* varasm.cc (have_optimized_refs): New helper.
(optimize_dyn_tls_for_decl_p): New helper. Use it ...
(decl_default_tls_model): ... here in place of 'optimize' check.

gcc/testsuite/ChangeLog:

* gcc.dg/tls/vis-attr-gd.c: New test.
* gcc.dg/tls/vis-attr-hidden-gd.c: New test.
* gcc.dg/tls/vis-attr-hidden.c: New test.
* gcc.dg/tls/vis-flag-hidden-gd.c: New test.
* gcc.dg/tls/vis-flag-hidden.c: New test.
* gcc.dg/tls/vis-pragma-hidden-gd.c: New test.
* gcc.dg/tls/vis-pragma-hidden.c: New test.

Co-Authored-By:  Alexander Monakov  
Signed-off-by: Artem Klimov 
---
 gcc/ipa-visibility.cc | 19 +++
 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++
 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 
 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++
 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 
 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++
 .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++
 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c  | 16 ++
 gcc/varasm.cc | 32 ++-
 9 files changed, 145 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c
 create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c
 create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c
 create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c
 create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c
 create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c
 create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c

diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc
index 8a27e7bcd..3ed2b7cf6 100644
--- a/gcc/ipa-visibility.cc
+++ b/gcc/ipa-visibility.cc
@@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program)
}
 }
 
+  if (symtab->state >= IPA_SSA)
+{
+  FOR_EACH_VARIABLE (vnode)
+   {
+ tree decl = vnode->decl;
+
+ /* Upgrade TLS access model based on optimized visibility status,
+unless it was specified explicitly or no references remain.  */
+ if (DECL_THREAD_LOCAL_P (decl)
+ && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl))
+ && vnode->ref_list.referring.length ())
+   {
+ enum tls_model new_model = decl_default_tls_model (decl);
+ gcc_checking_assert (new_model >= decl_tls_model (decl));
+ set_decl_tls_model (decl, new_model);
+   }
+   }
+}
+
   if (dump_file)
 {
   fprintf (dump_file, "\nMarking local functions:");
diff --git a/gcc/varasm.cc b/gcc/varasm.cc
index 4db8506b1..ffc559431 100644
--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -6679,6 +6679,36 @@ init_varasm_once (void)
 #endif
 }
 
+/* Determine whether SYMBOL is used in any optimized function.  */
+
+static bool
+have_optimized_refs (struct symtab_node *symbol)
+{
+  struct ipa_ref *ref;
+
+  for (int i = 0; symbol->iterate_referring (i, ref); i++)
+{
+  cgraph_node *cnode = dyn_cast  (ref->referring);
+
+  if (cnode && opt_for_fn (cnode->decl, optimize))
+   return true;
+}
+
+  return false;
+}
+
+/* Check if promoting general-dynamic TLS access model to local-dynamic is
+   desirable for DECL.  */
+
+static bool
+optimize_dyn_tls_for_decl_p (const_tree decl)
+{
+  if (cfun)
+return optimize;
+  return symtab->state >= IPA && have_optimized_refs (symtab_node::get (decl));
+}
+
+
 enum tls_model
 decl_default_tls_model (const_tree decl)
 {
@@ -6696,7 +6726,7 @@ decl_default_tls_model (const_tree decl)
 
   /* Local dynamic is inefficient when we're not combining the
  parts of the address.  */
-  else if (optimize && is_local)
+  else if (is_local && optimize_dyn_tls_for_decl_p (decl))
 kind = TLS_MODEL_LOCAL_DYNAMIC;

Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]

2022-08-30 Thread Alexander Monakov via Gcc-patches
On Tue, 30 Aug 2022, Martin Jambor wrote:

> There is still the optimize attribute so in fact no, even in non-LTO
> mode if there is no current function, you cannot trust the "global"
> "optimize" thing.
> 
> Ideally we would assert that no "analysis" phase of an IPA pass reads
> the global optimization flags, so please don't add new places where we
> do.
> 
> You can either add a parameter to decl_default_tls_model to tell it
> what context it is called from and IMHO it would also be acceptable to
> check whether we have a non-NULL cfun and decide based on that (but here
> I only hope it is not something others might object to).

I see, thank you for explaining the issue, and sorry if I was a bit stubborn.

Does the attached patch (incremental change below) look better? It no longer
has the 'shortcut' where iterating over referrers is avoided for the common
case of plain 'gcc -O2' and no 'optimize' attributes, but fortunately TLS
variables are not so numerous to make chasing that worthwhile.

--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -6703,8 +6703,8 @@ have_optimized_refs (struct symtab_node *symbol)
 static bool
 optimize_dyn_tls_for_decl_p (const_tree decl)
 {
-  if (optimize)
-return true;
+  if (cfun)
+return optimize;
   return symtab->state >= IPA && have_optimized_refs (symtab_node::get (decl));
 }

Alexander


Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling

2022-08-26 Thread Alexander Monakov via Gcc-patches


On Fri, 26 Aug 2022, Tobias Burnus wrote:

> @Tom and Alexander: Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> its value.

I think to do that without polling you can use PTX 'brkpt' instruction on the
device and CUDA Debugger API on the host (but you'd have to be careful about
interactions with the real debugger).

How did the standardization process for this feature look like, how did it pass
if it's not efficiently implementable for the major offloading targets?

Alexander


Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]

2022-08-26 Thread Alexander Monakov via Gcc-patches
On Fri, 26 Aug 2022, Martin Jambor wrote:

> > +/* Check if promoting general-dynamic TLS access model to local-dynamic is
> > +   desirable for DECL.  */
> > +
> > +static bool
> > +optimize_dyn_tls_for_decl_p (const_tree decl)
> > +{
> > +  if (optimize)
> > +return true;
> 
> ...this.  This is again an access to optimize which in LTO IPA phase is
> not really meaningful.  Can the test be simply removed?  If not (but
> please look at why), I'd suggest to test the overall optimize level only
> if there is a non-NULL cfun.

I'd prefer to keep it. This code is also called from the front-ends when
assigning initial TLS access model (during parsing, at the point where
visibility attributes, if present, have not been processed yet). There
we don't have IPA structures, but 'optimize' is set up.

I also want to avoid iterating over referring functions in non-LTO mode
where trusting 'optimize' should be fine during this IPA pass I think?

Alexander


Re: public-inbox for gcc lists

2022-08-25 Thread Alexander Monakov via Gcc


On Thu, 25 Aug 2022, Mark Wielaard wrote:

> Hi gcc-hackers,
> 
> Given that gcc is part of the sourceware family the mailinglists are
> now also available through the public-inbox instance at 
> https://inbox.sourceware.org/

Thanks for this. At the moment something seems broken with a few lists,
as clicking on a thread results in a 404 page, e.g. any top link from
https://inbox.sourceware.org/gcc-bugs/

Alexander


Re: [PATCH] i386: avoid zero extension for crc32q

2022-08-24 Thread Alexander Monakov via Gcc-patches
On Tue, 23 Aug 2022, Alexander Monakov via Gcc-patches wrote:

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr106453.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-msse4.2 -O2 -fdump-rtl-final" } */
> +/* { dg-final { scan-rtl-dump-not "zero_extendsidi" "final" } } */

I noticed that the test is 64-bit only and added the following fixup in my
tree:

--- a/gcc/testsuite/gcc.target/i386/pr106453.c
+++ b/gcc/testsuite/gcc.target/i386/pr106453.c
@@ -1,4 +1,4 @@
-/* { dg-do compile } */
+/* { dg-do compile { target { ! ia32 } } */
 /* { dg-options "-msse4.2 -O2 -fdump-rtl-final" } */
 /* { dg-final { scan-rtl-dump-not "zero_extendsidi" "final" } } */




[PATCH] i386: avoid zero extension for crc32q

2022-08-23 Thread Alexander Monakov via Gcc-patches
The crc32q instruction takes 64-bit operands, but ignores high 32 bits
of the destination operand, and zero-extends the result from 32 bits.

Let's model this in the RTL pattern to avoid zero-extension when the
_mm_crc32_u64 intrinsic is used with a 32-bit type.

PR target/106453

gcc/ChangeLog:

* config/i386/i386.md (sse4_2_crc32di): Model that only low 32
bits of operand 0 are consumed, and the result is zero-extended
to 64 bits.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr106453.c: New test.
---
 gcc/config/i386/i386.md  |  6 +++---
 gcc/testsuite/gcc.target/i386/pr106453.c | 13 +
 2 files changed, 16 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr106453.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 58fcc382f..b5760bb23 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -23823,10 +23823,10 @@
 
 (define_insn "sse4_2_crc32di"
   [(set (match_operand:DI 0 "register_operand" "=r")
-   (unspec:DI
- [(match_operand:DI 1 "register_operand" "0")
+   (zero_extend:DI (unspec:SI
+ [(match_operand:SI 1 "register_operand" "0")
   (match_operand:DI 2 "nonimmediate_operand" "rm")]
- UNSPEC_CRC32))]
+ UNSPEC_CRC32)))]
   "TARGET_64BIT && TARGET_CRC32"
   "crc32{q}\t{%2, %0|%0, %2}"
   [(set_attr "type" "sselog1")
diff --git a/gcc/testsuite/gcc.target/i386/pr106453.c 
b/gcc/testsuite/gcc.target/i386/pr106453.c
new file mode 100644
index 0..bab5b1cb2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106453.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-msse4.2 -O2 -fdump-rtl-final" } */
+/* { dg-final { scan-rtl-dump-not "zero_extendsidi" "final" } } */
+
+#include 
+#include 
+
+uint32_t f(uint32_t c, uint64_t *p, size_t n)
+{
+for (size_t i = 0; i < n; i++)
+c = _mm_crc32_u64(c, p[i]);
+return c;
+}
-- 
2.35.1



Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]

2022-08-23 Thread Alexander Monakov via Gcc-patches
Ping^3.

On Fri, 5 Aug 2022, Alexander Monakov wrote:

> Ping^2.
> 
> On Wed, 20 Jul 2022, Alexander Monakov wrote:
> 
> > 
> > Ping.
> > 
> > On Thu, 7 Jul 2022, Alexander Monakov via Gcc-patches wrote:
> > 
> > > From: Artem Klimov 
> > > 
> > > Fix PR99619, which asks to optimize TLS model based on visibility.
> > > The fix is implemented as an IPA optimization: this allows to take
> > > optimized visibility status into account (as well as avoid modifying
> > > all language frontends).
> > > 
> > > 2022-04-17  Artem Klimov  
> > > 
> > > gcc/ChangeLog:
> > > 
> > >   * ipa-visibility.cc (function_and_variable_visibility): Promote
> > >   TLS access model afer visibility optimizations.
> > >   * varasm.cc (have_optimized_refs): New helper.
> > >   (optimize_dyn_tls_for_decl_p): New helper. Use it ...
> > >   (decl_default_tls_model): ... here in place of 'optimize' check.
> > > 
> > > gcc/testsuite/ChangeLog:
> > > 
> > >   * gcc.dg/tls/vis-attr-gd.c: New test.
> > >   * gcc.dg/tls/vis-attr-hidden-gd.c: New test.
> > >   * gcc.dg/tls/vis-attr-hidden.c: New test.
> > >   * gcc.dg/tls/vis-flag-hidden-gd.c: New test.
> > >   * gcc.dg/tls/vis-flag-hidden.c: New test.
> > >   * gcc.dg/tls/vis-pragma-hidden-gd.c: New test.
> > >   * gcc.dg/tls/vis-pragma-hidden.c: New test.
> > > 
> > > Co-Authored-By:  Alexander Monakov  
> > > Signed-off-by: Artem Klimov 
> > > ---
> > > 
> > > v2: run the new loop in ipa-visibility only in the whole-program IPA pass;
> > > in decl_default_tls_model, check if any referring function is 
> > > optimized
> > > when 'optimize == 0' (when running in LTO mode)
> > > 
> > > 
> > > Note for reviewers: I noticed there's a place which tries to avoid TLS
> > > promotion, but the comment seems wrong and I could not find a testcase.
> > > I'd suggest we remove it. The compiler can only promote general-dynamic
> > > to local-dynamic and initial-exec to local-exec. The comment refers to
> > > promoting x-dynamic to y-exec, but that cannot happen AFAICT:
> > > https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=8e1ba78f1b8eedd6c65c6f0e6d6d09a801de5d3d
> > > 
> > > 
> > >  gcc/ipa-visibility.cc | 19 +++
> > >  gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++
> > >  gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 
> > >  gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++
> > >  gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 
> > >  gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++
> > >  .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++
> > >  gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c  | 16 ++
> > >  gcc/varasm.cc | 32 ++-
> > >  9 files changed, 145 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c
> > >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c
> > > 
> > > diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc
> > > index 8a27e7bcd..3ed2b7cf6 100644
> > > --- a/gcc/ipa-visibility.cc
> > > +++ b/gcc/ipa-visibility.cc
> > > @@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program)
> > >   }
> > >  }
> > >  
> > > +  if (symtab->state >= IPA_SSA)
> > > +{
> > > +  FOR_EACH_VARIABLE (vnode)
> > > + {
> > > +   tree decl = vnode->decl;
> > > +
> > > +   /* Upgrade TLS access model based on optimized visibility status,
> > > +  unless it was specified explicitly or no references remain.  */
> > > +   if (DECL_THREAD_LOCAL_P (decl)
> > > +   && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl))
> > > +   && vnode->ref_list.referring.length ())
> > > + {
> > > +   enum tls_model new_model = decl_default_tls_model (decl);
&

Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]

2022-08-05 Thread Alexander Monakov via Gcc-patches
Ping^2.

On Wed, 20 Jul 2022, Alexander Monakov wrote:

> 
> Ping.
> 
> On Thu, 7 Jul 2022, Alexander Monakov via Gcc-patches wrote:
> 
> > From: Artem Klimov 
> > 
> > Fix PR99619, which asks to optimize TLS model based on visibility.
> > The fix is implemented as an IPA optimization: this allows to take
> > optimized visibility status into account (as well as avoid modifying
> > all language frontends).
> > 
> > 2022-04-17  Artem Klimov  
> > 
> > gcc/ChangeLog:
> > 
> > * ipa-visibility.cc (function_and_variable_visibility): Promote
> > TLS access model afer visibility optimizations.
> > * varasm.cc (have_optimized_refs): New helper.
> > (optimize_dyn_tls_for_decl_p): New helper. Use it ...
> > (decl_default_tls_model): ... here in place of 'optimize' check.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * gcc.dg/tls/vis-attr-gd.c: New test.
> > * gcc.dg/tls/vis-attr-hidden-gd.c: New test.
> > * gcc.dg/tls/vis-attr-hidden.c: New test.
> > * gcc.dg/tls/vis-flag-hidden-gd.c: New test.
> > * gcc.dg/tls/vis-flag-hidden.c: New test.
> > * gcc.dg/tls/vis-pragma-hidden-gd.c: New test.
> > * gcc.dg/tls/vis-pragma-hidden.c: New test.
> > 
> > Co-Authored-By:  Alexander Monakov  
> > Signed-off-by: Artem Klimov 
> > ---
> > 
> > v2: run the new loop in ipa-visibility only in the whole-program IPA pass;
> > in decl_default_tls_model, check if any referring function is optimized
> > when 'optimize == 0' (when running in LTO mode)
> > 
> > 
> > Note for reviewers: I noticed there's a place which tries to avoid TLS
> > promotion, but the comment seems wrong and I could not find a testcase.
> > I'd suggest we remove it. The compiler can only promote general-dynamic
> > to local-dynamic and initial-exec to local-exec. The comment refers to
> > promoting x-dynamic to y-exec, but that cannot happen AFAICT:
> > https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=8e1ba78f1b8eedd6c65c6f0e6d6d09a801de5d3d
> > 
> > 
> >  gcc/ipa-visibility.cc | 19 +++
> >  gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++
> >  gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 
> >  gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++
> >  gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 
> >  gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++
> >  .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++
> >  gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c  | 16 ++
> >  gcc/varasm.cc | 32 ++-
> >  9 files changed, 145 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c
> >  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c
> > 
> > diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc
> > index 8a27e7bcd..3ed2b7cf6 100644
> > --- a/gcc/ipa-visibility.cc
> > +++ b/gcc/ipa-visibility.cc
> > @@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program)
> > }
> >  }
> >  
> > +  if (symtab->state >= IPA_SSA)
> > +{
> > +  FOR_EACH_VARIABLE (vnode)
> > +   {
> > + tree decl = vnode->decl;
> > +
> > + /* Upgrade TLS access model based on optimized visibility status,
> > +unless it was specified explicitly or no references remain.  */
> > + if (DECL_THREAD_LOCAL_P (decl)
> > + && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl))
> > + && vnode->ref_list.referring.length ())
> > +   {
> > + enum tls_model new_model = decl_default_tls_model (decl);
> > + gcc_checking_assert (new_model >= decl_tls_model (decl));
> > + set_decl_tls_model (decl, new_model);
> > +   }
> > +   }
> > +}
> > +
> >if (dump_file)
> >  {
> >fprintf (dump_file, "\nMarking local functions:");
> > diff --git a/gcc/varasm.cc b/gcc/varasm.cc
> > index 4db8506b1..de149e82c 100644
> > --- a/gcc/varasm.cc
> > +++ b/gcc/varasm.cc
> > @@ -6679,6 +6679,36 @

Re: Setting up editors for the GNU/GCC coding style?

2022-07-28 Thread Alexander Monakov via Gcc


On Thu, 28 Jul 2022, David Malcolm via Gcc wrote:

> Is there documentation on setting up text editors to work with our
> coding style?  A lot of the next generation of developers aren't using
> vi or emacs; they's using VS Code, CLion, and other editors.  Does
> anyone have docs on e.g. how to set up VS Code, CLion, etc (IntelliJ ?)
> to work well on GCC's own code base.  FWIW I use Emacs; I've dabbed
> with VS Code but haven't used it "for real".
> 
> I'm hoping to add this to my newbies guide.

The tricky part is setting it up so it doesn't die indexing the codebase
(e.g. adding the entire toplevel directory probably won't work: it will
try to index the testsuites).

FWIW, CLion documentation shows how to select GNU coding style, but as
above, that's not the main worry:
https://www.jetbrains.com/help/clion/predefined-code-styles.html

Alexander


Re: [PATCH 2/2] Avoid registering __builtin_setjmp_receiver label twice [PR101347]

2022-07-20 Thread Alexander Monakov via Gcc-patches


On Wed, 20 Jul 2022, Eric Botcazou wrote:

> > Eric is probably most familiar with this, but can you make sure to bootstrap
> > and test this on a SJLJ EH target?  I'm not sure --enable-sjlj-exceptions
> > is well tested anywhere but on targets not supporting DWARF EH and the
> > configury is a bit odd suggesting the option is mostly ignored ...
> 
> This is a specific circuitry for __builtln_setjmp so it is *not* exercised by 
> the SJLJ exception scheme.  It used to be exercised by the GNAT bootstrap, 
> but 
> that's no longer the case either.
> 
> I think that the fix is sensible, assuming that it passes the C testsuite.

Yes, it passes the usual regtest.  Thanks, applying to trunk.

Alexander


Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]

2022-07-20 Thread Alexander Monakov via Gcc-patches


Ping.

On Thu, 7 Jul 2022, Alexander Monakov via Gcc-patches wrote:

> From: Artem Klimov 
> 
> Fix PR99619, which asks to optimize TLS model based on visibility.
> The fix is implemented as an IPA optimization: this allows to take
> optimized visibility status into account (as well as avoid modifying
> all language frontends).
> 
> 2022-04-17  Artem Klimov  
> 
> gcc/ChangeLog:
> 
>   * ipa-visibility.cc (function_and_variable_visibility): Promote
>   TLS access model afer visibility optimizations.
>   * varasm.cc (have_optimized_refs): New helper.
>   (optimize_dyn_tls_for_decl_p): New helper. Use it ...
>   (decl_default_tls_model): ... here in place of 'optimize' check.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/tls/vis-attr-gd.c: New test.
>   * gcc.dg/tls/vis-attr-hidden-gd.c: New test.
>   * gcc.dg/tls/vis-attr-hidden.c: New test.
>   * gcc.dg/tls/vis-flag-hidden-gd.c: New test.
>   * gcc.dg/tls/vis-flag-hidden.c: New test.
>   * gcc.dg/tls/vis-pragma-hidden-gd.c: New test.
>   * gcc.dg/tls/vis-pragma-hidden.c: New test.
> 
> Co-Authored-By:  Alexander Monakov  
> Signed-off-by: Artem Klimov 
> ---
> 
> v2: run the new loop in ipa-visibility only in the whole-program IPA pass;
> in decl_default_tls_model, check if any referring function is optimized
> when 'optimize == 0' (when running in LTO mode)
> 
> 
> Note for reviewers: I noticed there's a place which tries to avoid TLS
> promotion, but the comment seems wrong and I could not find a testcase.
> I'd suggest we remove it. The compiler can only promote general-dynamic
> to local-dynamic and initial-exec to local-exec. The comment refers to
> promoting x-dynamic to y-exec, but that cannot happen AFAICT:
> https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=8e1ba78f1b8eedd6c65c6f0e6d6d09a801de5d3d
> 
> 
>  gcc/ipa-visibility.cc | 19 +++
>  gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++
>  gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 
>  gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++
>  gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 
>  gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++
>  .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++
>  gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c  | 16 ++
>  gcc/varasm.cc | 32 ++-
>  9 files changed, 145 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c
> 
> diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc
> index 8a27e7bcd..3ed2b7cf6 100644
> --- a/gcc/ipa-visibility.cc
> +++ b/gcc/ipa-visibility.cc
> @@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program)
>   }
>  }
>  
> +  if (symtab->state >= IPA_SSA)
> +{
> +  FOR_EACH_VARIABLE (vnode)
> + {
> +   tree decl = vnode->decl;
> +
> +   /* Upgrade TLS access model based on optimized visibility status,
> +  unless it was specified explicitly or no references remain.  */
> +   if (DECL_THREAD_LOCAL_P (decl)
> +   && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl))
> +   && vnode->ref_list.referring.length ())
> + {
> +   enum tls_model new_model = decl_default_tls_model (decl);
> +   gcc_checking_assert (new_model >= decl_tls_model (decl));
> +   set_decl_tls_model (decl, new_model);
> + }
> + }
> +}
> +
>if (dump_file)
>  {
>fprintf (dump_file, "\nMarking local functions:");
> diff --git a/gcc/varasm.cc b/gcc/varasm.cc
> index 4db8506b1..de149e82c 100644
> --- a/gcc/varasm.cc
> +++ b/gcc/varasm.cc
> @@ -6679,6 +6679,36 @@ init_varasm_once (void)
>  #endif
>  }
>  
> +/* Determine whether SYMBOL is used in any optimized function.  */
> +
> +static bool
> +have_optimized_refs (struct symtab_node *symbol)
> +{
> +  struct ipa_ref *ref;
> +
> +  for (int i = 0; symbol->iterate_referring (i, ref); i++)
> +{
> +  cgraph_node *cnode = dyn_cast  (ref->referring);
> +
> +  if (cnode && opt_for_fn (cnode->decl, optimize))
> + return true;
> +}
&

Re: Use -ftls-model=local-exec for RTEMS by default?

2022-07-20 Thread Alexander Monakov via Gcc


On Wed, 20 Jul 2022, Sebastian Huber wrote:

> On 20/07/2022 13:41, Alexander Monakov wrote:
> > On Wed, 20 Jul 2022, Sebastian Huber wrote:
> > 
> >> How does Ada get its default TLS model?
> > You shouldn't need to do anything special, GCC automatically selects
> > initial-exec or local-exec for non-PIC (including PIE).
> 
> I am not sure, for this test program:
> 
> extern _Thread_local int i;
> _Thread_local int j;
> 
> int f(void)
> {
>   return i + j;
> }
> 
> I get:
[snip]

Thanks, I missed that you are asking about promoting initial-exec to local-exec
rather than x-dynamic to y-exec. There's a pending patch that implements such
promotion based on visibility information:
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598017.html

With that patch, you'll get local-exec model for the extern variable 'i' if you
inform the compiler that its definition will end up in the current module:

__attribute__((visibility("hidden")))
extern _Thread_local int i;
_Thread_local int j;

int f(void)
{
  return i + j;
}

Thus I would try to enhance the binds_local_p target hook for RTEMS to inform
the compiler that there's no dynamic linking (although apart from TLS variables
I cannot instantly name other places where it would enhance optimization).

HTH
Alexander


Re: Use -ftls-model=local-exec for RTEMS by default?

2022-07-20 Thread Alexander Monakov via Gcc
On Wed, 20 Jul 2022, Sebastian Huber wrote:

> How does Ada get its default TLS model?

You shouldn't need to do anything special, GCC automatically selects
initial-exec or local-exec for non-PIC (including PIE).

Alexander


Re: [PATCH 2/3] tree-cfg: do not duplicate returns_twice calls

2022-07-19 Thread Alexander Monakov via Gcc-patches
On Tue, 19 Jul 2022, Richard Biener wrote:

> > Like below?
> 
> Yes.
> 
> Thanks and sorry for the back and forth - this _is_ a mightly
> complicated area ...

No problem!  This is the good, healthy kind of back-and-forth, and
I am grateful.

Pushed, including the tree-cfg validator enhancement in patch 3/3.

Alexander


[PATCH 2/2] Avoid registering __builtin_setjmp_receiver label twice [PR101347]

2022-07-19 Thread Alexander Monakov via Gcc-patches
The testcase in the PR demonstrates how it is possible for one
__builtin_setjmp_receiver label to appear in
nonlocal_goto_handler_labels list twice (after the block with
__builtin_setjmp_setup referring to it was duplicated).

remove_node_from_insn_list did not account for this possibility and
removed only the first copy from the list. Add an assert verifying that
duplicates are not present.

To avoid adding a label to the list twice, move registration of the
label from __builtin_setjmp_setup handling to __builtin_setjmp_receiver.

gcc/ChangeLog:

PR rtl-optimization/101347
* builtins.cc (expand_builtin) [BUILT_IN_SETJMP_SETUP]: Move
population of nonlocal_goto_handler_labels from here ...
(expand_builtin) [BUILT_IN_SETJMP_RECEIVER]: ... to here.
* rtlanal.cc (remove_node_from_insn_list): Verify that a
duplicate is not present in the remainder of the list.
---
 gcc/builtins.cc | 15 +++
 gcc/rtlanal.cc  |  1 +
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index e6816d5c8..12a688dd8 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -7467,15 +7467,7 @@ expand_builtin (tree exp, rtx target, rtx subtarget, 
machine_mode mode,
  tree label = TREE_OPERAND (CALL_EXPR_ARG (exp, 1), 0);
  rtx_insn *label_r = label_rtx (label);
 
- /* This is copied from the handling of non-local gotos.  */
  expand_builtin_setjmp_setup (buf_addr, label_r);
- nonlocal_goto_handler_labels
-   = gen_rtx_INSN_LIST (VOIDmode, label_r,
-nonlocal_goto_handler_labels);
- /* ??? Do not let expand_label treat us as such since we would
-not want to be both on the list of non-local labels and on
-the list of forced labels.  */
- FORCED_LABEL (label) = 0;
  return const0_rtx;
}
   break;
@@ -7488,6 +7480,13 @@ expand_builtin (tree exp, rtx target, rtx subtarget, 
machine_mode mode,
  rtx_insn *label_r = label_rtx (label);
 
  expand_builtin_setjmp_receiver (label_r);
+ nonlocal_goto_handler_labels
+   = gen_rtx_INSN_LIST (VOIDmode, label_r,
+nonlocal_goto_handler_labels);
+ /* ??? Do not let expand_label treat us as such since we would
+not want to be both on the list of non-local labels and on
+the list of forced labels.  */
+ FORCED_LABEL (label) = 0;
  return const0_rtx;
}
   break;
diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index ec95ecd6c..56da7435a 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -2899,6 +2899,7 @@ remove_node_from_insn_list (const rtx_insn *node, 
rtx_insn_list **listp)
  else
*listp = temp->next ();
 
+ gcc_checking_assert (!in_insn_list_p (temp->next (), node));
  return;
}
 
-- 
2.35.1



[PATCH 1/2] Remove unused remove_node_from_expr_list

2022-07-19 Thread Alexander Monakov via Gcc-patches
This function remains unused since remove_node_from_insn_list was cloned
from it.

gcc/ChangeLog:

* rtl.h (remove_node_from_expr_list): Remove declaration.
* rtlanal.cc (remove_node_from_expr_list): Remove (no uses).
---
 gcc/rtl.h  |  1 -
 gcc/rtlanal.cc | 29 -
 2 files changed, 30 deletions(-)

diff --git a/gcc/rtl.h b/gcc/rtl.h
index 488016bb4..645c009a3 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -3712,7 +3712,6 @@ extern unsigned hash_rtx_cb (const_rtx, machine_mode, int 
*, int *,
 extern rtx regno_use_in (unsigned int, rtx);
 extern int auto_inc_p (const_rtx);
 extern bool in_insn_list_p (const rtx_insn_list *, const rtx_insn *);
-extern void remove_node_from_expr_list (const_rtx, rtx_expr_list **);
 extern void remove_node_from_insn_list (const rtx_insn *, rtx_insn_list **);
 extern int loc_mentioned_in_p (rtx *, const_rtx);
 extern rtx_insn *find_first_parameter_load (rtx_insn *, rtx_insn *);
diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index d78cc6024..ec95ecd6c 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -2878,35 +2878,6 @@ in_insn_list_p (const rtx_insn_list *listp, const 
rtx_insn *node)
   return false;
 }
 
-/* Search LISTP (an EXPR_LIST) for an entry whose first operand is NODE and
-   remove that entry from the list if it is found.
-
-   A simple equality test is used to determine if NODE matches.  */
-
-void
-remove_node_from_expr_list (const_rtx node, rtx_expr_list **listp)
-{
-  rtx_expr_list *temp = *listp;
-  rtx_expr_list *prev = NULL;
-
-  while (temp)
-{
-  if (node == temp->element ())
-   {
- /* Splice the node out of the list.  */
- if (prev)
-   XEXP (prev, 1) = temp->next ();
- else
-   *listp = temp->next ();
-
- return;
-   }
-
-  prev = temp;
-  temp = temp->next ();
-}
-}
-
 /* Search LISTP (an INSN_LIST) for an entry whose first operand is NODE and
remove that entry from the list if it is found.
 
-- 
2.35.1



[committed] .gitignore: do not ignore config.h

2022-07-19 Thread Alexander Monakov via Gcc-patches
GCC does not support in-tree builds at the moment, so .gitignore
concealing artifacts of accidental in-tree ./configure run may cause
confusion. Un-ignore config.h, which is known to break the build.

ChangeLog:

* .gitignore: Do not ignore config.h.
---
 .gitignore | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index 021a8c741..5cc4a0fdf 100644
--- a/.gitignore
+++ b/.gitignore
@@ -23,7 +23,8 @@
 
 autom4te.cache
 config.cache
-config.h
+# GCC does not support in-tree builds, do not conceal a stray config.h:
+# config.h
 config.intl
 config.log
 config.status
-- 
2.35.1


Re: [PATCH 2/3] tree-cfg: do not duplicate returns_twice calls

2022-07-14 Thread Alexander Monakov via Gcc-patches


On Thu, 14 Jul 2022, Richard Biener wrote:

> Indeed.  Guess that's what __builtin_setjmp[_receiver] for SJLJ_EH got 
> "right".
> 
> When copying a block we do not copy labels so any "jumps" remain to the 
> original
> block and thus we are indeed able to isolate normal control flow.  Given that
> returns_twice functions _do_ seem to be special, and also special as to how
> we handle other abnormal receivers in duplicate_block.

We do? Sorry, I don't see what you mean here, can you point me to specific 
lines?

> So it might indeed make sense to special-case them in can_duplicate_block_p
> ... (sorry for going back-and-forth ...)
> 
> Note that I think this detail of duplicate_block (the function) and the hook
> needs to be better documented (the semantics on incoming edges, not 
> duplicating
> labels used for incoming control flow).
> 
> Can you see as to how to adjust the RTL side for this?  It looks like at least
> some places set a REG_SETJMP note on call_insns (emit_call_1), I wonder
> if in rtl_verify_flow_info[_1] (or its callees) we can check that such
> calls come
> first ... they might not since IIRC we do _not_ preserve abnormal edges when
> expanding RTL (there's some existing bug about this and how this breaks some
> setjmp tests) (but we try to recompute them?).

No, we emit arguments/return value handling before/after a REG_SETJMP call,
and yeah, we don't always properly recompute abnormal edges, so improving
RTL in this respect seems hopeless. For example, it is easy enough to create
a testcase where bb-reordering duplicates a returns_twice call, although it
runs so late that perhaps later passes don't care:

// gcc -O2 --param=max-grow-copy-bb-insns=100
__attribute__((returns_twice))
int rtwice(int);
int g1(int), g2(int);
void f(int i)
{
  do {
i = i%2 ? g1(i) : g2(i);
  } while (i = rtwice(i));
}

FWIW, I also investigated https://gcc.gnu.org/PR101347

> Sorry about the back-and-forth again ... your original patch looks OK for the
> GIMPLE side but can you amend the cfghooks.{cc,h} documentation to
> summarize our findings and
> the desired semantics of duplicate_block in this respect?

Like below?

---8<---

Subject: [PATCH v3] tree-cfg: do not duplicate returns_twice calls

A returns_twice call may have associated abnormal edges that correspond
to the "second return" from the call. If the call is duplicated, the
copies of those edges also need to be abnormal, but e.g. tracer does not
enforce that. Just prohibit the (unlikely to be useful) duplication.

gcc/ChangeLog:

* cfghooks.cc (duplicate_block): Expand comment.
* tree-cfg.cc (gimple_can_duplicate_bb_p): Reject blocks with
calls that may return twice.
---
 gcc/cfghooks.cc | 13 ++---
 gcc/tree-cfg.cc |  7 +--
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/gcc/cfghooks.cc b/gcc/cfghooks.cc
index e435891fa..c6ac9532c 100644
--- a/gcc/cfghooks.cc
+++ b/gcc/cfghooks.cc
@@ -1086,9 +1086,16 @@ can_duplicate_block_p (const_basic_block bb)
   return cfg_hooks->can_duplicate_block_p (bb);
 }
 
-/* Duplicates basic block BB and redirects edge E to it.  Returns the
-   new basic block.  The new basic block is placed after the basic block
-   AFTER.  */
+/* Duplicate basic block BB, place it after AFTER (if non-null) and redirect
+   edge E to it (if non-null).  Return the new basic block.
+
+   If BB contains a returns_twice call, the caller is responsible for 
recreating
+   incoming abnormal edges corresponding to the "second return" for the copy.
+   gimple_can_duplicate_bb_p rejects such blocks, while RTL likes to live
+   dangerously.
+
+   If BB has incoming abnormal edges for some other reason, their destinations
+   should be tied to label(s) of the original BB and not the copy.  */
 
 basic_block
 duplicate_block (basic_block bb, edge e, basic_block after, copy_bb_data *id)
diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc
index f846dc2d8..5bcf78198 100644
--- a/gcc/tree-cfg.cc
+++ b/gcc/tree-cfg.cc
@@ -6346,12 +6346,15 @@ gimple_can_duplicate_bb_p (const_basic_block bb)
 {
   gimple *g = gsi_stmt (gsi);
 
-  /* An IFN_GOMP_SIMT_ENTER_ALLOC/IFN_GOMP_SIMT_EXIT call must be
+  /* Prohibit duplication of returns_twice calls, otherwise associated
+abnormal edges also need to be duplicated properly.
+An IFN_GOMP_SIMT_ENTER_ALLOC/IFN_GOMP_SIMT_EXIT call must be
 duplicated as part of its group, or not at all.
 The IFN_GOMP_SIMT_VOTE_ANY and IFN_GOMP_SIMT_XCHG_* are part of such a
 group, so the same holds there.  */
   if (is_gimple_call (g)
- && (gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC)
+ && (gimple_call_flags (g) & ECF_RETURNS_TWICE
+ || gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC)
  || gimple_call_internal_p (g, IFN_GOMP_SIMT_EXIT)
  || gimple_call_internal_p (g, IFN_GOMP_SIMT_VOTE_ANY)
  || gimple_call_internal_p (g, 

Re: [PATCH 2/3] tree-cfg: do not duplicate returns_twice calls

2022-07-13 Thread Alexander Monakov via Gcc-patches
On Wed, 13 Jul 2022, Richard Biener wrote:

> > > The thing to check would be incoming abnormal edges in
> > > can_duplicate_block_p, not (only) returns twice functions?
> >
> > Unfortunately not, abnormal edges are also used for computed gotos, which 
> > are
> > less magic than returns_twice edges and should not block tracer I think.
> 
> I think computed gotos should use regular edges, only non-local goto should
> use abnormals...

Yeah, afaict it's not documented what "abnormal" is supposed to mean :/

> I suppose asm goto also uses abnormal edges?

Heh, no, asm goto appears to use normal edges, but there's an old gap in
their specification: can you use them like computed gotos, i.e. can asm-goto
jump to a computed target? Or must they be similar to plain gotos where the
jump label is redirectable (because it's substitutable in the asm template)?

If you take a restrictive interpretation (asm goto may not jump to a computed
label) then using regular edges looks fine.

> Btw, I don't see how they in general are "less magic".  Sure, we have an
> explicit receiver (the destination label), but we can only do edge inserts
> if we have a single computed goto edge into a block (we can "move" the
> label to the block created when splitting the edge).

Sure, they are a bit magic, but returns_twice edges are even more magic: their
destination looks tied to a label in the IR, but in reality their destination
is inside a call that returns twice (hence GCC must be careful not to insert
anything between the label and the call, like in patch 1/3).

> > This implies patch 1/3 [1] unnecessary blocks sinking to computed goto 
> > targets.
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588498.html
> >
> > How would you like to proceed here? Is my initial patch ok?
> 
> Hmm, so for returns twice calls duplicate_block correctly copies the call
> and redirects the provided incoming edge to it.  The API does not
> handle adding any further incoming edges - the caller would be responsible
> for this.  So I still somewhat fail to see the point here.  If tracer does not
> handle extra incoming edges properly then we need to fix tracer?

I think abnormal edges corresponding to computed gotos are fine: we are
attempting to create a chain of blocks with no incoming edges in the middle,
right? Destinations of computed gotos remain at labels of original blocks.

Agreed about correcting this in the tracer.

> This also includes non-local goto (we seem to copy non-local labels just
> fine - wasn't there a bugreport about this!?).

Sorry, no idea about this.

> So I think can_duplicate_block_p is the wrong place to fix (the RTL side
> would need a similar fix anyhow?)

Right. I'm happy to leave both RTL and GIMPLE can_duplicate_block_p as is,
and instead constrain just the tracer. Alternative patch below:

* tracer.cc (analyze_bb): Disallow duplication of returns_twice calls.

diff --git a/gcc/tracer.cc b/gcc/tracer.cc
index 64517846d..422e2b6a7 100644
--- a/gcc/tracer.cc
+++ b/gcc/tracer.cc
@@ -132,14 +132,19 @@ analyze_bb (basic_block bb, int *count)
   gimple *stmt;
   int n = 0;

+  bool can_dup = can_duplicate_block_p (CONST_CAST_BB (bb));
+
   for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next ())
 {
   stmt = gsi_stmt (gsi);
   n += estimate_num_insns (stmt, _size_weights);
+  if (can_dup && cfun->calls_setjmp && gimple_code (stmt) == GIMPLE_CALL
+ && gimple_call_flags (stmt) & ECF_RETURNS_TWICE)
+   can_dup = false;
 }
   *count = n;

-  cache_can_duplicate_bb_p (bb, can_duplicate_block_p (CONST_CAST_BB (bb)));
+  cache_can_duplicate_bb_p (bb, can_dup);
 }

 /* Return true if E1 is more frequent than E2.  */



  1   2   >