[PATCH] Tweak language choice in config-list.mk

2023-09-07 Thread Richard Sandiford via Gcc-patches
When I tried to use config-list.mk, the build for every triple except
the build machine's failed for m2.  This is because, unlike other
languages, m2 builds target objects during all-gcc.  The build will
therefore fail unless you have access to an appropriate binutils
(or an equivalent).  That's quite a big ask for over 100 targets. :)

This patch therefore makes m2 an optional inclusion.

Doing that wasn't entirely straightforward though.  The current
configure line includes "--enable-languages=all,...", which means
that the "..." can only force languages to be added that otherwise
wouldn't have been.  (I.e. the only effect of the "..." is to
override configure autodetection.)

The choice of all,ada and:

  # Make sure you have a recent enough gcc (with ada support) in your path so
  # that --enable-werror-always will work.

make it clear that lack of GNAT should be a build failure rather than
silently ignored.  This predates the D frontend, which requires GDC
in the same way that Ada requires GNAT.  I don't know of a reason
why D should be treated differently.

The patch therefore expands the "all" into a specific list of
languages.

That in turn meant that Fortran had to be handled specially,
since bpf and mmix don't support Fortran.

Perhaps there's an argument that m2 shouldn't build target objects
during all-gcc, but (a) it works for practical usage and (b) the
patch is an easy workaround.  I'd be happy for the patch to be
reverted if the build system changes.

OK to install?

Richard


gcc/
* contrib/config-list.mk (OPT_IN_LANGUAGES): New variable.
($(LIST)): Replace --enable-languages=all with a specific list.
Disable fortran on bpf and mmix.  Enable the languages in
OPT_IN_LANGUAGES.
---
 contrib/config-list.mk | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/contrib/config-list.mk b/contrib/config-list.mk
index e570b13c71b..50ecb014bc0 100644
--- a/contrib/config-list.mk
+++ b/contrib/config-list.mk
@@ -12,6 +12,11 @@ TEST=all-gcc
 # supply an absolute path.
 GCC_SRC_DIR=../../gcc
 
+# Define this to ,m2 if you want to build Modula-2.  Modula-2 builds target
+# objects during all-gcc, so it can only be included if you've installed
+# binutils (or an equivalent) for each target.
+OPT_IN_LANGUAGES=
+
 # Use -j / -l make arguments and nice to assure a smooth resource-efficient
 # load on the build machine, e.g. for 24 cores:
 # svn co svn://gcc.gnu.org/svn/gcc/branches/foo-branch gcc
@@ -126,17 +131,23 @@ $(LIST): make-log-dir
 	TGT=`echo $@ | awk 'BEGIN { FS = "OPT" }; { print $$1 }'` &&	\
 	TGT=`$(GCC_SRC_DIR)/config.sub $$TGT` &&			\
 	case $$TGT in							\
-	*-*-darwin* | *-*-cygwin* | *-*-mingw* | *-*-aix* | bpf-*-*)	\
+	bpf-*-*)							\
 		ADDITIONAL_LANGUAGES="";				\
 		;;							\
-	*)								\
+	*-*-darwin* | *-*-cygwin* | *-*-mingw* | *-*-aix* | bpf-*-*)	\
+		ADDITIONAL_LANGUAGES=",fortran";			\
+		;;							\
+	mmix-*-*)							\
 		ADDITIONAL_LANGUAGES=",go";				\
 		;;							\
+	*)								\
+		ADDITIONAL_LANGUAGES=",fortran,go";			\
+		;;							\
 	esac &&								\
 	$(GCC_SRC_DIR)/configure					\
 		--target=$(subst SCRIPTS,`pwd`/../scripts/,$(subst OPT,$(empty) -,$@))	\
 		--enable-werror-always ${host_options}			\
-		--enable-languages=all,ada$$ADDITIONAL_LANGUAGES;	\
+		--enable-languages=c,ada,c++,d,lto,objc,obj-c++,rust$$ADDITIONAL_LANGUAGES$(OPT_IN_LANGUAGES); \
 	) > log/$@-config.out 2>&1
 
 $(LOGFILES) : log/%-make.out : %
-- 
2.25.1



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-06 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>unsupported masking for `vmovsh'.  I guess that's a latent backend
>problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>    (cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2)) 0).
>    That looks like a costing problem that can hopefully be solved by making
>the second compare more expensive, preventing the propagation.
>i386 costing (or every costing?) is brittle so that could well break other
>things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>propagated with different register classes (V2DF vs DI).  With the patch
>we check the register pressure, find the class NO_REGS for V2DF and
>abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>of keeping the old behavior for reg-reg propagations and only checking
>the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>set a REG_EQUALS note.
>Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>We create redundant whilelos and I'm not sure how to improve that. I
>guess a peephole is out of the question :)
>
>  - pred-combine-and.c: Here the new propagation appears useful at first.
>We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>individual and registers remain live up to the propagation site (while
>being dead before the patch).
>With the registers dead, combine could create a single fcmgt before.
>Now it only manages a 2->2 combination because we still need the registers
>and end up with two fcmgts.
>The code is worse but this seems more bad luck than anything.
>
>  - Addressing fails from before:  I looked into these and suspect all of
>    them are similar.
>What happens is that we have a poly_int offset that we shift, negate
>and then add to x0.  The result is used as load address.
>Before, we would pull (combine) the (plus x0 reg) into the load keeping
>the neg and shift.
>Now we propagate everything into a single (set (minus x0 offset)).
>The propagation itself seems worthwhile because we save one insn.
>However as we got rid of the base/offset split by lumping everything
>together, combine cannot pull the (plus) into the address load and
>we require an aarch64_split_add_offset.  This will emit the longer
>sequence of ashiftl and subtract.  The "base" address is x0 here so
>we cannot convert (minus x0 ...)) into neg.
>I didn't go through all of aarch64_split_add_offset.  I suppose we
>could re-add the separation of base/offset there but that might be
>a loss when the result is not used as an address. 
>
> Again, all in all no fatal problems but pretty annoying :)  It's not much
> but just gradually worse than with just UNARY.  Any idea on how/whether to
> continue?

Thanks for giving it a go.  Can you post the latest version of the
regpressure patch too?  The previous on-list version I could find
seems to be too old.

Thanks,
Richard

> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * fwprop.cc (fwprop_propagation::profitable_p): Add unary
>   handling.
>   (fwprop_propagation::update_register_pressure): New function.
>   (fwprop_propagation::register_pressure_high_p): New function
>   (reg_single_def_for_src_p): Look through unary expressions.
>   (try_fwprop_subst_pattern): Check register pressure.
>   (forward_propagate_into): Call new function.
>   (fwprop_init): Init register pressure.
>   (fwprop_done): Clean up register pressure.
>   (fwprop_insn): Add comment.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> ---
>  gcc/fwprop.cc | 359 +-
>  .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 
>  2 files changed, 419 insertions(+), 4 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
>
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index 0707a234726..ce6f5a74b00 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
>  #include "target.h"
> +#include "dominance.h"

Re: Question on aarch64 prologue code.

2023-09-06 Thread Richard Sandiford via Gcc
Iain Sandoe  writes:
> Hi Folks,
>
> On the Darwin aarch64 port, we have a number of cleanup test fails (pretty 
> much corresponding to the [still open] 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39244).  However, let’s assume 
> that bug could be a red herring..
>
> the underlying reason is missing CFI for the set of the FP which [with 
> Darwin’s LLVM libunwind impl.] breaks the unwind through the function that 
> triggers a signal.

Just curious, do you have more details about why that is?  If the unwinder
is sophisticated enough to process CFI, it seems odd that it requires the
CFA to be defined in terms of the frame pointer.
>
> ———
>
> taking one of the functions in cleanup-8.C (say fn1) which contains calls.
>
> what I am seeing is something like:
>
> __ZL3fn1v:
> LFB28:
> ; BLOCK 2, count:1073741824 (estimated locally) seq:0
> ; PRED: ENTRY [always]  count:1073741824 (estimated locally, freq 1.) 
> (FALLTHRU)
>   stp x29, x30, [sp, -32]!
> // LCFI; or .cfi_xxx is present
>   mov x29, sp
> // *** NO  LCFI (or .cfi_cfa_ when that is enabled)
>   str x19, [sp, 16]
> // LCFI / .cfi_ is present.
>   adrpx19, __ZL3fn4i@PAGE
>   add x19, x19, __ZL3fn4i@PAGEOFF;momd
>   mov x1, x19
>   mov w0, 11
>   bl  _signal
> 
>
> ———
>
> The reason seems to be that, in expand_prolog, emit_frame_chain is true (as 
> we would expect, given that this function makes calls).  However 
> ‘frame_pointer_needed' is false, so that the call to aarch64_add_offset() 
> [line aarch64.cc:10405] does not add CFA adjustments to the load of x29.

Right.

> ———
>
> I have currently worked around this by defining a 
> TARGET_FRAME_POINTER_REQUIRED which returns true unless the function is a 
> leaf (if that’s the correct solution, then all is fine).

I suppose it depends on why the frame-pointer-based CFA is important
for Darwin.  If it's due to a more general requirement for a frame
pointer to be used, then yeah, that's probably the right fix.  If it's
more a quirk of the unwinder, then we could probably expose whatever
that quirk is as a new status bit.  Target-independent code in
dwarf2cfi.cc would then need to be aware as well.

> ———
>
> However, it does seem odd that the existing code sets up the FP, but never 
> produces any CFA for it.
>
> So is this a possible bug, or just that I misunderstand the relevant set of 
> circumstances?

emit_frame_chain fulfills an ABI requirement that every non-leaf function
set up a frame-chain record.  When emit_frame_chain && !frame_pointer_needed,
we set up the FP for ABI purposes only.  GCC can still access everything
relative to the stack pointer, and it can still describe the CFI based
purely on the stack pointer.

glibc-based systems only need the CFA to be based on the frame pointer
if the stack pointer moves during the body of the function (usually due
to alloca or VLAs).
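
For example (hypothetical functions, purely to illustrate the distinction):

  void use (int *);

  void fixed_frame (void)
  {
    int buf[16];
    use (buf);     /* sp does not move after the prologue, so an
                      sp-based CFA stays valid throughout */
  }

  void moving_frame (int n)
  {
    int buf[n];    /* the VLA moves sp in the body, so the CFA has to
                      be expressed in terms of the frame pointer */
    use (buf);
  }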

Thanks,
Richard


Re: [PATCH] LoongArch: Fix unintentional bash-ism in r14-3665.

2023-09-06 Thread Richard Sandiford via Gcc-patches
Yang Yujie  writes:
> gcc/ChangeLog:
>
>   * config.gcc: remove non-POSIX syntax "<<<".

OK.  Thanks for the quick fix.

Richard.

> ---
>  gcc/config.gcc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index b2fe7c7ceef..6d4c8becd28 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -5189,7 +5189,7 @@ case "${target}" in
>   if test x${parse_state} = x"abi-base"; then
>   # Base ABI type
>   case ${component} in
> -				lp64d | lp64f | lp64s) elem_tmp="ABI_BASE_$(tr a-z A-Z <<< ${component}),";;
> +				lp64d | lp64f | lp64s) elem_tmp="ABI_BASE_$(echo ${component} | tr a-z A-Z),";;
> 				*)
> 					echo "Unknown base ABI \"${component}\" in --with-multilib-list." 1>&2
> 					exit 1


Re: [PATCH v1 2/6] LoongArch: improved target configuration interface

2023-09-06 Thread Richard Sandiford via Gcc-patches
Yang Yujie  writes:
> @@ -5171,25 +5213,21 @@ case "${target}" in
>   # ${with_multilib_list} should not contain whitespaces,
>   # consecutive commas or slashes.
>   if echo "${with_multilib_list}" \
> - | grep -E -e "[[:space:]]" -e '[,/][,/]' -e '[,/]$' -e '^[,/]' 
> > /dev/null; then
> + | grep -E -e "[[:space:]]" -e '[,/][,/]' -e '[,/]$' -e '^[,/]' 
> > /dev/null 2>&1; then
>   echo "Invalid argument to --with-multilib-list." 1>&2
>   exit 1
>   fi
>  
> - unset component idx elem_abi_base elem_abi_ext elem_tmp
> + unset component elem_abi_base elem_abi_ext elem_tmp parse_state 
> all_abis
>   for elem in $(echo "${with_multilib_list}" | tr ',' ' '); do
> - idx=0
> - while true; do
> - idx=$((idx + 1))
> - component=$(echo "${elem}" | awk -F'/' '{print 
> $'"${idx}"'}')
> -
> - case ${idx} in
> - 1)
> - # Component 1: Base ABI type
> + unset elem_abi_base elem_abi_ext
> + parse_state="abi-base"
> +
> + for component in $(echo "${elem}" | tr '/' ' '); do
> + if test x${parse_state} = x"abi-base"; then
> + # Base ABI type
>   case ${component} in
> - lp64d) elem_tmp="ABI_BASE_LP64D,";;
> - lp64f) elem_tmp="ABI_BASE_LP64F,";;
> - lp64s) elem_tmp="ABI_BASE_LP64S,";;
> +				lp64d | lp64f | lp64s) elem_tmp="ABI_BASE_$(tr a-z A-Z <<< ${component}),";;

"<<<" isn't portable shell.  Could you try with:

  echo ${component} | tr ...

instead?

As it stands, this causes a bootstrap failure with non-bash shells
such as dash, even on non-Loongson targets.

(Part of me wishes that we'd just standardise on bash.  But since that
isn't the policy, I sometimes use dash to pick up my own lapses.)

Thanks,
Richard


Re: [PATCH 10/11] aarch64: Fix branch-protection error message tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> Update tests for the new branch-protection parser errors.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/branch-protection-attr.c: Update.
>   * gcc.target/aarch64/branch-protection-option.c: Update.

OK, thanks.  (And I agree these are better messages. :))

I think that's the last of the AArch64-specific ones.  The others
will need to be reviewed by Kyrill or Richard.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c   | 6 +++---
>  gcc/testsuite/gcc.target/aarch64/branch-protection-option.c | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c 
> b/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> index 272000c2747..dae2a758a56 100644
> --- a/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-protection-attr.c
> @@ -4,19 +4,19 @@ void __attribute__ ((target("branch-protection=leaf")))
>  foo1 ()
>  {
>  }
> -/* { dg-error {invalid protection type 'leaf' in 'target\("branch-protection="\)' pragma or attribute} "" { target *-*-* } 5 } */
> +/* { dg-error {invalid argument 'leaf' for 'target\("branch-protection="\)'} "" { target *-*-* } 5 } */
>  /* { dg-error {pragma or attribute 'target\("branch-protection=leaf"\)' is not valid} "" { target *-*-* } 5 } */
>  
>  void __attribute__ ((target("branch-protection=none+pac-ret")))
>  foo2 ()
>  {
>  }
> -/* { dg-error "unexpected 'pac-ret' after 'none'" "" { target *-*-* } 12 } */
> +/* { dg-error {argument 'none' can only appear alone in 'target\("branch-protection="\)'} "" { target *-*-* } 12 } */
>  /* { dg-error {pragma or attribute 'target\("branch-protection=none\+pac-ret"\)' is not valid} "" { target *-*-* } 12 } */
>  
>  void __attribute__ ((target("branch-protection=")))
>  foo3 ()
>  {
>  }
> -/* { dg-error {missing argument to 'target\("branch-protection="\)' pragma or attribute} "" { target *-*-* } 19 } */
> +/* { dg-error {invalid argument '' for 'target\("branch-protection="\)'} "" { target *-*-* } 19 } */
>  /* { dg-error {pragma or attribute 'target\("branch-protection="\)' is not valid} "" { target *-*-* } 19 } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c 
> b/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> index 1b3bf4ee2b8..e2f847a31c4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> +++ b/gcc/testsuite/gcc.target/aarch64/branch-protection-option.c
> @@ -1,4 +1,4 @@
>  /* { dg-do "compile" } */
>  /* { dg-options "-mbranch-protection=leaf -mbranch-protection=none+pac-ret" } */
>  
> -/* { dg-error "unexpected 'pac-ret' after 'none'"  "" { target *-*-* } 0 } */
> +/* { dg-error "argument 'none' can only appear alone in '-mbranch-protection='" "" { target *-*-* } 0 } */


Re: [PATCH 07/11] aarch64: Disable branch-protection for pcs tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> The tests manipulate the return address in abitest-2.h and thus not
> compatible with -mbranch-protection=pac-ret+leaf or
> -mbranch-protection=gcs.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/aapcs64/func-ret-1.c: Disable branch-protection.
>   * gcc.target/aarch64/aapcs64/func-ret-2.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-3.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-4.c: Likewise.
>   * gcc.target/aarch64/aapcs64/func-ret-64x1_1.c: Likewise.

OK, thanks.

Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c  | 1 +
>  gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c | 1 +
>  5 files changed, 5 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> index 5405e1e4920..7bd7757efe6 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-1.c
> @@ -4,6 +4,7 @@
> AAPCS64 \S 4.1.  */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> index 6b171c46fbb..85a822ace4a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-2.c
> @@ -4,6 +4,7 @@
> Homogeneous floating-point aggregate types are covered in func-ret-3.c.  
> */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> index ad312b675b9..1d35ebf14b4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-3.c
> @@ -4,6 +4,7 @@
> in AAPCS64 \S 4.3.5.  */
>  
>  /* { dg-do run { target aarch64-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  /* { dg-require-effective-target aarch64_big_endian } */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> index af05fbe9fdf..15e1408c62d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-4.c
> @@ -5,6 +5,7 @@
> are treated as general composite types.  */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  /* { dg-require-effective-target aarch64_big_endian } */
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c 
> b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> index 05957e2dcae..fe7bbb6a835 100644
> --- a/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/aapcs64/func-ret-64x1_1.c
> @@ -3,6 +3,7 @@
>Test 64-bit singleton vector types which should be in FP/SIMD registers.  
> */
>  
>  /* { dg-do run { target aarch64*-*-* } } */
> +/* { dg-additional-options "-mbranch-protection=none" } */
>  /* { dg-additional-sources "abitest.S" } */
>  
>  #ifndef IN_FRAMEWORK


Re: [PATCH 06/11] aarch64: Fix pac-ret eh_return tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> This is needed since eh_return no longer prevents pac-ret in the
> normal return path.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/return_address_sign_1.c: Move func4 to ...
>   * gcc.target/aarch64/return_address_sign_2.c: ... here and fix the
>   scan asm check.
>   * gcc.target/aarch64/return_address_sign_b_1.c: Move func4 to ...
>   * gcc.target/aarch64/return_address_sign_b_2.c: ... here and fix the
>   scan asm check.
> ---
>  .../gcc.target/aarch64/return_address_sign_1.c  | 13 +
>  .../gcc.target/aarch64/return_address_sign_2.c  | 17 +++--
>  .../aarch64/return_address_sign_b_1.c   | 11 ---
>  .../aarch64/return_address_sign_b_2.c   | 17 +++--
>  4 files changed, 31 insertions(+), 27 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> index 232ba67ade0..114a9dacb3f 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_1.c
> @@ -37,16 +37,5 @@ func3 (int a, int b, int c)
>/* autiasp */
>  }
>  
> -/* eh_return.  */
> -void __attribute__ ((target ("arch=armv8.3-a")))
> -func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> -{
> -  /* no paciasp */
> -  *ptr = imm1 + foo (imm1) + imm2;
> -  __builtin_eh_return (offset, handler);
> -  /* no autiasp */
> -  return;
> -}
> -
> -/* { dg-final { scan-assembler-times "autiasp" 3 } } */
>  /* { dg-final { scan-assembler-times "paciasp" 3 } } */
> +/* { dg-final { scan-assembler-times "autiasp" 3 } } */

I suppose there is no normal return path here.  I don't know how quickly
we'd realise that though, in the sense that the flag register becomes known-1.
But a quick-and-dirty check would be whether the exit block has a single
predecessor, which in a function that calls eh_return should mean
that the eh_return is unconditional.
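(As a very rough, untested sketch of that quick-and-dirty check, using the
usual crtl/cfun helpers:

  /* If the exit block has a single predecessor in a function that calls
     eh_return, the eh_return path must be unconditional, so there is no
     normal return path left to protect.  */
  static bool
  eh_return_unconditional_p (void)
  {
    return (crtl->calls_eh_return
            && single_pred_p (EXIT_BLOCK_PTR_FOR_FN (cfun)));
  }

where exactly such a check would live is a separate question.)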

But that might not be worth worrying about, given the builtin's limited
use case.  And even if it is worth worrying about, keeping the test in
this file would mix correctness with optimisation, which isn't a good
thing for scan-assembler-times.

So yeah, I agree this is OK.  It should probably be part of 03 though,
so that no individual commit causes a regression.

Thanks,
Richard

> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> index a4bc5b45333..d93492c3c43 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_2.c
> @@ -14,5 +14,18 @@ func1 (int a, int b, int c)
>/* retaa */
>  }
>  
> -/* { dg-final { scan-assembler-times "paciasp" 1 } } */
> -/* { dg-final { scan-assembler-times "retaa" 1 } } */
> +/* eh_return.  */
> +void __attribute__ ((target ("arch=armv8.3-a")))
> +func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> +{
> +  /* paciasp */
> +  *ptr = imm1 + foo (imm1) + imm2;
> +  if (handler)
> +/* br */
> +__builtin_eh_return (offset, handler);
> +  /* retaa */
> +  return;
> +}
> +
> +/* { dg-final { scan-assembler-times "paciasp" 2 } } */
> +/* { dg-final { scan-assembler-times "retaa" 2 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> index 43e32ab6cb7..697fa30dc5a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_1.c
> @@ -37,16 +37,5 @@ func3 (int a, int b, int c)
>/* autibsp */
>  }
>  
> -/* eh_return.  */
> -void __attribute__ ((target ("arch=armv8.3-a")))
> -func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> -{
> -  /* no pacibsp */
> -  *ptr = imm1 + foo (imm1) + imm2;
> -  __builtin_eh_return (offset, handler);
> -  /* no autibsp */
> -  return;
> -}
> -
>  /* { dg-final { scan-assembler-times "pacibsp" 3 } } */
>  /* { dg-final { scan-assembler-times "autibsp" 3 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c 
> b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> index 9ed64ce0591..748924c72f3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/return_address_sign_b_2.c
> @@ -14,5 +14,18 @@ func1 (int a, int b, int c)
>/* retab */
>  }
>  
> -/* { dg-final { scan-assembler-times "pacibsp" 1 } } */
> -/* { dg-final { scan-assembler-times "retab" 1 } } */
> +/* eh_return.  */
> +void __attribute__ ((target ("arch=armv8.3-a")))
> +func4 (long offset, void *handler, int *ptr, int imm1, int imm2)
> +{
> +  /* paciasp */
> +  *ptr = imm1 + foo (imm1) + imm2;
> +  if (handler)
> +/* br */
> +__builtin_eh_return (offset, handler);
> +  /* retab */
> +  return;
> +}
> +
> +/* { dg-final { 

Re: [PATCH 05/11] aarch64: Add eh_return compile tests

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/eh_return-2.c: New test.
>   * gcc.target/aarch64/eh_return-3.c: New test.

OK.

I wonder if it's worth using check-function-bodies for -3.c though.
It would then be easy to verify that the autiasp only occurs on the
normal return path.

Just a suggestion -- the current test is fine too.

Thanks,
Richard

> ---
>  gcc/testsuite/gcc.target/aarch64/eh_return-2.c |  9 +
>  gcc/testsuite/gcc.target/aarch64/eh_return-3.c | 14 ++
>  2 files changed, 23 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/eh_return-2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/eh_return-3.c
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/eh_return-2.c 
> b/gcc/testsuite/gcc.target/aarch64/eh_return-2.c
> new file mode 100644
> index 000..4a9d124e891
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/eh_return-2.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-final { scan-assembler "add\tsp, sp, x5" } } */
> +/* { dg-final { scan-assembler "br\tx6" } } */
> +
> +void
> +foo (unsigned long off, void *handler)
> +{
> +  __builtin_eh_return (off, handler);
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/eh_return-3.c 
> b/gcc/testsuite/gcc.target/aarch64/eh_return-3.c
> new file mode 100644
> index 000..35989eee806
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/eh_return-3.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mbranch-protection=pac-ret+leaf" } */
> +/* { dg-final { scan-assembler "add\tsp, sp, x5" } } */
> +/* { dg-final { scan-assembler "br\tx6" } } */
> +/* { dg-final { scan-assembler "hint\t25 // paciasp" } } */
> +/* { dg-final { scan-assembler "hint\t29 // autiasp" } } */
> +
> +void
> +foo (unsigned long off, void *handler, int c)
> +{
> +  if (c)
> +return;
> +  __builtin_eh_return (off, handler);
> +}


Re: [PATCH 04/11] aarch64: Do not force a stack frame for EH returns

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> EH returns no longer rely on clobbering the return address on the stack
> so forcing a stack frame is not necessary.
>
> This does not actually change the code gen for the unwinder since there
> are calls before the EH return.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_needs_frame_chain): Do not
>   force frame chain for eh_return.

OK once we've agreed on something for 03/11.

Thanks,
Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 36cd172d182..afdbf4213c1 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -8417,8 +8417,7 @@ aarch64_output_probe_sve_stack_clash (rtx base, rtx 
> adjustment,
>  static bool
>  aarch64_needs_frame_chain (void)
>  {
> -  /* Force a frame chain for EH returns so the return address is at FP+8.  */
> -  if (frame_pointer_needed || crtl->calls_eh_return)
> +  if (frame_pointer_needed)
>  return true;
>  
>/* A leaf function cannot have calls or write LR.  */


Re: [PATCH 01/11] aarch64: AARCH64_ISA_RCPC was defined twice

2023-09-05 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.h (AARCH64_ISA_RCPC): Remove dup.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 2b0fc97bb71..c783cb96c48 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -222,7 +222,6 @@ enum class aarch64_feature : unsigned char {
>  #define AARCH64_ISA_MOPS(aarch64_isa_flags & AARCH64_FL_MOPS)
>  #define AARCH64_ISA_LS64(aarch64_isa_flags & AARCH64_FL_LS64)
>  #define AARCH64_ISA_CSSC(aarch64_isa_flags & AARCH64_FL_CSSC)
> -#define AARCH64_ISA_RCPC   (aarch64_isa_flags & AARCH64_FL_RCPC)
>  
>  /* Crypto is an optional extension to AdvSIMD.  */
>  #define TARGET_CRYPTO (AARCH64_ISA_CRYPTO)


Re: testsuite: Port 'check-function-bodies' to nvptx

2023-09-05 Thread Richard Sandiford via Gcc-patches
Thomas Schwinge  writes:
> Hi!
>
> On 2023-09-04T23:05:05+0200, I wrote:
>> On 2019-07-16T15:04:49+0100, Richard Sandiford  
>> wrote:
>>> This patch therefore adds a new check-function-bodies dg-final test
>
>>> The regexps in parse_function_bodies are fairly general, but might
>>> still need to be extended in future for targets like Darwin or AIX.
>>
>> ..., or nvptx.  [...]
>
>> number of TODO items.
>>
>> In particular how to parameterize regular expressions for the different
>> syntax used by nvptx: for example, parameterize via global variables,
>> initialized accordingly (where?)?  Thinking about it, maybe simply
>> conditionalizing the current local initializations by
>> 'if { [istarget nvptx-*-*] } { [...] } else { [...] }' will do, simple
>> enough!
>
> Indeed that works fine.
>
>> Regarding whitespace prefixed, I think I'll go with the current
>> 'append function_regexp "\t" $line "\n"', that is, prefix expected output
>> lines with '\t' (as done in 'gcc.target/nvptx/abort.c'), and also for
>> nvptx handle labels as "fluff" (until we solve that issue generally).
>
> I changed my mind about that: instead of '\t', use '\t*' for nvptx, which
> means that both instructions emitted with additional whitespace prefixed
> and labels in column zero work nicely.
>
>> --- a/gcc/testsuite/lib/scanasm.exp
>> +++ b/gcc/testsuite/lib/scanasm.exp
>
>> @@ -907,7 +911,8 @@ proc check-function-bodies { args } {
>>
>>  set count 0
>>  set function_regexp ""
>> -set label {^(\S+):$}
>> +#TODO
>> +set label {^// BEGIN GLOBAL FUNCTION DEF: ([a-zA-Z_]\S+)$}
>
> There's actually no reason that the expected output syntax (this one) has
> to match the assembly -- so I restored that, to use the same syntax for
> nvptx here, too.
>
> Any comments before I push the attached
> "testsuite: Port 'check-function-bodies' to nvptx"?
>
>
> Grüße
>  Thomas
>
>
>
> From bdaf7572d9d4c1988274405840de4071ded3733f Mon Sep 17 00:00:00 2001
> From: Thomas Schwinge 
> Date: Mon, 4 Sep 2023 22:28:12 +0200
> Subject: [PATCH] testsuite: Port 'check-function-bodies' to nvptx
>
> This extends commit 4d706ff86ea86868615558e92407674a4f4b4af9
> "Add dg test for matching function bodies" for nvptx.
>
>   gcc/testsuite/
>   * lib/scanasm.exp (configure_check-function-bodies): New proc.
>   (parse_function_bodies, check-function-bodies): Use it.
>   * gcc.target/nvptx/abort.c: Use 'check-function-bodies'.
>   gcc/
>   * doc/sourcebuild.texi (check-function-bodies): Update.

LGTM.  Just a minor comment:

> ---
>  gcc/doc/sourcebuild.texi   |  9 ++-
>  gcc/testsuite/gcc.target/nvptx/abort.c | 19 ++-
>  gcc/testsuite/lib/scanasm.exp  | 76 --
>  3 files changed, 83 insertions(+), 21 deletions(-)
>
> diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
> index 1a78b3c1abb..8aec6b6592c 100644
> --- a/gcc/doc/sourcebuild.texi
> +++ b/gcc/doc/sourcebuild.texi
> @@ -3327,9 +3327,12 @@ The first line of the expected output for a function 
> @var{fn} has the form:
>  Subsequent lines of the expected output also start with @var{prefix}.
>  In both cases, whitespace after @var{prefix} is not significant.
>  
> -The test discards assembly directives such as @code{.cfi_startproc}
> -and local label definitions such as @code{.LFB0} from the compiler's
> -assembly output.  It then matches the result against the expected
> +Depending on the configuration (see
> +@code{gcc/testsuite/lib/scanasm.exp:configure_check-function-bodies}),

I can imagine such a long string wouldn't format well in the output.
How about: @code{configure_check-function-bodies} in
@file{gcc/testsuite/lib/scanasm.exp}?

OK from my POV with that change.

Thanks,
Richard

> +the test may discard from the compiler's assembly output
> +directives such as @code{.cfi_startproc},
> +local label definitions such as @code{.LFB0}, and more.
> +It then matches the result against the expected
>  output for a function as a single regular expression.  This means that
>  later lines can use backslashes to refer back to @samp{(@dots{})}
>  captures on earlier lines.  For example:
> diff --git a/gcc/testsuite/gcc.target/nvptx/abort.c 
> b/gcc/testsuite/gcc.target/nvptx/abort.c
> index d3220687400..ae9dbf45a9b 100644
> --- a/gcc/testsuite/gcc.target/nvptx/abort.c
> +++ b/gcc/testsuite/gcc.target/nvptx/abort.c
> @@ -1,4 +1,6 @@
>  /* { dg-do compile} */
> +/* { dg-final { check-function-bodies {**} {} } } */
> +
>  /* Annotate no return functions with a trailing 'trap'.  */
>  
>  extern void abort ();
> @@ -9,5 +11,18 @@ int main (int argc, char **argv)
>  abort ();
>return 0;
>  }
> -
> -/* { dg-final { scan-assembler "call 

Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-09-05 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
>> So I don't think I have a good feel for the advantages and disadvantages
>> of doing this.  Robin's analysis of the aarch64 changes was nice and
>> detailed though.  I think the one that worries me most is the addressing
>> mode one.  fwprop is probably the first chance we get to propagate adds
>> into addresses, and virtual register elimination means that some of
>> those opportunities won't show up in gimple.
>> 
>> There again, virtual register elimination wouldn't be the reason for
>> the ld4_s8.c failure.  Perhaps there's something missing in expand.
>> 
>> Other than that, I think my main question is: why just unary operations?
>> Is the underlying assumption that we only want to propagate a maximum of
>> one register?  If so, then I think we should check for that directly, by
>> iterating over subrtxes.
>
> The main reason for stopping at unary operations was to limit the scope
> and change as little as possible (not restricting the change to one
> register).  I'm currently testing a v2 that iterates over subrtxs.

Thanks.  Definitely no problem with doing things in small steps, but IMO
it's better if each choice of step can still be justified in its own terms.
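(For concreteness, by "iterating over subrtxes" I mean something along the
lines of the following rough, untested sketch, i.e. counting register
references directly rather than keying off UNARY_P:

  /* Return true if SRC contains at most one register reference.  */
  static bool
  src_uses_at_most_one_reg_p (const_rtx src)
  {
    unsigned int nregs = 0;
    subrtx_iterator::array_type array;
    FOR_EACH_SUBRTX (iter, array, src, NONCONST)
      if (REG_P (*iter) && ++nregs > 1)
        return false;
    return true;
  }

The exact form doesn't matter; the point is just to test the property we
actually care about.)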

>> Perhaps we should allow the optimisation without register-pressure
>> information if (a) the source register and destination register are
>> in the same pressure class and (b) all uses of the destination are
>> being replaced.  (FWIW, rtl-ssa should make it easier to try to
>> replace all definitions at once, with an all-or-nothing choice,
>> if we ever wanted to do that.)
>
> I presume you're referring to replacing one register (dest) in all using
> insns?  Source and destination are somewhat overloaded in fwprop context
> because I'm thinking of the "to be replaced" register as dest when it's
> actually the replacement register.

Yeah.

> AFAICT fwprop currently iterates over insns, going through all their uses
> and trying if an individual use can be substituted.  Do you suggest to
> change this general iteration order to iterate over the defs of an insn
> and then try to replace all the uses at once (e.g. using ssa->change_insns)?

No, I was just noting in passing that we could try to do that if we wanted to.
The current code is a fairly mechanical conversion of the original DF-based
code, but there's no reason why it has to continue to work the way it
does now.

> When keeping the current order, wouldn't we need to store all potential
> changes instead of committing them and later apply them in bulk, e.g.
> grouped by use?  This order would also help to pick the propagation
> with the most number of uses (i.e. propagation potential) but maybe
> I'm misunderstanding?

I imagine doing it in reverse postorder would still make sense.

But my point was that, for the current fwprop limitation of substituting
into exactly one use of a register, we can check whether that use is
the *only* use of register.

I.e. if we substitute:

  A: (set (reg R1) (foo (reg R2)))

into:

  B: (set ... (reg R1) ...)

if R1 and R2 are likely to be in the same register class, and if B
is the only user of R1, then we don't need to calculate register
pressure.  The change is either neutral (if R2 died in A) or an
improvement (if R2 doesn't die in A, and so R1 and R2 were previously
live at the same time).
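
A made-up source-level example of that single-use situation:

  int g (int);

  int f (int a)
  {
    int x = -a;        /* A: (set R1 (neg R2)), with R2 holding 'a' */
    return g (x + 1);  /* B: the only use of R1 */
  }

where folding the negation into B lets A disappear entirely.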

Thanks,
Richard


Re: RFC: Introduce -fhardened to enable security-related flags

2023-09-04 Thread Richard Sandiford via Gcc-patches
Qing Zhao via Gcc-patches  writes:
>> On Aug 29, 2023, at 3:42 PM, Marek Polacek via Gcc-patches 
>>  wrote:
>> 
>> Improving the security of software has been a major trend in the recent
>> years.  Fortunately, GCC offers a wide variety of flags that enable extra
>> hardening.  These flags aren't enabled by default, though.  And since
>> there are a lot of hardening flags, with more to come, it's been difficult
>> to keep on top of them; more so for the users of GCC who ought not to be
>> expected to keep track of all the new options.
>> 
>> To alleviate some of the problems I mentioned, we thought it would
>> be useful to provide a new umbrella option that enables a reasonable set
>> of hardening flags.  What's "reasonable" in this context is not easy to
>> pin down.  Surely, there must be no ABI impact, the option cannot cause
>> severe performance issues, and, I suspect, it should not cause build
>> errors by enabling stricter compile-time errors (such as, -Wimplicit-int,
>> -Wint-conversion).  Including a controversial option in -fhardened
>> would likely cause that users would not use -fhardened at all.  It's
>> roughly akin to -Wall or -O2 -- those also enable a reasonable set of
>> options, and evolve over time, and are not kept in sync with other
>> compilers.
>> 
>> Currently, -fhardened enables:
>> 
>>  -D_FORTIFY_SOURCE=3 (or =2 for older glibcs)
>>  -D_GLIBCXX_ASSERTIONS
>>  -ftrivial-auto-var-init=zero
>>  -fPIE  -pie  -Wl,-z,relro,-z,now
>>  -fstack-protector-strong
>>  -fstack-clash-protection
>>  -fcf-protection=full (x86 GNU/Linux only)
>> 
>> -fsanitize=undefined is specifically not enabled.  -fstrict-flex-arrays is
>> also liable to break a lot of code so I didn't include it.
>> 
>> Appended is a proof-of-concept patch.  It doesn't implement --help=hardened
>> yet.  A fairly crucial point is that -fhardened will not override options
>> that were specified on the command line (before or after -fhardened).  For
>> example,
>> 
>> -D_FORTIFY_SOURCE=1 -fhardened
>> 
>> means that _FORTIFY_SOURCE=1 will be used.  Similarly,
>> 
>>  -fhardened -fstack-protector
>> 
>> will not enable -fstack-protector-strong.
>> 
>> Thoughts?
>
> In general, I think that it is a very good idea to provide umbrella options
>  for software security purpose.  Thanks a lot for this work!
>
> 1. I do agree with Martin, multiple-level control for this purpose might be
> needed, similar to multiple levels for warnings and multiple levels for
> optimizations.
>
> Similar to optimization options, can we organize all the security options
> together in our manual, so that users have a good central place to get more
> complete information on the security features our compiler provides?
>
> 2. What are the major criteria for deciding which security feature should go
> into this list?
> Later, when we have new security features, how do we decide whether to add
> them to this list or not?
> I am wondering why -fzero-call-used-regs is not included in the list and also

FWIW, I wondered the same thing.  Not a strong conviction that it should
be included -- maybe the code bloat is too much on some targets.  But it
might be acceptable for the -fhardened equivalent of -O3, at least if
restricted to GPRs.
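(Restricting it to GPRs is already expressible with the existing flag and
attribute spellings -- sketch only, nothing -fhardened-specific:

  /* Build with -fzero-call-used-regs=used-gpr, or mark individual
     functions: call-used general-purpose registers are cleared on
     return, without touching the vector registers.  */
  int __attribute__ ((zero_call_used_regs ("used-gpr")))
  transform (int key)
  {
    return key * 41;
  }

so a stricter -fhardened level could in principle just enable that.)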
 
> Why choose -ftrivial-auto-var-init=zero instead of 
> -ftrivial-auto-var-init=pattern? 

Yeah, IIRC -ftrivial-auto-var-init=zero was controversial with some
Clang maintainers because it effectively creates a language dialect.
-ftrivial-auto-var-init=pattern wasn't controversial in the same way.
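
A tiny made-up example of the dialect concern:

  int f (void)
  {
    int x;        /* never assigned by the programmer */
    return x;     /* undefined behaviour in standard C, but reliably 0
                     under -ftrivial-auto-var-init=zero, so code can come
                     to depend on the flag */
  }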

Thanks,
Richard


Re: [PATCH] Bug 111071: fix the subr with -1 to not due to the simplify.

2023-09-04 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> "yanzhang.wang--- via Gcc-patches"  writes:
>> From: Yanzhang Wang 
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/sve/acle/asm/subr_s8.c: Modify subr with -1
>> to not.
>>
>> Signed-off-by: Yanzhang Wang 
>> ---
>>
>> Tested on my local arm environment and passed.  Thanks to Andrew Pinski's
>> comment; the code is now the same as what he suggested.
>>
>>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> index b9615de6655..1cf6916a5e0 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
>> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>>  
>>  /*
>>  ** subr_m1_s8_m:
>> -**  mov (z[0-9]+\.b), #-1
>> -**  subrz0\.b, p0/m, z0\.b, \1
>> +**  not z0.b, p0/m, z0.b
>>  **  ret
>>  */
>>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,
>
> I think we need this for subr_u8.c too.  OK with that change,
> and thanks for the fix!

Actually, never mind.  I just saw a patch from Thiago Jung Bauerman
for the same issue, which is now in trunk.  Sorry for the confusion,
and thanks again for posting the fix.

Richard


Re: [PATCH] testsuite: aarch64: Adjust SVE ACLE tests to new generated code

2023-09-04 Thread Richard Sandiford via Gcc-patches
Thiago Jung Bauermann via Gcc-patches  writes:
> Since commit e7a36e4715c7 "[PATCH] RISC-V: Support simplify (-1-x) for
> vector." these tests fail on aarch64-linux:
>
>   === g++ tests ===
>
> Running g++:g++.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp ...
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_u8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu++98 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_u8_m
>
>   === gcc tests ===
>
> Running gcc:gcc.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp ...
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_s8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_s8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_FULL  
> check-function-bodies subr_m1_u8_m
> FAIL: gcc.target/aarch64/sve/acle/asm/subr_u8.c -std=gnu90 -O2 
> -fno-schedule-insns -DCHECK_ASM --save-temps -DTEST_OVERLOADS  
> check-function-bodies subr_m1_u8_m
>
> Andrew Pinski's analysis in PR testsuite/111071 is that the new code is
> better and the testcase should be updated. I also asked Prathamesh Kulkarni
> in private and he agreed.
>
> Here is the update. With this change, all tests in
> gcc.target/aarch64/sve/acle/aarch64-sve-acle-asm.exp pass on aarch64-linux.
>
> gcc/testsuite/
>   PR testsuite/111071
>   * gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c: Adjust to 
> new code.
>   * gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c: Likewise.

Thanks, pushed to trunk.  And sorry for the delay.  I somehow
missed this earlier. :(

Richard

> Suggested-by: Andrew Pinski 
> ---
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> index b9615de6655f..3e521bc9ae32 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>  
>  /*
>  ** subr_m1_s8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0\.b, p0/m, z0\.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> index 65606b6dda03..4922bdbacc47 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_u8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_u8_m_untied, svuint8_t,
>  
>  /*
>  ** subr_m1_u8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0\.b, p0/m, z0\.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_u8_m, svuint8_t,


Re: [PATCH v3] mklog: handle Signed-off-by, minor cleanup

2023-09-04 Thread Richard Sandiford via Gcc-patches
Marc Poulhiès via Gcc-patches  writes:
> Richard Sandiford via Gcc-patches  writes:
>>> +# this regex matches the first line of the "end" in the initial commit 
>>> message
>>> +FIRST_LINE_OF_END_RE = re.compile('(?i)^(signed-off-by|co-authored-by|#): 
>>> ')
>>
>> The current code only requires an initial "#", rather than an initial "#: ".
>> Is that a deliberate change?
>>
>> The patch LGTM apart from that.
>
> Hello Richard,
>
> Thanks for the review and sorry for the delayed answer as I was away the
> past weeks. This issue was catched early this month
> (https://github.com/Rust-GCC/gccrs/pull/2504), but I didn't want to send
> something here before leaving. Here's a fixed patched.
>
> Ok for master?
>
> Thanks,
> Marc
>
> ---
>  contrib/mklog.py   | 34 +-
>  contrib/prepare-commit-msg | 20 ++--
>  2 files changed, 39 insertions(+), 15 deletions(-)
>
> diff --git a/contrib/mklog.py b/contrib/mklog.py
> index 26230b9b4f2..496780883fb 100755
> --- a/contrib/mklog.py
> +++ b/contrib/mklog.py
> @@ -41,7 +41,34 @@ from unidiff import PatchSet
>  
>  LINE_LIMIT = 100
>  TAB_WIDTH = 8
> -CO_AUTHORED_BY_PREFIX = 'co-authored-by: '
> +
> +# Initial commit:
> +#   +--+
> +#   | gccrs: Some title|
> +#   |  | This is the "start"
> +#   | This is some text explaining the commit. |
> +#   | There can be several lines.  |
> +#   |  |<--->
> +#   | Signed-off-by: My Name  | This is the "end"
> +#   +--+
> +#
> +# Results in:
> +#   +--+
> +#   | gccrs: Some title|
> +#   |  |
> +#   | This is some text explaining the commit. | This is the "start"
> +#   | There can be several lines.  |
> +#   |  |<--->
> +#   | gcc/rust/ChangeLog:  |
> +#   |  | This is the 
> generated
> +#   | * some_file (bla):   | ChangeLog part
> +#   | (foo):   |
> +#   |  |<--->
> +#   | Signed-off-by: My Name  | This is the "end"
> +#   +--+
> +
> +# this regex matches the first line of the "end" in the initial commit 
> message
> +FIRST_LINE_OF_END_RE = re.compile('(?i)^(signed-off-by:|co-authored-by:|#) ')

Personally I think it would be safer to drop the final space in the regexp.

OK with that change if you agree.

Thanks,
Richard

>  
>  pr_regex = re.compile(r'(\/(\/|\*)|[Cc*!])\s+(?PPR [a-z+-]+\/[0-9]+)')
>  prnum_regex = re.compile(r'PR (?P[a-z+-]+)/(?P[0-9]+)')
> @@ -330,10 +357,7 @@ def update_copyright(data):
>  
>  
>  def skip_line_in_changelog(line):
> -if line.lower().startswith(CO_AUTHORED_BY_PREFIX) or 
> line.startswith('#'):
> -return False
> -return True
> -
> +return FIRST_LINE_OF_END_RE.match(line) == None
>  
>  if __name__ == '__main__':
>  extra_args = os.getenv('GCC_MKLOG_ARGS')
> diff --git a/contrib/prepare-commit-msg b/contrib/prepare-commit-msg
> index 48c9dad3c6f..1e94706ba40 100755
> --- a/contrib/prepare-commit-msg
> +++ b/contrib/prepare-commit-msg
> @@ -32,11 +32,11 @@ if ! [ -f "$COMMIT_MSG_FILE" ]; then exit 0; fi
>  # Don't do anything unless requested to.
>  if [ -z "$GCC_FORCE_MKLOG" ]; then exit 0; fi
>  
> -if [ -z "$COMMIT_SOURCE" ] || [ $COMMIT_SOURCE = template ]; then
> +if [ -z "$COMMIT_SOURCE" ] || [ "$COMMIT_SOURCE" = template ]; then
>  # No source or "template" means new commit.
>  cmd="diff --cached"
>  
> -elif [ $COMMIT_SOURCE = message ]; then
> +elif [ "$COMMIT_SOURCE" = message ]; then
>  # "message" means -m; assume a new commit if there are any changes 
> staged.
>  if ! git diff --cached --quiet; then
>   cmd="diff --cached"
> @@ -44,23 +44,23 @@ elif [ $COMMIT_SOURCE = message ]; then
>   cmd="diff --cached HEAD^"
>  fi
>  
> -elif [ $COMMIT_SOURCE = commit ]; t

Re: [PATCH] testsuite: Remove unwanted 'dg-do run' from gcc.dg/vect tests

2023-09-04 Thread Richard Sandiford via Gcc-patches
Christophe Lyon via Gcc-patches  writes:
> Tests under gcc.dg/vect use check_vect_support_and_set_flags to set
> compilation flags as appropriate for the target, but they also set
> dg-do-what-default to 'run' or 'compile', depending on the actual
> target hardware (or simulator) capabilities.
>
> For instance on arm, we use options to enable Neon, but set
> dg-do-what-default to 'run' only if we can actually execute Neon
> instructions.
>
> Therefore, we would always try to link and execute tests containing
> 'dg-do run', although dg-do-what-default says otherwise, leading to
> uninteresting failures.
>
> Therefore, this patch removes all such unconditional 'dg-do run',
> thus avoiding link errors, for instance if GCC has been configured with
> multilibs disabled and some --with-{float|cpu|hard} option
> incompatible with what check_vect_support_and_set_flags selects.
>
> For example, GCC configured with:
> --disable-multilib --with-mode=thumb --with-cpu=cortex-m7 --with-float=hard
> and check_vect_support_and_set_flags uses
> -mfpu=neon -mfloat-abi=softfp -march=armv7-a
> (thus incompatible float-abi options)
>
> Tested on native aarch64-linux-gnu (no change) and several arm-eabi
> cases where the FAIL/UNRESOLVED disappear (and we keep only the
> 'compilation' tests).
>
> 2023-09-04  Christophe Lyon  
>
>   gcc/testsuite/
>   * gcc.dg/vect/bb-slp-44.c: Remove 'dg-do run'.
>   * gcc.dg/vect/bb-slp-71.c: Likewise.
>   * gcc.dg/vect/bb-slp-72.c: Likewise.
>   * gcc.dg/vect/bb-slp-73.c: Likewise.
>   * gcc.dg/vect/bb-slp-74.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101207.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101615-1.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101615-2.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr101668.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr54400.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98516-1.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98516-2.c: Likewise.
>   * gcc.dg/vect/bb-slp-pr98544.c: Likewise.
>   * gcc.dg/vect/pr101445.c: Likewise.
>   * gcc.dg/vect/pr105219.c: Likewise.
>   * gcc.dg/vect/pr107160.c: Likewise.
>   * gcc.dg/vect/pr107212-1.c: Likewise.
>   * gcc.dg/vect/pr107212-2.c: Likewise.
>   * gcc.dg/vect/pr109502.c: Likewise.
>   * gcc.dg/vect/pr110381.c: Likewise.
>   * gcc.dg/vect/pr110838.c: Likewise.
>   * gcc.dg/vect/pr88497-1.c: Likewise.
>   * gcc.dg/vect/pr88497-7.c: Likewise.
>   * gcc.dg/vect/pr96783-1.c: Likewise.
>   * gcc.dg/vect/pr96783-2.c: Likewise.
>   * gcc.dg/vect/pr97558-2.c: Likewise.
>   * gcc.dg/vect/pr99253.c: Likewise.
>   * gcc.dg/vect/slp-mask-store-1.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-10.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-11.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-2.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-3.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-4.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-5.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-6.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-8.c: Likewise.
>   * gcc.dg/vect/vect-bic-bitmask-9.c: Likewise.
>   * gcc.dg/vect/vect-cond-13.c: Likewise.
>   * gcc.dg/vect/vect-recurr-1.c: Likewise.
>   * gcc.dg/vect/vect-recurr-2.c: Likewise.
>   * gcc.dg/vect/vect-recurr-3.c: Likewise.
>   * gcc.dg/vect/vect-recurr-4.c: Likewise.
>   * gcc.dg/vect/vect-recurr-5.c: Likewise.
>   * gcc.dg/vect/vect-recurr-6.c: Likewise.

OK, thanks.

Richard

> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-44.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-71.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-72.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-73.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-74.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101207.c | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101615-1.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101615-2.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101668.c | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c  | 1 -
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98516-1.c| 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98516-2.c| 2 --
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr98544.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr101445.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr105219.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr107160.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr107212-1.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr107212-2.c  | 2 --
>  gcc/testsuite/gcc.dg/vect/pr109502.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr110381.c| 1 -
>  gcc/testsuite/gcc.dg/vect/pr110838.c| 2 --
>  gcc/testsuite/gcc.dg/vect/pr88497-1.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/pr88497-7.c   | 1 -
>  gcc/testsuite/gcc.dg/vect/pr96783-1.c   | 2 --
>  gcc/testsuite/gcc.dg/vect/pr96783-2.c   | 2 --
>  

Re: [PATCH] Bug 111071: fix the subr with -1 to not due to the simplify.

2023-09-04 Thread Richard Sandiford via Gcc-patches
"yanzhang.wang--- via Gcc-patches"  writes:
> From: Yanzhang Wang 
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/sve/acle/asm/subr_s8.c: Modify subr with -1
> to not.
>
> Signed-off-by: Yanzhang Wang 
> ---
>
> Tested on my local arm environment and passed.  Thanks to Andrew Pinski's
> comment; the code is now the same as what he suggested.
>
>  gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> index b9615de6655..1cf6916a5e0 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/subr_s8.c
> @@ -76,8 +76,7 @@ TEST_UNIFORM_Z (subr_1_s8_m_untied, svint8_t,
>  
>  /*
>  ** subr_m1_s8_m:
> -**   mov (z[0-9]+\.b), #-1
> -**   subrz0\.b, p0/m, z0\.b, \1
> +**   not z0.b, p0/m, z0.b
>  **   ret
>  */
>  TEST_UNIFORM_Z (subr_m1_s8_m, svint8_t,

I think we need this for subr_u8.c too.  OK with that change,
and thanks for the fix!
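
For reference, the analogous subr_u8.c hunk would presumably end up as
something like this (illustrative only, not checked against the generated
file):

  /*
  ** subr_m1_u8_m:
  **	not	z0\.b, p0/m, z0\.b
  **	ret
  */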

Richard


Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering

2023-09-01 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> -Original Message-
>> From: Richard Sandiford 
>> Sent: Friday, September 1, 2023 2:36 PM
>> To: Tamar Christina 
>> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
>> ; Marcus Shawcroft
>> ; Kyrylo Tkachov 
>> Subject: Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering
>> 
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > In GCC-9 our scalar xorsign pattern broke and we didn't notice it
>> > because the testcase was not strong enough.  With this commit
>> >
>> > 8d2d39587d941a40f25ea0144cceb677df115040 is the first bad commit
>> > commit 8d2d39587d941a40f25ea0144cceb677df115040
>> > Author: Segher Boessenkool 
>> > Date:   Mon Oct 22 22:23:39 2018 +0200
>> >
>> > combine: Do not combine moves from hard registers
>> >
>> > combine started introducing useless moves on hard registers,  when one
>> > of the arguments to our scalar xorsign is a hardreg we get an additional 
>> > move
>> inserted.
>> >
>> > This leads to combine forming an AND with the immediate inside and
>> > using the superflous move to do the r->w move, instead of what we
>> > wanted before which was for the `and` to be a vector and and have reload
>> pick the right alternative.
>> 
>> IMO, the xorsign optab ought to go away.  IIRC it was just a stop-gap measure
>> that (like most stop-gap measures) never got cleaned up later.
>> 
>> But that's not important now. :)
>> 
>> > To fix this the patch just forces the use of the vector version
>> > directly and so combine has no chance to mess it up.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >* config/aarch64/aarch64-simd.md (xorsign3): Renamed to..
>> >(@xorsign3): ...This.
>> >* config/aarch64/aarch64.md (xorsign3): Renamed to...
>> >(@xorsign3): ..This and emit vectors directly
>> >* config/aarch64/iterators.md (VCONQ): Add SF and DF.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >* gcc.target/aarch64/xorsign.c:
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>> > b/gcc/config/aarch64/aarch64-simd.md
>> > index
>> >
>> f67eb70577d0c2d9911d8c867d38a4d0b390337c..e955691f1be8830efacc2
>> 3746511
>> > 9764ce2a4942 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -500,7 +500,7 @@ (define_expand "ctz2"
>> >}
>> >  )
>> >
>> > -(define_expand "xorsign3"
>> > +(define_expand "@xorsign3"
>> >[(match_operand:VHSDF 0 "register_operand")
>> > (match_operand:VHSDF 1 "register_operand")
>> > (match_operand:VHSDF 2 "register_operand")] diff --git
>> > a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index
>> >
>> 01cf989641fce8e6c3828f6cfef62e101c4142df..9db82347bf891f9bc40aede
>> cdc84
>> > 62c94bf1a769 100644
>> > --- a/gcc/config/aarch64/aarch64.md
>> > +++ b/gcc/config/aarch64/aarch64.md
>> > @@ -6953,31 +6953,20 @@ (define_insn "copysign3_insn"
>> >  ;; EOR   v0.8B, v0.8B, v3.8B
>> >  ;;
>> >
>> > -(define_expand "xorsign3"
>> > +(define_expand "@xorsign3"
>> >[(match_operand:GPF 0 "register_operand")
>> > (match_operand:GPF 1 "register_operand")
>> > (match_operand:GPF 2 "register_operand")]
>> >"TARGET_SIMD"
>> >  {
>> > -
>> > -  machine_mode imode = mode;
>> > -  rtx mask = gen_reg_rtx (imode);
>> > -  rtx op1x = gen_reg_rtx (imode);
>> > -  rtx op2x = gen_reg_rtx (imode);
>> > -
>> > -  int bits = GET_MODE_BITSIZE (mode) - 1;
>> > -  emit_move_insn (mask, GEN_INT (trunc_int_for_mode
>> (HOST_WIDE_INT_M1U << bits,
>> > -   imode)));
>> > -
>> > -  emit_insn (gen_and3 (op2x, mask,
>> > -  lowpart_subreg (imode, operands[2],
>> > -  mode)));
>> > -  emit_insn (gen_xor3 (op1x,
>> > -  lowpart_subreg (imode, operands[1],
>> > -  mode),
>> > -  op2x));
>> > +  rtx tmp = gen_reg_rtx (mode);  rtx op1 = gen_reg_rtx
>> > + (mode);  rtx op2 = gen_reg_rtx (mode);
>> emit_move_insn
>> > + (op1, lowpart_subreg (mode, operands[1], mode));
>> > + emit_move_insn (op2, lowpart_subreg (mode, operands[2],
>> > + mode));  emit_insn (gen_xorsign3(mode, tmp, op1,
>> op2));
>> 
>> Do we need the extra moves into op1 and op2?  I would have expected the
>> subregs to be acceptable as direct operands of the xorsign3.  Making them
>> direct operands should be better, since there's then less risk of having the
>> same value live in different registers at the same time.
>> 
>
> That was the first thing I tried but it doesn't work because validate_subreg 
> seems
> to have the invariant that you can either change mode between the same size
> or make it paradoxical but not both at the same time.
>
> i.e. it rejects subreg:V2DI (subreg:DI (reg:DF))), and lowpart_subreg folds 
> it to
> 

Re: [PATCH]AArch64 xorsign: Fix scalar xorsign lowering

2023-09-01 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> In GCC-9 our scalar xorsign pattern broke and we didn't notice it because the
> testcase was not strong enough.  With this commit
>
> 8d2d39587d941a40f25ea0144cceb677df115040 is the first bad commit
> commit 8d2d39587d941a40f25ea0144cceb677df115040
> Author: Segher Boessenkool 
> Date:   Mon Oct 22 22:23:39 2018 +0200
>
> combine: Do not combine moves from hard registers
>
> combine started introducing useless moves on hard registers,  when one of the
> arguments to our scalar xorsign is a hardreg we get an additional move 
> inserted.
>
> This leads to combine forming an AND with the immediate inside and using the
> superflous move to do the r->w move, instead of what we wanted before which 
> was
> for the `and` to be a vector and and have reload pick the right alternative.

IMO, the xorsign optab ought to go away.  IIRC it was just a stop-gap
measure that (like most stop-gap measures) never got cleaned up later.

But that's not important now. :)

> To fix this the patch just forces the use of the vector version directly and
> so combine has no chance to mess it up.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (xorsign3): Renamed to..
>   (@xorsign3): ...This.
>   * config/aarch64/aarch64.md (xorsign3): Renamed to...
>   (@xorsign3): ..This and emit vectors directly
>   * config/aarch64/iterators.md (VCONQ): Add SF and DF.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/xorsign.c:
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> f67eb70577d0c2d9911d8c867d38a4d0b390337c..e955691f1be8830efacc237465119764ce2a4942
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -500,7 +500,7 @@ (define_expand "ctz2"
>}
>  )
>  
> -(define_expand "xorsign3"
> +(define_expand "@xorsign3"
>[(match_operand:VHSDF 0 "register_operand")
> (match_operand:VHSDF 1 "register_operand")
> (match_operand:VHSDF 2 "register_operand")]
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> 01cf989641fce8e6c3828f6cfef62e101c4142df..9db82347bf891f9bc40aedecdc8462c94bf1a769
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -6953,31 +6953,20 @@ (define_insn "copysign3_insn"
>  ;; EOR   v0.8B, v0.8B, v3.8B
>  ;;
>  
> -(define_expand "xorsign3"
> +(define_expand "@xorsign3"
>[(match_operand:GPF 0 "register_operand")
> (match_operand:GPF 1 "register_operand")
> (match_operand:GPF 2 "register_operand")]
>"TARGET_SIMD"
>  {
> -
> -  machine_mode imode = mode;
> -  rtx mask = gen_reg_rtx (imode);
> -  rtx op1x = gen_reg_rtx (imode);
> -  rtx op2x = gen_reg_rtx (imode);
> -
> -  int bits = GET_MODE_BITSIZE (mode) - 1;
> -  emit_move_insn (mask, GEN_INT (trunc_int_for_mode (HOST_WIDE_INT_M1U << 
> bits,
> -  imode)));
> -
> -  emit_insn (gen_and3 (op2x, mask,
> - lowpart_subreg (imode, operands[2],
> - mode)));
> -  emit_insn (gen_xor3 (op1x,
> - lowpart_subreg (imode, operands[1],
> - mode),
> - op2x));
> +  rtx tmp = gen_reg_rtx (mode);
> +  rtx op1 = gen_reg_rtx (mode);
> +  rtx op2 = gen_reg_rtx (mode);
> +  emit_move_insn (op1, lowpart_subreg (mode, operands[1], 
> mode));
> +  emit_move_insn (op2, lowpart_subreg (mode, operands[2], 
> mode));
> +  emit_insn (gen_xorsign3(mode, tmp, op1, op2));

Do we need the extra moves into op1 and op2?  I would have expected the
subregs to be acceptable as direct operands of the xorsign3.  Making
them direct operands should be better, since there's then less risk of
having the same value live in different registers at the same time.
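
Concretely, I'd imagine something like this (untested sketch):

  rtx tmp = gen_reg_rtx (<VCONQ>mode);
  emit_insn (gen_xorsign3 (<VCONQ>mode, tmp,
			   lowpart_subreg (<VCONQ>mode, operands[1], <MODE>mode),
			   lowpart_subreg (<VCONQ>mode, operands[2], <MODE>mode)));

i.e. passing the subregs straight to the generator rather than copying
them into fresh pseudos first.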

OK with that change if it works.

Also, nit: missing space before "(".

Thanks,
Richard

>emit_move_insn (operands[0],
> -   lowpart_subreg (mode, op1x, imode));
> +   lowpart_subreg (mode, tmp, mode));
>DONE;
>  }
>  )
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 
> 9398d713044433cd89b2a83db5ae7969feb1dcf7..2451d8c2cd8e2da6ac8339eed9bc975cf203fa4c
>  100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1428,7 +1428,8 @@ (define_mode_attr VCONQ [(V8QI "V16QI") (V16QI "V16QI")
>(V4HF "V8HF") (V8HF "V8HF")
>(V2SF "V4SF") (V4SF "V4SF")
>(V2DF "V2DF") (SI   "V4SI")
> -  (HI   "V8HI") (QI   "V16QI")])
> +  (HI   "V8HI") (QI   "V16QI")
> +  (SF   "V4SF") (DF   "V2DF")])
>  
>  

Re: [PATCH 06/13] [APX EGPR] Map reg/mem constraints in inline asm to non-EGPR constraint.

2023-09-01 Thread Richard Sandiford via Gcc-patches
Uros Bizjak via Gcc-patches  writes:
> On Thu, Aug 31, 2023 at 11:18 AM Jakub Jelinek via Gcc-patches
>  wrote:
>>
>> On Thu, Aug 31, 2023 at 04:20:17PM +0800, Hongyu Wang via Gcc-patches wrote:
>> > From: Kong Lingling 
>> >
>> > In inline asm, we do not know if the insn can use EGPR, so disable EGPR
>> > usage by default from mapping the common reg/mem constraint to non-EGPR
>> > constraints. Use a flag mapx-inline-asm-use-gpr32 to enable EGPR usage
>> > for inline asm.
>> >
>> > gcc/ChangeLog:
>> >
>> >   * config/i386/i386.cc (INCLUDE_STRING): Add include for
>> >   ix86_md_asm_adjust.
>> >   (ix86_md_asm_adjust): When APX EGPR enabled without specifying the
>> >   target option, map reg/mem constraints to non-EGPR constraints.
>> >   * config/i386/i386.opt: Add option mapx-inline-asm-use-gpr32.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >   * gcc.target/i386/apx-inline-gpr-norex2.c: New test.
>> > ---
>> >  gcc/config/i386/i386.cc   |  44 +++
>> >  gcc/config/i386/i386.opt  |   5 +
>> >  .../gcc.target/i386/apx-inline-gpr-norex2.c   | 107 ++
>> >  3 files changed, 156 insertions(+)
>> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-inline-gpr-norex2.c
>> >
>> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
>> > index d26d9ab0d9d..9460ebbfda4 100644
>> > --- a/gcc/config/i386/i386.cc
>> > +++ b/gcc/config/i386/i386.cc
>> > @@ -17,6 +17,7 @@ You should have received a copy of the GNU General 
>> > Public License
>> >  along with GCC; see the file COPYING3.  If not see
>> >  .  */
>> >
>> > +#define INCLUDE_STRING
>> >  #define IN_TARGET_CODE 1
>> >
>> >  #include "config.h"
>> > @@ -23077,6 +23078,49 @@ ix86_md_asm_adjust (vec , vec & 
>> > /*inputs*/,
>> >bool saw_asm_flag = false;
>> >
>> >start_sequence ();
>> > +  /* TODO: Here we just mapped the general r/m constraints to non-EGPR
>> > +   constraints, will eventually map all the usable constraints in the 
>> > future. */
>>
>> I think there should be some constraint which explicitly has all the 32
>> GPRs, like there is one for just all 16 GPRs (h), so that regardless of
>> -mapx-inline-asm-use-gpr32 one can be explicit what the inline asm wants.
>>
>> Also, what about the "g" constraint?  Shouldn't there be another for "g"
>> without r16..r31?  What about the various other memory
>> constraints ("<", "o", ...)?
>
> I think we should leave all existing constraints as they are, so "r"
> covers only GPR16, "m" and "o" to only use GPR16. We can then
> introduce "h" to instructions that have the ability to handle EGPR.

Yeah.  I'm jumping in without having read the full thread, sorry,
but the current mechanism for handling this is TARGET_MEM_CONSTRAINT
(added for s390).  That is, TARGET_MEM_CONSTRAINT can be defined to some
new constraint that is more general than the traditional "m" constraint.
This constraint is then the one that is associated with memory_operand
etc.  "m" can then be defined explicitly to the old definition,
so that existing asms continue to work.

So if the port wants generic internal memory addresses to use the
EGPR set (sounds reasonable), then TARGET_MEM_CONSTRAINT would be
a new constraint that maps to those addresses.
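
As a very rough sketch (the constraint letter and wording here are
invented, just to show the shape of it), the port would define something
like:

  /* i386.h */
  #define TARGET_MEM_CONSTRAINT 'B'

  ;; constraints.md
  (define_memory_constraint "B"
    "Any memory operand; EGPR bases and indexes are allowed."
    (match_code "mem"))

while "m" keeps its current GPR16-only definition for the benefit of
existing asms.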

Thanks,
Richard


Re: [PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-09-01 Thread Richard Sandiford via Gcc-patches
Robin Dapp via Gcc-patches  writes:
>> It's not just a question of which byte though.  It's also a question
>> of which bit.
>> 
>> One option would be to code-generate for even X and for odd X, and select
>> between them at runtime.  But that doesn't scale well to 2+2X and 1+1X.
>> 
>> Otherwise I think we need to treat the bit position as a variable,
>> with bitpos % 8 and bitpos / 8 being calculated at runtime.
>
> Thanks.  I worked around it with a backend vec_extractQI expander
> so we don't run into that situation directly anymore.  The problem is of
> course still latent and I'm going to look at it again after some other things
> on my plate.

Yeah, sounds like a good workaround.  If the target has an efficient way
of coping with the VLAness then the optab will probably be better than
whatever the generic code ends up being.

Thanks,
Richard


[PATCH] lra: Avoid unfolded plus-0

2023-08-31 Thread Richard Sandiford via Gcc-patches
While backporting another patch to an earlier release, I hit a
situation in which lra_eliminate_regs_1 would eliminate an address to:

(plus (reg:P R) (const_int 0))

This address compared not-equal to plain:

(reg:P R)

which caused an ICE in a later peephole2.  (The ICE showed up in
gfortran.fortran-torture/compile/pr80464.f90 on the branch but seems
to be latent on trunk.)

These unfolded PLUSes shouldn't occur in the insn stream, and later code
in the same function tried to avoid them.
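
(The difference is only that, for example,

  simplify_gen_binary (PLUS, Pmode, reg, const0_rtx)

folds straight back to "reg", whereas gen_rtx_PLUS builds the unfolded
(plus (reg) (const_int 0)) form shown above.)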

Tested on aarch64-linux-gnu so far, but I'll test on x86_64-linux-gnu too.
Does this look OK?

There are probably other instances of the same thing elsewhere,
but it seemed safer to stick to the one that caused the issue.

Thanks,
Richard


gcc/
* lra-eliminations.cc (lra_eliminate_regs_1): Use simplify_gen_binary
rather than gen_rtx_PLUS.
---
 gcc/lra-eliminations.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/lra-eliminations.cc b/gcc/lra-eliminations.cc
index df613cdda76..4daaff1a124 100644
--- a/gcc/lra-eliminations.cc
+++ b/gcc/lra-eliminations.cc
@@ -406,7 +406,7 @@ lra_eliminate_regs_1 (rtx_insn *insn, rtx x, machine_mode 
mem_mode,
elimination_fp2sp_occured_p = true;
 
  if (! update_p && ! full_p)
-   return gen_rtx_PLUS (Pmode, to, XEXP (x, 1));
+   return simplify_gen_binary (PLUS, Pmode, to, XEXP (x, 1));
 
  if (maybe_ne (update_sp_offset, 0))
offset = ep->to_rtx == stack_pointer_rtx ? update_sp_offset : 0;
-- 
2.25.1



[PATCH] aarch64: Fix return register handling in untyped_call

2023-08-31 Thread Richard Sandiford via Gcc-patches
While working on another patch, I hit a problem with the aarch64
expansion of untyped_call.  The expander emits the usual:

  (set (mem ...) (reg resN))

instructions to store the result registers to memory, but it didn't
say in RTL where those resN results came from.  This eventually led
to a failure of gcc.dg/torture/stackalign/builtin-return-2.c,
via regrename.

This patch turns the untyped call from a plain call to a call_value,
to represent that the call returns (or might return) a useful value.
The patch also uses a PARALLEL return rtx to represent all the possible
return registers.
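
For example, if the result registers happened to be x0 (DImode) and v0
(TFmode) -- purely illustrative choices -- the value part of the call
would look roughly like:

  (parallel [(expr_list (reg:DI x0) (const_int 0))
             (expr_list (reg:TF v0) (const_int 8))])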

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
* config/aarch64/aarch64.md (untyped_call): Emit a call_value
rather than a call.  List each possible destination register
in the call pattern.
---
 gcc/config/aarch64/aarch64.md | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 01cf989641f..6f7827bd8c9 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1170,9 +1170,27 @@ (define_expand "untyped_call"
 {
   int i;
 
+  /* Generate a PARALLEL that contains all of the register results.
+ The offsets are somewhat arbitrary, since we don't know the
+ actual return type.  The main thing we need to avoid is having
+ overlapping byte ranges, since those might give the impression
+ that two registers are known to have data in common.  */
+  rtvec rets = rtvec_alloc (XVECLEN (operands[2], 0));
+  poly_int64 offset = 0;
+  for (i = 0; i < XVECLEN (operands[2], 0); i++)
+{
+  rtx reg = SET_SRC (XVECEXP (operands[2], 0, i));
+  gcc_assert (REG_P (reg));
+  rtx offset_rtx = gen_int_mode (offset, Pmode);
+  rtx piece = gen_rtx_EXPR_LIST (VOIDmode, reg, offset_rtx);
+  RTVEC_ELT (rets, i) = piece;
+  offset += GET_MODE_SIZE (GET_MODE (reg));
+}
+  rtx ret = gen_rtx_PARALLEL (VOIDmode, rets);
+
   /* Untyped calls always use the default ABI.  It's only possible to use
  ABI variants if we know the type of the target function.  */
-  emit_call_insn (gen_call (operands[0], const0_rtx, const0_rtx));
+  emit_call_insn (gen_call_value (ret, operands[0], const0_rtx, const0_rtx));
 
   for (i = 0; i < XVECLEN (operands[2], 0); i++)
 {
-- 
2.25.1



Re: Question about dynamic choosing vectorization factor for RVV

2023-08-31 Thread Richard Sandiford via Gcc
"juzhe.zh...@rivai.ai"  writes:
> Thanks Richi.
>
> I am trying to figure out how to adjust finish_cost to lower the LMUL
>
> For example:
>
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
> a[i] = a[i] + b[i];
> }
>
> preferred_simd_mode pick LMUL = 8 (RVVM8SImode)

But is the LMUL decided by the mode?  Like Richard says, the vectoriser
already provides a way of trying vectorisation with different modes and
picking the best one, via autovectorize_vector_modes, VECT_COMPARE_COST,
and the cost structures.  preferred_simd_mode then just picks the first
mode to try -- the choice isn't final.

The idea is that you get to see what vectorisation looks like with
multiple mode choices, and can pick the best one.
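
Very roughly (the function name, mode list and ordering below are just
made up for illustration), the hook would look something like:

  static unsigned int
  riscv_autovectorize_vector_modes (vector_modes *modes, bool)
  {
    /* Offer several LMUL choices; the vectoriser costs each in turn
       and picks the cheapest when VECT_COMPARE_COSTS is returned.  */
    modes->safe_push (RVVM8SImode);
    modes->safe_push (RVVM4SImode);
    modes->safe_push (RVVM2SImode);
    modes->safe_push (RVVM1SImode);
    return VECT_COMPARE_COSTS;
  }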

It's not clear from your reply whether you've tried that or not.

Thanks,
Richard


Re: [PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-08-30 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
>> But in the VLA case, doesn't it instead have precision 4+4X?
>> The problem then is that we can't tell at compile time which
>> byte that corresponds to.  So...
>
> Yes 4 + 4x.  I keep getting confused with poly modes :)
> In this case we want to extract the bitnum [3 4] = 3 + 4x which
> would be in byte 0 for x = 0 or x = 1 and in byte 1 for x = 2, 3 and
> so on.
>
> Can't we still make that work somehow?  As far as I can tell we're looking
> for the byte range to be accessed.  It's not like we have a precision or
> bitnum of e.g. [3 17] where the access could be anywhere but still a pow2
> fraction of BITS_PER_UNIT.
>
> I'm just having trouble writing that down.
>
> What about something like
>
> int factor = BITS_PER_UINT / prec.coeffs[0];
> bytenum = force_align_down_and_div (bitnum, prec.coeffs[0]);
> bytenum *= factor;
>
> (or a similar thing done manually without helpers) guarded by the
> proper condition?
> Or do we need something more generic for the factor (i.e. prec.coeffs[0])
> is not enough when we have a precision like [8 16]? Does that even exist?.

It's not just a question of which byte though.  It's also a question
of which bit.

One option would be to code-generate for even X and for odd X, and select
between them at runtime.  But that doesn't scale well to 2+2X and 1+1X.

Otherwise I think we need to treat the bit position as a variable,
with bitpos % 8 and bitpos / 8 being calculated at runtime.

Thanks,
Richard




RE: [PATCH] expmed: Allow extract_bit_field via mem for low-precision modes.

2023-08-30 Thread Richard Sandiford via Gcc-patches
[Sorry for any weird MUA issues, don't have access to my usual set-up.]

> when looking at a riscv ICE in vect-live-6.c I noticed that we
> assume that the variable part (coeffs[1] * x1) of the to-be-extracted
> bit number in extract_bit_field_1 is a multiple of BITS_PER_UNIT.
>
> This means that bits_to_bytes_round_down and num_trailing_bits
> cannot handle e.g. extracting from a "VNx4BI"-mode vector which has
> 4-bit precision on riscv.

But in the VLA case, doesn't it instead have precision 4+4X?
The problem then is that we can't tell at compile time which
byte that corresponds to.  So...

> This patch adds a special case for that situation and sets bytenum to
> zero as well as bitnum to its proper value.  It works for the riscv
> case because in all other situations we can align to a byte boundary.
> If x1 were 3 for some reason, however, the above assertion would still
> fail.  I don't think this can happen for riscv as we only ever double
> the number of chunks for larger vector sizes but not sure about the
> general case.
>
> If there's another, correct way to work around feel free to suggest.
>
> Bootstrap/testsuite on aarch64 and x86 is running but I would be
> surprised if there were any changes as riscv is the only target that
> uses modes with precision < 8.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
>   * expmed.cc (extract_bit_field_1): Handle bitnum with variable
>   part less than BITS_PER_UNIT.
> ---
>  gcc/expmed.cc | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/expmed.cc b/gcc/expmed.cc
> index e22e43c8505..1b0119f9cfc 100644
> --- a/gcc/expmed.cc
> +++ b/gcc/expmed.cc
> @@ -1858,8 +1858,22 @@ extract_bit_field_1 (rtx str_rtx, poly_uint64 bitsize, 
> poly_uint64 bitnum,
>   but is useful for things like vector booleans.  */
>if (MEM_P (op0) && !bitnum.is_constant ())
>  {
> -  bytenum = bits_to_bytes_round_down (bitnum);
> -  bitnum = num_trailing_bits (bitnum);
> +  /* bits_to_bytes_round_down tries to align to a byte (BITS_PER_UNIT)
> +  boundary and asserts that bitnum.coeffs[1] % BITS_PER_UNIT == 0.
> +  For modes with precision < BITS_PER_UNIT this fails but we can
> +  still extract from the first byte.  */
> +  poly_uint16 prec = GET_MODE_PRECISION (outermode);
> +  if (prec.coeffs[1] < BITS_PER_UNIT && bitnum.coeffs[1] < BITS_PER_UNIT)
> + {
> +   bytenum = 0;
> +   bitnum = bitnum.coeffs[0] & (BITS_PER_UNIT - 1);

...this doesn't look right.  We can't drop bitnum.coeffs[1] when it's
nonzero, because it says that for some runtime vector sizes, the bit
position might be higher than bitnum.coeffs[0].

Also, it's not possible to access coeffs[1] unconditionally in
target-independent code.

Thanks,
Richard

> + }
> +  else
> + {
> +   bytenum = bits_to_bytes_round_down (bitnum);
> +   bitnum = num_trailing_bits (bitnum);
> + }
> +
>poly_uint64 bytesize = bits_to_bytes_round_up (bitnum + bitsize);
>op0 = adjust_bitfield_address_size (op0, BLKmode, bytenum, bytesize);
>op0_mode = opt_scalar_int_mode ();



[PATCH] attribs: Use existing traits for excl_hash_traits

2023-08-29 Thread Richard Sandiford via Gcc-patches
excl_hash_traits can be defined more simply by reusing existing traits.

Tested on aarch64-linux-gnu.  OK to install?

Richard


gcc/
* attribs.cc (excl_hash_traits): Delete.
(test_attribute_exclusions): Use pair_hash and nofree_string_hash
instead.
---
 gcc/attribs.cc | 45 +++--
 1 file changed, 3 insertions(+), 42 deletions(-)

diff --git a/gcc/attribs.cc b/gcc/attribs.cc
index b8cb55b97df..0d4ab23aeb6 100644
--- a/gcc/attribs.cc
+++ b/gcc/attribs.cc
@@ -2640,47 +2640,6 @@ namespace selftest
 
 typedef std::pair excl_pair;
 
-struct excl_hash_traits: typed_noop_remove
-{
-  typedef excl_pair  value_type;
-  typedef value_type compare_type;
-
-  static hashval_t hash (const value_type )
-  {
-hashval_t h1 = htab_hash_string (x.first);
-hashval_t h2 = htab_hash_string (x.second);
-return h1 ^ h2;
-  }
-
-  static bool equal (const value_type , const value_type )
-  {
-return !strcmp (x.first, y.first) && !strcmp (x.second, y.second);
-  }
-
-  static void mark_deleted (value_type )
-  {
-x = value_type (NULL, NULL);
-  }
-
-  static const bool empty_zero_p = false;
-
-  static void mark_empty (value_type )
-  {
-x = value_type ("", "");
-  }
-
-  static bool is_deleted (const value_type )
-  {
-return !x.first && !x.second;
-  }
-
-  static bool is_empty (const value_type )
-  {
-return !*x.first && !*x.second;
-  }
-};
-
-
 /* Self-test to verify that each attribute exclusion is symmetric,
meaning that if attribute A is encoded as incompatible with
attribute B then the opposite relationship is also encoded.
@@ -2690,13 +2649,15 @@ struct excl_hash_traits: typed_noop_remove
 static void
 test_attribute_exclusions ()
 {
+  using excl_hash_traits = pair_hash;
+
   /* Iterate over the array of attribute tables first (with TI0 as
  the index) and over the array of attribute_spec in each table
  (with SI0 as the index).  */
   const size_t ntables = ARRAY_SIZE (attribute_tables);
 
   /* Set of pairs of mutually exclusive attributes.  */
-  typedef hash_set exclusion_set;
+  typedef hash_set exclusion_set;
   exclusion_set excl_set;
 
   for (size_t ti0 = 0; ti0 != ntables; ++ti0)
-- 
2.25.1



Re: [PATCH] fwprop: Allow UNARY_P and check register pressure.

2023-08-29 Thread Richard Sandiford via Gcc-patches
Jeff Law  writes:
> On 8/24/23 08:06, Robin Dapp via Gcc-patches wrote:
>> Ping.  I refined the code and some comments a bit and added a test
>> case.
>> 
>> My question in general would still be:  Is this something we want
>> given that we potentially move some of combine's work a bit towards
>> the front of the RTL pipeline?
>> 
>> Regards
>>   Robin
>> 
>> Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.
>> 
>> This patch enables the forwarding of UNARY_P sources.  As this
>> involves potentially replacing a vector register with a scalar register
>> the ira_hoist_pressure machinery is used to calculate the change in
>> register pressure.  If the propagation would increase the pressure
>> beyond the number of hard regs, we don't perform it.
>> 
>> gcc/ChangeLog:
>> 
>>  * fwprop.cc (fwprop_propagation::profitable_p): Add unary
>>  handling.
>>  (fwprop_propagation::update_register_pressure): New function.
>>  (fwprop_propagation::register_pressure_high_p): New function
>>  (reg_single_def_for_src_p): Look through unary expressions.
>>  (try_fwprop_subst_pattern): Check register pressure.
>>  (forward_propagate_into): Call new function.
>>  (fwprop_init): Init register pressure.
>>  (fwprop_done): Clean up register pressure.
>>  (fwprop_insn): Add comment.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> So I was hoping that Richard S. would chime in here as he knows this 
> code better than anyone.

Heh, I'm not sure about that.  I rewrote the code to use rtl-ssa,
so in that sense I'm OK with the framework side.  But I tried to
preserve the decisions that the old pass made as closely as possible.
I don't know why most of those decisions were made (which is why I just
kept them).

So I don't think I have a good feel for the advantages and disadvantages
of doing this.  Robin's analysis of the aarch64 changes was nice and
detailed though.  I think the one that worries me most is the addressing
mode one.  fwprop is probably the first chance we get to propagate adds
into addresses, and virtual register elimination means that some of
those opportunities won't show up in gimple.

There again, virtual register elimination wouldn't be the reason for
the ld4_s8.c failure.  Perhaps there's something missing in expand.

Other than that, I think my main question is: why just unary operations?
Is the underlying assumption that we only want to propagate a maximum of
one register?  If so, then I think we should check for that directly, by
iterating over subrtxes.

That way we can handle things like binary operations involving a
register and a constant, and unspecs with a single non-constant operand.

I imagine the check would be something like:

  unsigned int nregs = 0;
  for (each subrtx x)
{
  if (MEM_P (x))
return false;
  if (SUBREG_P (x) && .../*current conditions */...)
return false;
  if (REG_P (x))
{
  nregs += 1;
  if (nregs > 1)
return false;
}
}
  return true;
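
In concrete terms that would be something like the following (untested,
with "src" being the value to be propagated and acceptable_subreg_p
standing in for whatever the current subreg conditions are):

  subrtx_iterator::array_type array;
  unsigned int nregs = 0;
  FOR_EACH_SUBRTX (iter, array, src, NONCONST)
    {
      const_rtx x = *iter;
      if (MEM_P (x))
	return false;
      if (SUBREG_P (x) && !acceptable_subreg_p (x))
	return false;
      if (REG_P (x) && ++nregs > 1)
	return false;
    }
  return true;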

Perhaps we should allow the optimisation without register-pressure
information if (a) the source register and destination register are
in the same pressure class and (b) all uses of the destination are
being replaced.  (FWIW, rtl-ssa should make it easier to try to
replace all definitions at once, with an all-or-nothing choice,
if we ever wanted to do that.)

Thanks,
Richard

>
> This looks like a much better implementation of something I've done 
> before :-)  Basically imagine a target where a sign/zero extension can 
> be folded into arithmetic for free.  We put in various hacks to this 
> code to encourage more propagations of extensions.
>
> I still think this is valuable.  As we lower from gimple->RTL we're 
> going to still have artifacts in the RTL that we're going to want to 
> optimize away.  fwprop has certain advantages over combine, including 
> the fact that it runs earlier, pre-loop.
>
>
> It looks generally sensible to me.  But give Richard S. another week to 
> chime in.  He seems to be around, but may be slammed with stuff right now.
>
> jeff


Re: [RFC] > WIDE_INT_MAX_PREC support in wide-int

2023-08-29 Thread Richard Sandiford via Gcc-patches
Just some off-the-cuff thoughts.  Might think differently when
I've had more time...

Richard Biener  writes:
> On Mon, 28 Aug 2023, Jakub Jelinek wrote:
>
>> Hi!
>> 
>> While the _BitInt series isn't committed yet, I had a quick look at
>> lifting the current lowest limitation on maximum _BitInt precision,
>> that wide_int can only support wide_int until WIDE_INT_MAX_PRECISION - 1.
>> 
>> Note, other limits if that is lifted are INTEGER_CST currently using 3
>> unsigned char members and so being able to only hold up to 255 * 64 = 16320
>> bit numbers and then TYPE_PRECISION being 16-bit, so limiting us to 65535
>> bits.  The INTEGER_CST limit could be dealt with by dropping the
>> int_length.offset "cache" and making int_length.extended and
>> int_length.unextended members unsinged short rather than unsigned char.
>> 
>> The following so far just compile tested patch changes wide_int_storage
>> to be a union, for precisions up to WIDE_INT_MAX_PRECISION inclusive it
>> will work as before (just being no longer trivially copyable type and
>> having an inline destructor), while larger precision instead use a pointer
>> to heap allocated array.
>> For wide_int this is fairly easy (of course, I'd need to see what the
>> patch does to gcc code size and compile time performance, some
>> growth/slowdown is certain), but I'd like to brainstorm on
>> widest_int/widest2_int.
>> 
>> Currently it is a constant precision storage with WIDE_INT_MAX_PRECISION
>> precision (widest2_int twice that), so memory layout-wide on at least 64-bit
>> hosts identical to wide_int, just it doesn't have precision member and so
>> 32 bits smaller on 32-bit hosts.  It is used in lots of places.
>> 
>> I think the most common is what is done e.g. in tree_int_cst* comparisons
>> and similarly, using wi::to_widest () to just compare INTEGER_CSTs.
>> That case actually doesn't even use wide_int but widest_extended_tree
>> as storage, unless stored into widest_int in between (that happens in
>> various spots as well).  For comparisons, it would be fine if
>> widest_int_storage/widest_extended_tree storages had a dynamic precision,
>> WIDE_INT_MAX_PRECISION for most of the cases (if only
>> precision < WIDE_INT_MAX_PRECISION is involved), otherwise the needed
>> precision (e.g. for binary ops) which would be what we say have in
>> INTEGER_CST or some type, rounded up to whole multiples of HOST_WIDE_INTs
>> and if unsigned with multiple of HOST_WIDE_INT precision, have another
>> HWI to make it always sign-extended.
>> 
>> Another common case is how e.g. tree-ssa-ccp.cc uses them, that is mostly
>> for bitwise ops and so I think the above would be just fine for that case.
>> 
>> Another case is how tree-ssa-loop-niter.cc uses it, I think for such a usage
>> it really wants something widest, perhaps we could just try to punt for
>> _BitInt(N) for N >= WIDE_INT_MAX_PRECISION in there, so that we never care
>> about bits beyond that limit?
>
> I'll note tree-ssa-loop-niter.cc also uses GMP in some cases, widest_int
> is really trying to be poor-mans GMP by limiting the maximum precision.

I'd characterise widest_int as "a wide_int that is big enough to hold
all supported integer types, without losing sign information".  It's
not big enough to do arbitrary arithmetic without losing precision
(in the way that GMP is).

If the new limit on integer sizes is 65535 bits for all targets,
then I think that means that widest_int needs to become a 65536-bit type.
(But not with all bits represented all the time, of course.)

[ And at that point I think widest_int should ideally become a GMP wrapper.
  The wide_int stuff isn't optimised for such large sizes, even accepting
  that large sizes will be a worst case.  That might not be easy to do with
  the current infrastructure though.  Especially not if widest_ints are
  stored in GC-ed structures. ]

That seems like it would stand the biggest chance of preserving
existing semantics.  But we might want to define new typedefs for
narrower limits.  E.g. the current widest_int limit probably still
makes sense for operations on scalar_int_modes.  (But then most
RTL arithmetic should use wide_int rather than widest_int.)

Perhaps some widest_int uses are really restricted to address-like
things and could instead use offset_int.  Until now there hasn't been
much incentive to make the distinction.

And perhaps we could identify other similar cases where the limit is
known (statically) to be the current limit, rather than 65536.

I think one of the worst things we could do is push the requirement
up to users of the API to have one path for _BitInts and one for "normal"
integers.  That's bound to lead to a whack-a-mole effect.

Thanks,
Richard


Re: [PATCH V2] gimple_fold: Support COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold

2023-08-24 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> Hi, Richard and Richi.
>
> Currently, GCC support COND_LEN_FMA for floating-point **NO** -ffast-math.
> It's supported in tree-ssa-math-opts.cc. However, GCC failed to support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS.
>
> Consider this following case:
> #define TEST_TYPE(TYPE)   
>  \
>   __attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst,   
>  \
> TYPE *__restrict a,  \
> TYPE *__restrict b, int n)   \
>   {   
>  \
> for (int i = 0; i < n; i++)   
>  \
>   dst[i] -= a[i] * b[i];   \
>   }
>
> #define TEST_ALL()
>  \
>   TEST_TYPE (float)   
>  \
>
> TEST_ALL ()
>
> Gimple IR for RVV:
>
> ...
> _39 = -vect__8.14_26;
> vect__10.16_21 = .COND_LEN_FMA ({ -1, ... }, vect__6.11_30, _39, 
> vect__4.8_34, vect__4.8_34, _46, 0);
> ...
>
> This is because this following piece of codes in tree-ssa-math-opts.cc:
>
>   if (len)
>   fma_stmt
> = gimple_build_call_internal (IFN_COND_LEN_FMA, 7, cond, mulop1, op2,
>   addop, else_value, len, bias);
>   else if (cond)
>   fma_stmt = gimple_build_call_internal (IFN_COND_FMA, 5, cond, mulop1,
>  op2, addop, else_value);
>   else
>   fma_stmt = gimple_build_call_internal (IFN_FMA, 3, mulop1, op2, addop);
>   gimple_set_lhs (fma_stmt, gimple_get_lhs (use_stmt));
>   gimple_call_set_nothrow (fma_stmt, !stmt_can_throw_internal (cfun,
>  use_stmt));
>   gsi_replace (, fma_stmt, true);
>   /* Follow all SSA edges so that we generate FMS, FNMA and FNMS
>regardless of where the negation occurs.  */
>   gimple *orig_stmt = gsi_stmt (gsi);
>   if (fold_stmt (, follow_all_ssa_edges))
>   {
> if (maybe_clean_or_replace_eh_stmt (orig_stmt, gsi_stmt (gsi)))
>   gcc_unreachable ();
> update_stmt (gsi_stmt (gsi));
>   }
>
> 'fold_stmt' failed to fold NEGATE_EXPR + COND_LEN_FMA > COND_LEN_FNMA.
>
> This patch support STMT fold into:
>
> vect__10.16_21 = .COND_LEN_FNMA ({ -1, ... }, vect__8.14_26, vect__6.11_30, 
> vect__4.8_34, { 0.0, ... }, _46, 0);
>
> Note that COND_LEN_FNMA has 7 arguments and COND_LEN_ADD has 6 arguments.
>
> Extend maximum num ops:
> -  static const unsigned int MAX_NUM_OPS = 5;
> +  static const unsigned int MAX_NUM_OPS = 7;
>
> Bootstrap and Regtest on X86 passed.
> Tested on aarch64 Qemu.
>
> Fully tested COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS on RISC-V backend.
>
>
> gcc/ChangeLog:
>
> * genmatch.cc (decision_tree::gen): Support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
> * gimple-match-exports.cc (gimple_simplify): Ditto.
> (gimple_resimplify6): New function.
> (gimple_resimplify7): New function.
> (gimple_match_op::resimplify): Support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
> (convert_conditional_op): Ditto.
> (build_call_internal): Ditto.
> (try_conditional_simplification): Ditto.
> (gimple_extract): Ditto.
> * gimple-match.h (gimple_match_cond::gimple_match_cond): Ditto.
> * internal-fn.cc (CASE): Ditto.

OK, thanks.

Richard

>
> ---
>  gcc/genmatch.cc |   2 +-
>  gcc/gimple-match-exports.cc | 123 ++--
>  gcc/gimple-match.h  |  16 -
>  gcc/internal-fn.cc  |   7 +-
>  4 files changed, 138 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
> index f46d2e1520d..a1925a747a7 100644
> --- a/gcc/genmatch.cc
> +++ b/gcc/genmatch.cc
> @@ -4052,7 +4052,7 @@ decision_tree::gen (vec  , bool gimple)
>  }
>fprintf (stderr, "removed %u duplicate tails\n", rcnt);
>  
> -  for (unsigned n = 1; n <= 5; ++n)
> +  for (unsigned n = 1; n <= 7; ++n)
>  {
>bool has_kids_p = false;
>  
> diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
> index 7aeb4ddb152..b36027b0bad 100644
> --- a/gcc/gimple-match-exports.cc
> +++ b/gcc/gimple-match-exports.cc
> @@ -60,6 +60,12 @@ extern bool gimple_simplify (gimple_match_op *, gimple_seq 
> *, tree (*)(tree),
>code_helper, tree, tree, tree, tree, tree);
>  extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree (*)(tree),
>code_helper, tree, tree, tree, tree, tree, tree);
> +extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree (*)(tree),
> +  code_helper, tree, tree, 

Re: [PATCH] RISC-V: Add conditional unary neg/abs/not autovec patterns

2023-08-24 Thread Richard Sandiford via Gcc-patches
Jeff Law  writes:
> On 8/22/23 02:08, juzhe.zh...@rivai.ai wrote:
>> Yes, I agree long-term we want every-thing be optimized as early as 
>> possible.
>> 
>> However, IMHO, it's impossible we can support every conditional patterns 
>> in the middle-end (match.pd).
>> It's a really big number.
>> 
>> For example, for sign_extend conversion, we have vsext.vf2 (vector SI -> 
>> vector DI),... vsext.vf4 (vector HI -> vector DI), vsext.vf8 (vector QI 
>> -> vector DI)..
>> Not only the conversion, every auto-vectorization patterns can have 
>> conditional format.
>> For example, abs,..rotate, sqrt, floor, ceil,etc.
>> I bet it could be over 100+ conditional optabs/internal FNs. It's huge 
>> number.
>> I don't see necessity that we should support them in middle-end 
>> (match.pd) since we known RTL back-end combine PASS can do the good job 
>> here.
>> 
>> Besides, LLVM doesn't such many conditional pattern. LLVM just has "add" 
>> and "select" separate IR then do the combine in the back-end:
>> https://godbolt.org/z/rYcMMG1eT 
>> 
>> You can see LLVM didn't do the op + select optimization in generic IR, 
>> they do the optimization in combine PASS.
>> 
>> So I prefer this patch solution and apply such solution for the future 
>> more support : sign extend, zero extend, float extend, abs, sqrt, ceil, 
>> floor, etc.
> It's certainly got the potential to get out of hand.  And it's not just 
> the vectorizer operations.  I know of an architecture that can execute 
> most of its ALU and loads/stores conditionally (not predication, but 
> actual conditional ops) like target  = (x COND Y) ? a << b ; a)
>
> I'd tend to lean towards synthesizing these conditional ops around a 
> conditional move/select primitive in gimple through the RTL expanders. 
> That would in turn set things up so that if the target had various 
> conditional operations like conditional shift it could be trivially 
> discovered by the combiner.

FWIW, one of the original motivations behind the COND_* internal
functions was to represent the fact that the operation is suppressed
(rather than being performed and discarded) when the predicate is false.
This allows if-conversion for FP operations even in strict FP modes,
since inactive lanes are guaranteed not to generate an exception.
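
As a rough illustration: for something like "if (c[i]) x[i] = a[i] / b[i];",
if-conversion can emit a conditional ifn instead of a division followed
by a select:

  _1 = c[i] != 0;
  x[i] = .COND_RDIV (_1, a[i], b[i], x[i]);  /* inactive lanes keep x[i] */

so the division genuinely isn't performed for false lanes and can't raise
a spurious FP exception, even with trapping math enabled.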

I think it makes sense to add COND_* functions for anything that can
reasonably be done on FP types, and that could generate an FP exception.
E.g. sqrt was one of the examples mentioned, and I think COND_SQRT is
something that we should have.

I agree it's less clear-cut for purely integer stuff, or for FP operations
like neg and abs that are pure bit manipulation.  But perhaps there's a
question of how many operations are only defined for integers, and
whether the number is high enough for them to be treated differently.

I wouldn't have expected an explosion of operations to be a significant
issue, since (a) the underlying infrastructure is pretty mechanical and
(b) any operation that a target supports is going to need an .md pattern
whatever happens.

Thanks,
Richard


Re: [PATCH 03/11] aarch64: Use br instead of ret for eh_return

2023-08-24 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Rather than hiding this in target code, perhaps we should add a
> target-independent concept of an "eh_return taken" flag, say
> EH_RETURN_TAKEN_RTX.
>
> We could define it so that, on targets that define EH_RETURN_TAKEN_RTX,
> a register EH_RETURN_STACKADJ_RTX and a register EH_RETURN_HANDLER_RTX
> are only meaningful when the flag is true.  E.g. we could have:
>
> #ifdef EH_RETURN_HANDLER_RTX

Gah, I meant #ifdef EH_RETURN_TAKEN_RTX here

>   for (rtx tmp : { EH_RETURN_STACKADJ_RTX, EH_RETURN_HANDLER_RTX })
> if (tmp && REG_P (tmp))
>   emit_clobber (tmp);
> #endif
>
> in the "normal return" part of expand_eh_return.  (If some other target
> wants a flag with different semantics, it'd be up to them to add it.)
>
> That should avoid most of the bad code-quality effects, since the
> specialness of x4-x6 will be confined to the code immediately before
> the pre-epilogue exit edges.
>
> Thanks,
> Richard


Re: [PATCH] tree-optimization/111115 - SLP of masked stores

2023-08-24 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> The following adds the capability to do SLP on .MASK_STORE, I do not
> plan to add interleaving support.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?

LGTM, thanks.

Richard

> Thanks,
> Richard.
>
>   PR tree-optimization/15
> gcc/
>   * tree-vectorizer.h (vect_slp_child_index_for_operand): New.
>   * tree-vect-data-refs.cc (can_group_stmts_p): Also group
>   .MASK_STORE.
>   * tree-vect-slp.cc (arg3_arg2_map): New.
>   (vect_get_operand_map): Handle IFN_MASK_STORE.
>   (vect_slp_child_index_for_operand): New function.
>   (vect_build_slp_tree_1): Handle statements with no LHS,
>   masked store ifns.
>   (vect_remove_slp_scalar_calls): Likewise.
>   * tree-vect-stmts.c (vect_check_store_rhs): Lookup the
>   SLP child corresponding to the ifn value index.
>   (vectorizable_store): Likewise for the mask index.  Support
>   masked stores.
>   (vectorizable_load): Lookup the SLP child corresponding to the
>   ifn mask index.
>
> gcc/testsuite/
>   * lib/target-supports.exp (check_effective_target_vect_masked_store):
>   Supported with check_avx_available.
>   * gcc.dg/vect/slp-mask-store-1.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c | 39 +
>  gcc/testsuite/lib/target-supports.exp|  3 +-
>  gcc/tree-vect-data-refs.cc   |  3 +-
>  gcc/tree-vect-slp.cc | 46 +---
>  gcc/tree-vect-stmts.cc   | 23 +-
>  gcc/tree-vectorizer.h|  1 +
>  6 files changed, 94 insertions(+), 21 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c 
> b/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
> new file mode 100644
> index 000..50b7066778e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-mask-store-1.c
> @@ -0,0 +1,39 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-mavx2" { target avx2 } } */
> +
> +#include "tree-vect.h"
> +
> +void __attribute__((noipa))
> +foo (unsigned * __restrict x, int * __restrict flag)
> +{
> +  for (int i = 0; i < 32; ++i)
> +{
> +  if (flag[2*i+0])
> +x[2*i+0] = x[2*i+0] + 3;
> +  if (flag[2*i+1])
> +x[2*i+1] = x[2*i+1] + 177;
> +}
> +}
> +
> +unsigned x[16];
> +int flag[32] = { 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0,
> + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> +unsigned res[16] = { 3, 177, 0, 0, 0, 177, 3, 0, 3, 177, 0, 0, 0, 177, 3, 0 
> };
> +
> +int
> +main ()
> +{
> +  check_vect ();
> +
> +  foo (x, flag);
> +
> +  if (__builtin_memcmp (x, res, sizeof (x)) != 0)
> +abort ();
> +  for (int i = 0; i < 32; ++i)
> +if (flag[i] != 0 && flag[i] != 1)
> +  abort ();
> +
> +  return 0;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 1 "vect" { target { 
> vect_masked_store && vect_masked_load } } } } */
> diff --git a/gcc/testsuite/lib/target-supports.exp 
> b/gcc/testsuite/lib/target-supports.exp
> index d4623ee6b45..d353cc0aaf0 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -8400,7 +8400,8 @@ proc check_effective_target_vect_masked_load { } {
>  # Return 1 if the target supports vector masked stores.
>  
>  proc check_effective_target_vect_masked_store { } {
> -return [expr { [check_effective_target_aarch64_sve]
> +return [expr { [check_avx_available]
> +|| [check_effective_target_aarch64_sve]
>  || [istarget amdgcn*-*-*] }]
>  }
>  
> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> index 3e9a284666c..a2caf6cb1c7 100644
> --- a/gcc/tree-vect-data-refs.cc
> +++ b/gcc/tree-vect-data-refs.cc
> @@ -3048,8 +3048,7 @@ can_group_stmts_p (stmt_vec_info stmt1_info, 
> stmt_vec_info stmt2_info,
>like those created by build_mask_conversion.  */
>tree mask1 = gimple_call_arg (call1, 2);
>tree mask2 = gimple_call_arg (call2, 2);
> -  if (!operand_equal_p (mask1, mask2, 0)
> -  && (ifn == IFN_MASK_STORE || !allow_slp_p))
> +  if (!operand_equal_p (mask1, mask2, 0) && !allow_slp_p)
>   {
> mask1 = strip_conversion (mask1);
> if (!mask1)
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index b5f9333fc22..cc799b6ebcd 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -503,6 +503,7 @@ static const int cond_expr_maps[3][5] = {
>  static const int arg1_map[] = { 1, 1 };
>  static const int arg2_map[] = { 1, 2 };
>  static const int arg1_arg4_map[] = { 2, 1, 4 };
> +static const int arg3_arg2_map[] = { 2, 3, 2 };
>  static const int op1_op0_map[] = { 2, 1, 0 };
>  
>  /* For most SLP statements, there is a one-to-one mapping between
> @@ -543,6 +544,9 @@ vect_get_operand_map (const gimple *stmt, unsigned char 
> swap = 0)
> case 

[PATCH] aarch64: Account for different Advanced SIMD fusing options

2023-08-24 Thread Richard Sandiford via Gcc-patches
The scalar FNMADD/FNMSUB and SVE FNMLA/FNMLS instructions mean
that either side of a subtraction can start an accumulator chain.
However, Advanced SIMD doesn't have an equivalent instruction.
This means that, for Advanced SIMD, a subtraction can only be
fused if the second operand is a multiplication.

Also, if both sides of a subtraction are multiplications,
and if the second operand is used multiple times, such as:

 c * d - a * b
 e * f - a * b

then the first rather than second multiplication operand will tend
to be fused.  On Advanced SIMD, this leads to:

 tmp1 = a * b
 tmp2 = -tmp1
  ... = tmp2 + c * d   // FMLA
  ... = tmp2 + e * f   // FMLA

where one of the FMLAs also requires a MOV.

This patch tries to account for this in the vector cost model.
It improves roms performance by 2-3% on Neoverse V1.  It's also
needed to avoid a regression in fotonik for Neoverse N2 and
Neoverse V2 with the patch for PR110625.

Tested on aarch64-linux-gnu & pushed.

Richard


gcc/
* config/aarch64/aarch64.cc: Include ssa.h.
(aarch64_multiply_add_p): Require the second operand of an
Advanced SIMD subtraction to be a multiplication.  Assume that
such an operation won't be fused if the second operand is used
multiple times and if the first operand is also a multiplication.

gcc/testsuite/
* gcc.target/aarch64/neoverse_v1_2.c: New test.
* gcc.target/aarch64/neoverse_v1_3.c: Likewise.
---
 gcc/config/aarch64/aarch64.cc | 24 ++-
 .../gcc.target/aarch64/neoverse_v1_2.c| 15 
 .../gcc.target/aarch64/neoverse_v1_3.c| 14 +++
 3 files changed, 47 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/neoverse_v1_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/neoverse_v1_3.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 034628148ef..37d414021ca 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -84,6 +84,7 @@
 #include "aarch64-feature-deps.h"
 #include "config/arm/aarch-common.h"
 #include "config/arm/aarch-common-protos.h"
+#include "ssa.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -16411,20 +16412,20 @@ aarch64_multiply_add_p (vec_info *vinfo, 
stmt_vec_info stmt_info,
   if (code != PLUS_EXPR && code != MINUS_EXPR)
 return false;
 
-  for (int i = 1; i < 3; ++i)
+  auto is_mul_result = [&](int i)
 {
   tree rhs = gimple_op (assign, i);
   /* ??? Should we try to check for a single use as well?  */
   if (TREE_CODE (rhs) != SSA_NAME)
-   continue;
+   return false;
 
   stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
   if (!def_stmt_info
  || STMT_VINFO_DEF_TYPE (def_stmt_info) != vect_internal_def)
-   continue;
+   return false;
   gassign *rhs_assign = dyn_cast (def_stmt_info->stmt);
   if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
-   continue;
+   return false;
 
   if (vec_flags & VEC_ADVSIMD)
{
@@ -16444,8 +16445,19 @@ aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info 
stmt_info,
}
 
   return true;
-}
-  return false;
+};
+
+  if (code == MINUS_EXPR && (vec_flags & VEC_ADVSIMD))
+/* Advanced SIMD doesn't have FNMADD/FNMSUB/FNMLA/FNMLS, so the
+   multiplication must be on the second operand (to form an FMLS).
+   But if both operands are multiplications and the second operand
+   is used more than once, we'll instead negate the second operand
+   and use it as an accumulator for the first operand.  */
+return (is_mul_result (2)
+   && (has_single_use (gimple_assign_rhs2 (assign))
+   || !is_mul_result (1)));
+
+  return is_mul_result (1) || is_mul_result (2);
 }
 
 /* Return true if STMT_INFO is the second part of a two-statement boolean AND
diff --git a/gcc/testsuite/gcc.target/aarch64/neoverse_v1_2.c 
b/gcc/testsuite/gcc.target/aarch64/neoverse_v1_2.c
new file mode 100644
index 000..45d7e81c78e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/neoverse_v1_2.c
@@ -0,0 +1,15 @@
+/* { dg-options "-O2 -mcpu=neoverse-v1 --param aarch64-autovec-preference=1 
-fdump-tree-vect-details" } */
+
+void
+f (float x[restrict][100], float y[restrict][100])
+{
+  for (int i = 0; i < 100; ++i)
+{
+  x[0][i] = y[0][i] * y[1][i] - y[3][i] * y[4][i];
+  x[1][i] = y[1][i] * y[2][i] - y[3][i] * y[4][i];
+}
+}
+
+/* { dg-final { scan-tree-dump {_[0-9]+ - _[0-9]+ 1 times vector_stmt costs 2 
} "vect" } } */
+/* { dg-final { scan-tree-dump-not {vector_stmt costs 0 } "vect" } } */
+/* { dg-final { scan-tree-dump {_[0-9]+ - _[0-9]+ 1 times scalar_stmt costs 0 
} "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/neoverse_v1_3.c 
b/gcc/testsuite/gcc.target/aarch64/neoverse_v1_3.c
new file mode 100644
index 000..de31fc13b28
--- /dev/null
+++ 

Re: [PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> Hi Richard,
>
> (that's quick!)
>
>> +  if (size > max_copy_size || size > max_mops_size)
>> +return aarch64_expand_cpymem_mops (operands, is_memmove);
>>
>> Could you explain this a bit more?  If I've followed the logic correctly,
>> max_copy_size will always be 0 for movmem, so this "if" condition will
>> always be true for movmem (given that the caller can be relied on to
>> optimise away zero-length copies).  So doesn't this function reduce to:
>
> In this patch it is zero yes, but there is no real reason for that. The goal 
> is to
> share as much code as possible. I have a patch that inlines memmove like
> memcpy.

But I think this part of the patch belongs in that future series.
The current patch should just concentrate on fixing the bug.

It's difficult to evaluate the change at the moment, without the follow-on
change that it's preparing for.  I don't think it stands as an independent
improvement in its own right.

>> when is_memmove is true?  If so, I think it would be clearer to do that
>> directly, rather than go through aarch64_expand_cpymem.  max_copy_size
>> is really an optimisation threshold, whereas the above seems to be
>> leaning on it for correctness.
>
> In principle we could for the time being add a assert (!is_memmove) if that
> makes it clearer memmove isn't yet handled.

I think for this patch movmemdi should just call aarch64_expand_cpymem_mops
directly.  Let's leave the aarch64_expand_cpymem changes to other patches.

>> ...I think we might as well keep this pattern conditional on TARGET_MOPS.
>
> But then we have inconsistencies in the conditions of the expanders, which
> is what led to all these bugs in the first place (I lost count, there are 4 
> or 5
> different bugs I fixed). Ensuring everything is 100% identical between
> memcpy and memmove makes the code much easier to follow.

I think that too should be part of your follow-on changes to do inline
movmem expansions without TARGET_MOPS.  While all supported movmemdis
require TARGET_MOPS, I think the expander should too.

>> I think we can then also split:
>>
>>   /* All three registers are changed by the instruction, so each one
>>  must be a fresh pseudo.  */
>>   rtx dst_addr = copy_to_mode_reg (Pmode, XEXP (operands[0], 0));
>>   rtx src_addr = copy_to_mode_reg (Pmode, XEXP (operands[1], 0));
>>   rtx dst_mem = replace_equiv_address (operands[0], dst_addr);
>>   rtx src_mem = replace_equiv_address (operands[1], src_addr);
>>   rtx sz_reg = copy_to_mode_reg (DImode, operands[2]);
>>
>> out of aarch64_expand_cpymem_mops into a new function (say
>> aarch64_prepare_mops_operands) and call it from the movmemdi
>> expander.  There should then be no need for the extra staging
>> expander (aarch64_movmemdi).
>
> So you're saying we could remove aarch64_cpymemdi/movmemdi if
> aarch64_expand_cpymem_mops did massage the operands in the
> right way so that we can immediately match the underlying instruction?

Yeah.  But I'd forgotten about the pesky fourth (alignment) operand
to movmemdi and cpymemdi, which we don't need for the mops patterns.
So I take that part back.  I agree it's clearer to have a separate
aarch64_movmemdi expander.

> Hmm, does that actually work, as in we don't lose the extra alias info that
> gets lost in the current memmove expander? (another bug/inconsistency)
>
> And the MOPS code would be separated from aarch64_expand_cpymem
> so we'd do all the MOPS size tests inside aarch64_expand_cpymem_mops
> and the expander tries using MOPS first and if it fails try inline expansion?
>
> So something like:
>
> (define_expand "movmemdi"
> 
>   if (aarch64_try_mops_expansion (operands, is_memmove))
> DONE;
>   if (aarch64_try_inline_copy_expansion (operands, is_memmove))
> DONE;
>   FAIL;
> )
>
>> IMO the STRICT_ALIGNMENT stuff should be a separate patch,
>> with its own testcases.
>
> We will need backports to fix all these bugs, so the question is whether it
> is worth doing a lot of cleanups now?

But I think what I'm asking for is significantly simpler than the
original patch.  That should make it more backportable rather than less.

Thanks,
Richard


Re: [PATCH] rtl: Forward declare rtx_code

2023-08-23 Thread Richard Sandiford via Gcc-patches
"Richard Earnshaw (lists)"  writes:
> On 23/08/2023 16:49, Richard Sandiford via Gcc-patches wrote:
>> Richard Earnshaw via Gcc-patches  writes:
>>> Now that we require C++ 11, we can safely forward declare rtx_code
>>> so that we can use it in target hooks.
>>>
>>> gcc/ChangeLog
>>> * coretypes.h (rtx_code): Add forward declaration.
>>> * rtl.h (rtx_code): Make compatible with forward declaration.
>>> ---
>>>  gcc/coretypes.h | 4 
>>>  gcc/rtl.h   | 2 +-
>>>  2 files changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/gcc/coretypes.h b/gcc/coretypes.h
>>> index ca8837cef67..51e9ce0 100644
>>> --- a/gcc/coretypes.h
>>> +++ b/gcc/coretypes.h
>>> @@ -100,6 +100,10 @@ struct gimple;
>>>  typedef gimple *gimple_seq;
>>>  struct gimple_stmt_iterator;
>>>  
>>> +/* Forward declare rtx_code, so that we can use it in target hooks without
>>> +   needing to pull in rtl.h.  */
>>> +enum rtx_code : unsigned;
>>> +
>>>  /* Forward decls for leaf gimple subclasses (for individual gimple codes).
>>> Keep this in the same order as the corresponding codes in gimple.def.  
>>> */
>>>  
>>> diff --git a/gcc/rtl.h b/gcc/rtl.h
>>> index e1c51156f90..0e9491b89b4 100644
>>> --- a/gcc/rtl.h
>>> +++ b/gcc/rtl.h
>>> @@ -45,7 +45,7 @@ class predefined_function_abi;
>>>  /* Register Transfer Language EXPRESSIONS CODES */
>>>  
>>>  #define RTX_CODE   enum rtx_code
>>> -enum rtx_code  {
>>> +enum rtx_code : unsigned {
>>>  
>>>  #define DEF_RTL_EXPR(ENUM, NAME, FORMAT, CLASS)   ENUM ,
>>>  #include "rtl.def" /* rtl expressions are documented here */
>> 
>> Given:
>> 
>>   #define RTX_CODE_BITSIZE 8
>> 
>> there might be some value in making it uint8_t rather than unsigned.
>> Preapproved if you agree.
>> 
>> But the patch as posted is a strict improvement over the status quo,
>> so it's also OK as-is.
>> 
>> Thanks,
>> Richard
>
> I did think about that, but there were two reasons for not doing so:
> - it presumes we would never want more than 8 bits for rtx_code (well, not 
> quite, 
> but it would make it more work to change this).

The rtx_def structure itself provides a significant barrier to that though.

If we ever think that we need to represent more than 256 separate
operations, I think the natural way would be to treat the less well-used
ones in a similar way to unspecs.

> - it would probably lead to more zero-extension operations happening in the 
> compiler

Yeah, that's true.  The upside though is that we could then declare
arrays of codes directly, without having to resort to "unsigned char"
tricks.  That's unlikely to help codes much, but the same principle
would apply to modes, which are more frequently put into arrays.

E.g. one of the issues with bumping the machine_mode bitfield from 8 to
16 bits was finding all the places where "unsigned char" was used to
hold modes, and changing them to "unsigned short".  If machine_mode was
instead the "right" size, we could just call a spade a spade.
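
As a purely illustrative sketch (the names here are made up, not code in
any patch), the difference is between today's:

  static unsigned char stored_modes[TABLE_SIZE];
  ...
  machine_mode mode = (machine_mode) stored_modes[i];

and simply:

  static machine_mode stored_modes[TABLE_SIZE];
  ...
  machine_mode mode = stored_modes[i];

once the enum has an explicit underlying type of the right width.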

But like I say, that's mostly reasoning by analogy rather than because
the size of rtx_code itself is important.

Richard


Re: [PATCH] rtl: use rtx_code for gen_ccmp_first and gen_ccmp_next

2023-08-23 Thread Richard Sandiford via Gcc-patches
Richard Earnshaw via Gcc-patches  writes:
> Note, this patch is dependent on the patch I posted yesterday to
> forward declare rtx_code in coretypes.h.
>
> --
> Now that we have a forward declaration of rtx_code in coretypes.h, we
> can adjust these hooks to take rtx_code arguments rather than an int.
>
> gcc/ChangeLog:
>
>   * target.def (gen_ccmp_first, gen_ccmp_next): Use rtx_code for
>   CODE, CMP_CODE and BIT_CODE arguments.
>   * config/aarch64/aarch64.cc (aarch64_gen_ccmp_first): Likewise.
>   (aarch64_gen_ccmp_next): Likewise.
>   * doc/tm.texi: Regenerated.

OK, thanks.

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 5 +++--
>  gcc/doc/tm.texi   | 4 ++--
>  gcc/target.def| 4 ++--
>  3 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..bc09185b8ec 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25585,7 +25585,7 @@ aarch64_asan_shadow_offset (void)
>  
>  static rtx
>  aarch64_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn **gen_seq,
> - int code, tree treeop0, tree treeop1)
> + rtx_code code, tree treeop0, tree treeop1)
>  {
>machine_mode op_mode, cmp_mode, cc_mode = CCmode;
>rtx op0, op1;
> @@ -25659,7 +25659,8 @@ aarch64_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn 
> **gen_seq,
>  
>  static rtx
>  aarch64_gen_ccmp_next (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev,
> -int cmp_code, tree treeop0, tree treeop1, int bit_code)
> +rtx_code cmp_code, tree treeop0, tree treeop1,
> +rtx_code bit_code)
>  {
>rtx op0, op1, target;
>machine_mode op_mode, cmp_mode, cc_mode = CCmode;
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 95ba56e05ae..75cb8e3417c 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -12005,7 +12005,7 @@ This target hook is required only when the target has 
> several different
>  modes and they have different conditional execution capability, such as ARM.
>  @end deftypefn
>  
> -@deftypefn {Target Hook} rtx TARGET_GEN_CCMP_FIRST (rtx_insn 
> **@var{prep_seq}, rtx_insn **@var{gen_seq}, int @var{code}, tree @var{op0}, 
> tree @var{op1})
> +@deftypefn {Target Hook} rtx TARGET_GEN_CCMP_FIRST (rtx_insn 
> **@var{prep_seq}, rtx_insn **@var{gen_seq}, rtx_code @var{code}, tree 
> @var{op0}, tree @var{op1})
>  This function prepares to emit a comparison insn for the first compare in a
>   sequence of conditional comparisions.  It returns an appropriate comparison
>   with @code{CC} for passing to @code{gen_ccmp_next} or @code{cbranch_optab}.
> @@ -12015,7 +12015,7 @@ This function prepares to emit a comparison insn for 
> the first compare in a
>   @var{code} is the @code{rtx_code} of the compare for @var{op0} and 
> @var{op1}.
>  @end deftypefn
>  
> -@deftypefn {Target Hook} rtx TARGET_GEN_CCMP_NEXT (rtx_insn 
> **@var{prep_seq}, rtx_insn **@var{gen_seq}, rtx @var{prev}, int 
> @var{cmp_code}, tree @var{op0}, tree @var{op1}, int @var{bit_code})
> +@deftypefn {Target Hook} rtx TARGET_GEN_CCMP_NEXT (rtx_insn 
> **@var{prep_seq}, rtx_insn **@var{gen_seq}, rtx @var{prev}, rtx_code 
> @var{cmp_code}, tree @var{op0}, tree @var{op1}, rtx_code @var{bit_code})
>  This function prepares to emit a conditional comparison within a sequence
>   of conditional comparisons.  It returns an appropriate comparison with
>   @code{CC} for passing to @code{gen_ccmp_next} or @code{cbranch_optab}.
> diff --git a/gcc/target.def b/gcc/target.def
> index 7d684296c17..3ad0bde3ece 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2735,7 +2735,7 @@ DEFHOOK
>   insns are saved in @var{gen_seq}.  They will be emitted when all the\n\
>   compares in the conditional comparision are generated without error.\n\
>   @var{code} is the @code{rtx_code} of the compare for @var{op0} and 
> @var{op1}.",
> - rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, int code, tree op0, tree 
> op1),
> + rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx_code code, tree op0, 
> tree op1),
>   NULL)
>  
>  DEFHOOK
> @@ -2752,7 +2752,7 @@ DEFHOOK
>   be appropriate for passing to @code{gen_ccmp_next} or 
> @code{cbranch_optab}.\n\
>   @var{code} is the @code{rtx_code} of the compare for @var{op0} and 
> @var{op1}.\n\
>   @var{bit_code} is @code{AND} or @code{IOR}, which is the op on the 
> compares.",
> - rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, int cmp_code, tree 
> op0, tree op1, int bit_code),
> + rtx, (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev, rtx_code cmp_code, 
> tree op0, tree op1, rtx_code bit_code),
>   NULL)
>  
>  /* Return a new value for loop unroll size.  */


Re: [PATCH] rtl: Forward declare rtx_code

2023-08-23 Thread Richard Sandiford via Gcc-patches
Richard Earnshaw via Gcc-patches  writes:
> Now that we require C++ 11, we can safely forward declare rtx_code
> so that we can use it in target hooks.
>
> gcc/ChangeLog
>   * coretypes.h (rtx_code): Add forward declaration.
>   * rtl.h (rtx_code): Make compatible with forward declaration.
> ---
>  gcc/coretypes.h | 4 
>  gcc/rtl.h   | 2 +-
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/coretypes.h b/gcc/coretypes.h
> index ca8837cef67..51e9ce0 100644
> --- a/gcc/coretypes.h
> +++ b/gcc/coretypes.h
> @@ -100,6 +100,10 @@ struct gimple;
>  typedef gimple *gimple_seq;
>  struct gimple_stmt_iterator;
>  
> +/* Forward declare rtx_code, so that we can use it in target hooks without
> +   needing to pull in rtl.h.  */
> +enum rtx_code : unsigned;
> +
>  /* Forward decls for leaf gimple subclasses (for individual gimple codes).
> Keep this in the same order as the corresponding codes in gimple.def.  */
>  
> diff --git a/gcc/rtl.h b/gcc/rtl.h
> index e1c51156f90..0e9491b89b4 100644
> --- a/gcc/rtl.h
> +++ b/gcc/rtl.h
> @@ -45,7 +45,7 @@ class predefined_function_abi;
>  /* Register Transfer Language EXPRESSIONS CODES */
>  
>  #define RTX_CODE enum rtx_code
> -enum rtx_code  {
> +enum rtx_code : unsigned {
>  
>  #define DEF_RTL_EXPR(ENUM, NAME, FORMAT, CLASS)   ENUM ,
>  #include "rtl.def"   /* rtl expressions are documented here */

Given:

  #define RTX_CODE_BITSIZE 8

there might be some value in making it uint8_t rather than unsigned.
Preapproved if you agree.

But the patch as posted is a strict improvement over the status quo,
so it's also OK as-is.

Thanks,
Richard


Re: [PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Richard Sandiford via Gcc-patches
Wilco Dijkstra  writes:
> A MOPS memmove may corrupt registers since there is no copy of the input 
> operands to temporary
> registers.  Fix this by calling aarch64_expand_cpymem which does this.  Also 
> fix an issue with
> STRICT_ALIGNMENT being ignored if TARGET_MOPS is true, and avoid crashing or 
> generating a huge
> expansion if aarch64_mops_memcpy_size_threshold is large.
>
> Passes regress/bootstrap, OK for commit?
>
> gcc/ChangeLog/
> PR target/111121
> * config/aarch64/aarch64.md (cpymemdi): Remove STRICT_ALIGNMENT, add 
> param for memmove.
> (aarch64_movmemdi): Add new expander similar to aarch64_cpymemdi.
> (movmemdi): Like cpymemdi call aarch64_expand_cpymem for correct 
> expansion.
> * config/aarch64/aarch64.cc (aarch64_expand_cpymem_mops): Add support 
> for memmove.
> (aarch64_expand_cpymem): Add support for memmove. Handle 
> STRICT_ALIGNMENT correctly.
> Handle TARGET_MOPS size selection correctly.
> * config/aarch64/aarch64-protos.h (aarch64_expand_cpymem): Update 
> prototype.
>
> gcc/testsuite/ChangeLog/
> PR target/111121
> * gcc.target/aarch64/mops_4.c: Add memmove testcases.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 
> 70303d6fd953e0c397b9138ede8858c2db2e53db..97375e81cbda078847af83bf5dd4e0d7673d6af4
>  100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -765,7 +765,7 @@ bool aarch64_emit_approx_div (rtx, rtx, rtx);
>  bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
>  tree aarch64_vector_load_decl (tree);
>  void aarch64_expand_call (rtx, rtx, rtx, bool);
> -bool aarch64_expand_cpymem (rtx *);
> +bool aarch64_expand_cpymem (rtx *, bool);
>  bool aarch64_expand_setmem (rtx *);
>  bool aarch64_float_const_zero_rtx_p (rtx);
>  bool aarch64_float_const_rtx_p (rtx);
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> eba5d4a7e04b7af82437453a691d5607d98133c9..5e8d0a0c91bc7719de2a8c5627b354cf905a4db0
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25135,10 +25135,11 @@ aarch64_copy_one_block_and_progress_pointers (rtx 
> *src, rtx *dst,
>*dst = aarch64_progress_pointer (*dst);
>  }
>
> -/* Expand a cpymem using the MOPS extension.  OPERANDS are taken
> -   from the cpymem pattern.  Return true iff we succeeded.  */
> +/* Expand a cpymem/movmem using the MOPS extension.  OPERANDS are taken
> +   from the cpymem/movmem pattern.  IS_MEMMOVE is true if this is a memmove
> +   rather than memcpy.  Return true iff we succeeded.  */
>  static bool
> -aarch64_expand_cpymem_mops (rtx *operands)
> +aarch64_expand_cpymem_mops (rtx *operands, bool is_memmove)
>  {
>if (!TARGET_MOPS)
>  return false;
> @@ -25150,17 +25151,19 @@ aarch64_expand_cpymem_mops (rtx *operands)
>rtx dst_mem = replace_equiv_address (operands[0], dst_addr);
>rtx src_mem = replace_equiv_address (operands[1], src_addr);
>rtx sz_reg = copy_to_mode_reg (DImode, operands[2]);
> -  emit_insn (gen_aarch64_cpymemdi (dst_mem, src_mem, sz_reg));
> -
> +  if (is_memmove)
> +emit_insn (gen_aarch64_movmemdi (dst_mem, src_mem, sz_reg));
> +  else
> +emit_insn (gen_aarch64_cpymemdi (dst_mem, src_mem, sz_reg));
>return true;
>  }
>
> -/* Expand cpymem, as if from a __builtin_memcpy.  Return true if
> -   we succeed, otherwise return false, indicating that a libcall to
> -   memcpy should be emitted.  */
> -
> +/* Expand cpymem/movmem, as if from a __builtin_memcpy/memmove.
> +   OPERANDS are taken from the cpymem/movmem pattern.  IS_MEMMOVE is true
> +   if this is a memmove rather than memcpy.  Return true if we succeed,
> +   otherwise return false, indicating that a libcall should be emitted.  */
>  bool
> -aarch64_expand_cpymem (rtx *operands)
> +aarch64_expand_cpymem (rtx *operands, bool is_memmove)
>  {
>int mode_bits;
>rtx dst = operands[0];
> @@ -25168,25 +25171,23 @@ aarch64_expand_cpymem (rtx *operands)
>rtx base;
>machine_mode cur_mode = BLKmode;
>
> -  /* Variable-sized memcpy can go through the MOPS expansion if available.  
> */
> -  if (!CONST_INT_P (operands[2]))
> -return aarch64_expand_cpymem_mops (operands);
> +  /* Variable-sized or strict align copies may use the MOPS expansion.  */
> +  if (!CONST_INT_P (operands[2]) || STRICT_ALIGNMENT)
> +return aarch64_expand_cpymem_mops (operands, is_memmove);
>
>unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
>
> -  /* Try to inline up to 256 bytes or use the MOPS threshold if available.  
> */
> -  unsigned HOST_WIDE_INT max_copy_size
> -= TARGET_MOPS ? aarch64_mops_memcpy_size_threshold : 256;
> +  /* Set inline limits for memmove/memcpy.  MOPS has a separate threshold.  
> */
> +  unsigned HOST_WIDE_INT max_copy_size = is_memmove ? 0 : 256;
> +  unsigned HOST_WIDE_INT max_mops_size = max_copy_size;
>
> -  bool size_p = 

Re: [PATCH 03/11] aarch64: Use br instead of ret for eh_return

2023-08-23 Thread Richard Sandiford via Gcc-patches
Szabolcs Nagy  writes:
> The expected way to handle eh_return is to pass the stack adjustment
> offset and landing pad address via
>
>   EH_RETURN_STACKADJ_RTX
>   EH_RETURN_HANDLER_RTX
>
> to the epilogue that is shared between normal return paths and the
> eh_return paths.  EH_RETURN_HANDLER_RTX is the stack slot of the
> return address that is overwritten with the landing pad in the
> eh_return case and EH_RETURN_STACKADJ_RTX is a register added to sp
> right before return and it is set to 0 in the normal return case.
>
> The issue with this design is that eh_return and normal return may
> require different return sequence but there is no way to distinguish
> the two cases in the epilogue (the stack adjustment may be 0 in the
> eh_return case too).
>
> The reason eh_return and normal return requires different return
> sequence is that control flow integrity hardening may need to treat
> eh_return as a forward-edge transfer (it is not returning to the
> previous stack frame) and normal return as a backward-edge one.
> In case of AArch64 forward-edge is protected by BTI and requires br
> instruction and backward-edge is protected by PAUTH or GCS and
> requires ret (or authenticated ret) instruction.
>
> This patch resolves the issue by using the EH_RETURN_STACKADJ_RTX
> register only as a flag that is set to 1 in the eh_return paths
> (it is 0 in normal return paths) and introduces
>
>   AARCH64_EH_RETURN_STACKADJ_RTX
>   AARCH64_EH_RETURN_HANDLER_RTX
>
> to pass the actual stack adjustment and landing pad address to the
> epilogue in the eh_return case. Then the epilogue can use the right
> return sequence based on the EH_RETURN_STACKADJ_RTX flag.
>
> The handler could be passed the old way via clobbering the return
> address, but since now the eh_return case can be distinguished, the
> handler can be in a different register than x30 and no stack frame
> is needed for eh_return.

I don't think there's any specific target-independent requirement for
EH_RETURN_HANDLER_RTX to be a stack slot.  df-scan.cc has code to handle
registers.

So couldn't we just use EH_RETURN_HANDLER_RTX for this, rather than
making it AARCH64_EH_RETURN_HANDLER_RTX?

> The new code generation for functions with eh_return is not amazing,
> since x5 and x6 is assumed to be used by the epilogue even in the
> normal return path, not just for eh_return.  But only the unwinder
> is expected to use eh_return so this is fine.

I guess the problem here is that x5 and x6 are upwards-exposed on
the non-eh_return paths, and so are treated as live for most of the
function.  Is that right?

The patch seems to be using the existing interfaces to implement
a slightly different model.  E.g. it feels like a hack (but a neat hack)
that EH_RETURN_STACKADJ_RTX is now a flag rather than an adjustment,
with AARCH64_EH_RETURN_STACKADJ_RTX then being the "real" stack
adjustment.  And the reason for the upwards exposure of the new
registers on normal return paths is that the existing model has
no hook into the normal return path.

Rather than hiding this in target code, perhaps we should add a
target-independent concept of an "eh_return taken" flag, say
EH_RETURN_TAKEN_RTX.

We could define it so that, on targets that define EH_RETURN_TAKEN_RTX,
a register EH_RETURN_STACKADJ_RTX and a register EH_RETURN_HANDLER_RTX
are only meaningful when the flag is true.  E.g. we could have:

#ifdef EH_RETURN_HANDLER_RTX
  for (rtx tmp : { EH_RETURN_STACKADJ_RTX, EH_RETURN_HANDLER_RTX })
if (tmp && REG_P (tmp))
  emit_clobber (tmp);
#endif

in the "normal return" part of expand_eh_return.  (If some other target
wants a flag with different semantics, it'd be up to them to add it.)

That should avoid most of the bad code-quality effects, since the
specialness of x4-x6 will be confined to the code immediately before
the pre-epilogue exit edges.
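
For concreteness, a target could then define something like this in its
target .h file (register numbers and modes purely illustrative):

  /* Flag register: 1 on eh_return paths, 0 on normal returns.  */
  #define EH_RETURN_TAKEN_RTX     gen_rtx_REG (Pmode, R4_REGNUM)
  /* Only meaningful when EH_RETURN_TAKEN_RTX is 1.  */
  #define EH_RETURN_STACKADJ_RTX  gen_rtx_REG (Pmode, R5_REGNUM)
  #define EH_RETURN_HANDLER_RTX   gen_rtx_REG (Pmode, R6_REGNUM)

with the epilogue testing EH_RETURN_TAKEN_RTX to choose between the
ret-style and br-style return sequences.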

Thanks,
Richard

> This patch fixes a return to anywhere gadget in the unwinder with
> existing standard branch protection as well as makes EH return
> compatible with the Guarded Control Stack (GCS) extension.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-protos.h (aarch64_eh_return_handler_rtx):
>   Remove.
>   (aarch64_eh_return): New.
>   * config/aarch64/aarch64.cc (aarch64_return_address_signing_enabled):
>   Sign return address even in functions with eh_return.
>   (aarch64_epilogue_uses): Mark two registers as used.
>   (aarch64_expand_epilogue): Conditionally return with br or ret.
>   (aarch64_eh_return_handler_rtx): Remove.
>   (aarch64_eh_return): New.
>   * config/aarch64/aarch64.h (EH_RETURN_HANDLER_RTX): Remove.
>   (AARCH64_EH_RETURN_STACKADJ_REGNUM): Define.
>   (AARCH64_EH_RETURN_STACKADJ_RTX): Define.
>   (AARCH64_EH_RETURN_HANDLER_REGNUM): Define.
>   (AARCH64_EH_RETURN_HANDLER_RTX): Define.
>   * config/aarch64/aarch64.md (eh_return): New.
> ---
>  gcc/config/aarch64/aarch64-protos.h |   2 +-
>  gcc/config/aarch64/aarch64.cc   | 106 

Re: [PATCH v2] mklog: handle Signed-Off-By, minor cleanup

2023-08-22 Thread Richard Sandiford via Gcc-patches
Marc Poulhiès via Gcc-patches  writes:
> Consider Signed-Off-By lines as part of the ending of the initial
> commit to avoid having these in the middle of the log when the
> changelog part is injected after.
>
> This is particularly usefull with:
>
>  $ git gcc-commit-mklog --amend -s
>
> that can be used to create the changelog and add the Signed-Off-By line.
>
> Also applies most of the shellcheck suggestions on the
> prepare-commit-msg hook.
>
> contrib/ChangeLog:
>
>   * mklog.py: Leave SOB lines after changelog.
>   * prepare-commit-msg: Apply most shellcheck suggestions.
>
> Signed-off-by: Marc Poulhiès 
> ---
> Previous version was missing the ChangeLog.
>
> This command is used in particular during the dev of the frontend
> for the Rust language (see r13-7099-g4b25fc15b925f8 as an example
> of a SoB ending in the middle of the commit message).
>
> Ok for master?
>
>  contrib/mklog.py   | 34 +-
>  contrib/prepare-commit-msg | 20 ++--
>  2 files changed, 39 insertions(+), 15 deletions(-)
>
> diff --git a/contrib/mklog.py b/contrib/mklog.py
> index 777212c98d7..e5cc69e0d0a 100755
> --- a/contrib/mklog.py
> +++ b/contrib/mklog.py
> @@ -41,7 +41,34 @@ from unidiff import PatchSet
>  
>  LINE_LIMIT = 100
>  TAB_WIDTH = 8
> -CO_AUTHORED_BY_PREFIX = 'co-authored-by: '
> +
> +# Initial commit:
> +#   +--+
> +#   | gccrs: Some title|
> +#   |  | This is the "start"
> +#   | This is some text explaining the commit. |
> +#   | There can be several lines.  |
> +#   |  |<--->
> +#   | Signed-off-by: My Name  | This is the "end"
> +#   +--+
> +#
> +# Results in:
> +#   +--+
> +#   | gccrs: Some title|
> +#   |  |
> +#   | This is some text explaining the commit. | This is the "start"
> +#   | There can be several lines.  |
> +#   |  |<--->
> +#   | gcc/rust/ChangeLog:  |
> +#   |  | This is the 
> generated
> +#   | * some_file (bla):   | ChangeLog part
> +#   | (foo):   |
> +#   |  |<--->
> +#   | Signed-off-by: My Name  | This is the "end"
> +#   +--+
> +
> +# this regex matches the first line of the "end" in the initial commit 
> message
> +FIRST_LINE_OF_END_RE = re.compile('(?i)^(signed-off-by|co-authored-by|#): ')

The current code only requires an initial "#", rather than an initial "#: ".
Is that a deliberate change?

The patch LGTM apart from that.

Thanks,
Richard

>  pr_regex = re.compile(r'(\/(\/|\*)|[Cc*!])\s+(?PPR [a-z+-]+\/[0-9]+)')
>  prnum_regex = re.compile(r'PR (?P[a-z+-]+)/(?P[0-9]+)')
> @@ -330,10 +357,7 @@ def update_copyright(data):
>  
>  
>  def skip_line_in_changelog(line):
> -if line.lower().startswith(CO_AUTHORED_BY_PREFIX) or 
> line.startswith('#'):
> -return False
> -return True
> -
> +return FIRST_LINE_OF_END_RE.match(line) == None
>  
>  if __name__ == '__main__':
>  extra_args = os.getenv('GCC_MKLOG_ARGS')
> diff --git a/contrib/prepare-commit-msg b/contrib/prepare-commit-msg
> index 48c9dad3c6f..1e94706ba40 100755
> --- a/contrib/prepare-commit-msg
> +++ b/contrib/prepare-commit-msg
> @@ -32,11 +32,11 @@ if ! [ -f "$COMMIT_MSG_FILE" ]; then exit 0; fi
>  # Don't do anything unless requested to.
>  if [ -z "$GCC_FORCE_MKLOG" ]; then exit 0; fi
>  
> -if [ -z "$COMMIT_SOURCE" ] || [ $COMMIT_SOURCE = template ]; then
> +if [ -z "$COMMIT_SOURCE" ] || [ "$COMMIT_SOURCE" = template ]; then
>  # No source or "template" means new commit.
>  cmd="diff --cached"
>  
> -elif [ $COMMIT_SOURCE = message ]; then
> +elif [ "$COMMIT_SOURCE" = message ]; then
>  # "message" means -m; assume a new commit if there are any changes 
> staged.
>  if ! git diff --cached --quiet; then
>   cmd="diff --cached"
> @@ -44,23 +44,23 @@ elif [ $COMMIT_SOURCE = message ]; then
>   cmd="diff --cached HEAD^"
>  fi
>  
> -elif [ $COMMIT_SOURCE = commit ]; then
> +elif [ "$COMMIT_SOURCE" = commit ]; then
>  # The message of an existing commit.  If it's HEAD, assume --amend;
>  # otherwise, assume a new commit with -C.
> -if [ $SHA1 = HEAD ]; then
> +if [ "$SHA1" = HEAD ]; then
>   cmd="diff --cached HEAD^"
>   if [ "$(git config gcc-config.mklog-hook-type)" = "smart-amend" ]; then
>   # Check if the existing message 

Re: [PATCH] Remove XFAIL from gcc/testsuite/gcc.dg/unroll-7.c

2023-08-21 Thread Richard Sandiford via Gcc-patches
Thiago Jung Bauermann via Gcc-patches  writes:
> This test passes since commit e41103081bfa "Fix undefined behaviour in
> profile_count::differs_from_p", so remove the xfail annotation.
>
> Tested on aarch64-linux-gnu, armv8l-linux-gnueabihf and x86_64-linux-gnu.
>
> gcc/testsuite/ChangeLog:
>   * gcc.dg/unroll-7.c: Remove xfail.

Thanks, pushed to trunk.  Sorry for the slow response.

Richard

> ---
>  gcc/testsuite/gcc.dg/unroll-7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/gcc.dg/unroll-7.c b/gcc/testsuite/gcc.dg/unroll-7.c
> index 650448df5db1..17c5e533c2cb 100644
> --- a/gcc/testsuite/gcc.dg/unroll-7.c
> +++ b/gcc/testsuite/gcc.dg/unroll-7.c
> @@ -15,4 +15,4 @@ int t(void)
>  /* { dg-final { scan-rtl-dump "upper bound: 99" "loop2_unroll" } } */
>  /* { dg-final { scan-rtl-dump "realistic bound: 99" "loop2_unroll" } } */
>  /* { dg-final { scan-rtl-dump "considering unrolling loop with constant 
> number of iterations" "loop2_unroll" } } */
> -/* { dg-final { scan-rtl-dump-not "Invalid sum" "loop2_unroll" {xfail *-*-* 
> } } } */
> +/* { dg-final { scan-rtl-dump-not "Invalid sum" "loop2_unroll" } } */
>
> base-commit: 5da4c0b85a97727e6802eaf3a0d47bcdb8da5f51


Re: [PATCH] gimple_fold: Support COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold

2023-08-21 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Wed, 16 Aug 2023, Juzhe-Zhong wrote:
>
>> Hi, Richard and Richi.
>> 
>> Currently, GCC support COND_LEN_FMA for floating-point **NO** -ffast-math.
>> It's supported in tree-ssa-math-opts.cc. However, GCC failed to support 
>> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS.
>> 
>> Consider this following case:
>> #define TEST_TYPE(TYPE)  
>>   \
>>   __attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst,  
>>   \
>>TYPE *__restrict a,  \
>>TYPE *__restrict b, int n)   \
>>   {  
>>   \
>> for (int i = 0; i < n; i++)  
>>   \
>>   dst[i] -= a[i] * b[i];   \
>>   }
>> 
>> #define TEST_ALL()   
>>   \
>>   TEST_TYPE (float)  
>>   \
>> 
>> TEST_ALL ()
>> 
>> Gimple IR for RVV:
>> 
>> ...
>> _39 = -vect__8.14_26;
>> vect__10.16_21 = .COND_LEN_FMA ({ -1, ... }, vect__6.11_30, _39, 
>> vect__4.8_34, vect__4.8_34, _46, 0);
>> ...
>> 
>> This is because this following piece of codes in tree-ssa-math-opts.cc:
>> 
>>   if (len)
>>  fma_stmt
>>= gimple_build_call_internal (IFN_COND_LEN_FMA, 7, cond, mulop1, op2,
>>  addop, else_value, len, bias);
>>   else if (cond)
>>  fma_stmt = gimple_build_call_internal (IFN_COND_FMA, 5, cond, mulop1,
>> op2, addop, else_value);
>>   else
>>  fma_stmt = gimple_build_call_internal (IFN_FMA, 3, mulop1, op2, addop);
>>   gimple_set_lhs (fma_stmt, gimple_get_lhs (use_stmt));
>>   gimple_call_set_nothrow (fma_stmt, !stmt_can_throw_internal (cfun,
>> use_stmt));
>>   gsi_replace (, fma_stmt, true);
>>   /* Follow all SSA edges so that we generate FMS, FNMA and FNMS
>>   regardless of where the negation occurs.  */
>>   gimple *orig_stmt = gsi_stmt (gsi);
>>   if (fold_stmt (, follow_all_ssa_edges))
>>  {
>>if (maybe_clean_or_replace_eh_stmt (orig_stmt, gsi_stmt (gsi)))
>>  gcc_unreachable ();
>>update_stmt (gsi_stmt (gsi));
>>  }
>> 
>> 'fold_stmt' failed to fold NEGATE_EXPR + COND_LEN_FMA > COND_LEN_FNMA.
>> 
>> This patch support STMT fold into:
>> 
>> vect__10.16_21 = .COND_LEN_FNMA ({ -1, ... }, vect__8.14_26, vect__6.11_30, 
>> vect__4.8_34, { 0.0, ... }, _46, 0);
>> 
>> Note that COND_LEN_FNMA has 7 arguments and COND_LEN_ADD has 6 arguments.
>> 
>> Extend maximum num ops:
>> -  static const unsigned int MAX_NUM_OPS = 5;
>> +  static const unsigned int MAX_NUM_OPS = 7;
>> 
>> Bootstrap and Regtest on X86 passed.
>> 
>> Fully tested COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS on RISC-V backend.
>> 
>> Testing on aarch64 is on progress.
>> 
>> gcc/ChangeLog:
>> 
>> * genmatch.cc (decision_tree::gen): Support 
>> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
>> * gimple-match-exports.cc (gimple_simplify): Ditto.
>> (gimple_resimplify6): New function.
>> (gimple_resimplify7): New function.
>> (gimple_match_op::resimplify): Support 
>> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
>> (convert_conditional_op): Ditto.
>> (build_call_internal): Ditto.
>> (try_conditional_simplification): Ditto.
>> (gimple_extract): Ditto.
>> * gimple-match.h (gimple_match_cond::gimple_match_cond): Ditto.
>> * internal-fn.cc (CASE): Ditto.
>> 
>> ---
>>  gcc/genmatch.cc |   2 +-
>>  gcc/gimple-match-exports.cc | 124 ++--
>>  gcc/gimple-match.h  |  19 +-
>>  gcc/internal-fn.cc  |  11 ++--
>>  4 files changed, 144 insertions(+), 12 deletions(-)
>> 
>> diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
>> index f46d2e1520d..a1925a747a7 100644
>> --- a/gcc/genmatch.cc
>> +++ b/gcc/genmatch.cc
>> @@ -4052,7 +4052,7 @@ decision_tree::gen (vec  , bool gimple)
>>  }
>>fprintf (stderr, "removed %u duplicate tails\n", rcnt);
>>  
>> -  for (unsigned n = 1; n <= 5; ++n)
>> +  for (unsigned n = 1; n <= 7; ++n)
>>  {
>>bool has_kids_p = false;
>>  
>> diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
>> index 7aeb4ddb152..895950309b7 100644
>> --- a/gcc/gimple-match-exports.cc
>> +++ b/gcc/gimple-match-exports.cc
>> @@ -60,6 +60,12 @@ extern bool gimple_simplify (gimple_match_op *, 
>> gimple_seq *, tree (*)(tree),
>>   code_helper, tree, tree, tree, tree, tree);
>>  extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree 
>> (*)(tree),
>>   code_helper, tree, 

Re: [PATCH] gimple_fold: Support COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold

2023-08-21 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> Hi, Richard and Richi.
>
> Currently, GCC support COND_LEN_FMA for floating-point **NO** -ffast-math.
> It's supported in tree-ssa-math-opts.cc. However, GCC failed to support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS.
>
> Consider this following case:
> #define TEST_TYPE(TYPE)   
>  \
>   __attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst,   
>  \
> TYPE *__restrict a,  \
> TYPE *__restrict b, int n)   \
>   {   
>  \
> for (int i = 0; i < n; i++)   
>  \
>   dst[i] -= a[i] * b[i];   \
>   }
>
> #define TEST_ALL()
>  \
>   TEST_TYPE (float)   
>  \
>
> TEST_ALL ()
>
> Gimple IR for RVV:
>
> ...
> _39 = -vect__8.14_26;
> vect__10.16_21 = .COND_LEN_FMA ({ -1, ... }, vect__6.11_30, _39, 
> vect__4.8_34, vect__4.8_34, _46, 0);
> ...
>
> This is because this following piece of codes in tree-ssa-math-opts.cc:
>
>   if (len)
>   fma_stmt
> = gimple_build_call_internal (IFN_COND_LEN_FMA, 7, cond, mulop1, op2,
>   addop, else_value, len, bias);
>   else if (cond)
>   fma_stmt = gimple_build_call_internal (IFN_COND_FMA, 5, cond, mulop1,
>  op2, addop, else_value);
>   else
>   fma_stmt = gimple_build_call_internal (IFN_FMA, 3, mulop1, op2, addop);
>   gimple_set_lhs (fma_stmt, gimple_get_lhs (use_stmt));
>   gimple_call_set_nothrow (fma_stmt, !stmt_can_throw_internal (cfun,
>  use_stmt));
>   gsi_replace (, fma_stmt, true);
>   /* Follow all SSA edges so that we generate FMS, FNMA and FNMS
>regardless of where the negation occurs.  */
>   gimple *orig_stmt = gsi_stmt (gsi);
>   if (fold_stmt (, follow_all_ssa_edges))
>   {
> if (maybe_clean_or_replace_eh_stmt (orig_stmt, gsi_stmt (gsi)))
>   gcc_unreachable ();
> update_stmt (gsi_stmt (gsi));
>   }
>
> 'fold_stmt' failed to fold NEGATE_EXPR + COND_LEN_FMA > COND_LEN_FNMA.
>
> This patch support STMT fold into:
>
> vect__10.16_21 = .COND_LEN_FNMA ({ -1, ... }, vect__8.14_26, vect__6.11_30, 
> vect__4.8_34, { 0.0, ... }, _46, 0);
>
> Note that COND_LEN_FNMA has 7 arguments and COND_LEN_ADD has 6 arguments.
>
> Extend maximum num ops:
> -  static const unsigned int MAX_NUM_OPS = 5;
> +  static const unsigned int MAX_NUM_OPS = 7;
>
> Bootstrap and Regtest on X86 passed.
>
> Fully tested COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS on RISC-V backend.
>
> Testing on aarch64 is on progress.
>
> gcc/ChangeLog:
>
> * genmatch.cc (decision_tree::gen): Support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
> * gimple-match-exports.cc (gimple_simplify): Ditto.
> (gimple_resimplify6): New function.
> (gimple_resimplify7): New function.
> (gimple_match_op::resimplify): Support 
> COND_LEN_FNMA/COND_LEN_FMS/COND_LEN_FNMS gimple fold.
> (convert_conditional_op): Ditto.
> (build_call_internal): Ditto.
> (try_conditional_simplification): Ditto.
> (gimple_extract): Ditto.
> * gimple-match.h (gimple_match_cond::gimple_match_cond): Ditto.
> * internal-fn.cc (CASE): Ditto.
>
> ---
>  gcc/genmatch.cc |   2 +-
>  gcc/gimple-match-exports.cc | 124 ++--
>  gcc/gimple-match.h  |  19 +-
>  gcc/internal-fn.cc  |  11 ++--
>  4 files changed, 144 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
> index f46d2e1520d..a1925a747a7 100644
> --- a/gcc/genmatch.cc
> +++ b/gcc/genmatch.cc
> @@ -4052,7 +4052,7 @@ decision_tree::gen (vec  , bool gimple)
>  }
>fprintf (stderr, "removed %u duplicate tails\n", rcnt);
>  
> -  for (unsigned n = 1; n <= 5; ++n)
> +  for (unsigned n = 1; n <= 7; ++n)
>  {
>bool has_kids_p = false;
>  
> diff --git a/gcc/gimple-match-exports.cc b/gcc/gimple-match-exports.cc
> index 7aeb4ddb152..895950309b7 100644
> --- a/gcc/gimple-match-exports.cc
> +++ b/gcc/gimple-match-exports.cc
> @@ -60,6 +60,12 @@ extern bool gimple_simplify (gimple_match_op *, gimple_seq 
> *, tree (*)(tree),
>code_helper, tree, tree, tree, tree, tree);
>  extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree (*)(tree),
>code_helper, tree, tree, tree, tree, tree, tree);
> +extern bool gimple_simplify (gimple_match_op *, gimple_seq *, tree (*)(tree),
> +  code_helper, tree, tree, tree, tree, 

Re: [PATCH] tree-optimization/111048 - avoid flawed logic in fold_vec_perm

2023-08-21 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Mon, 21 Aug 2023 at 12:26, Richard Biener  wrote:
>>
>> On Sat, 19 Aug 2023, Prathamesh Kulkarni wrote:
>>
>> > On Fri, 18 Aug 2023 at 14:52, Richard Biener  wrote:
>> > >
>> > > On Fri, 18 Aug 2023, Richard Sandiford wrote:
>> > >
>> > > > Richard Biener  writes:
>> > > > > The following avoids running into somehow flawed logic in 
>> > > > > fold_vec_perm
>> > > > > for non-VLA vectors.
>> > > > >
>> > > > > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
>> > > > >
>> > > > > Richard.
>> > > > >
>> > > > > PR tree-optimization/111048
>> > > > > * fold-const.cc (fold_vec_perm_cst): Check for non-VLA
>> > > > > vectors first.
>> > > > >
>> > > > > * gcc.dg/torture/pr111048.c: New testcase.
>> > > >
>> > > > Please don't do this as a permanent thing.  It was a deliberate choice
>> > > > to have the is_constant be the fallback, so that the "generic" 
>> > > > (VLA+VLS)
>> > > > logic gets more coverage.  Like you say, if something is wrong for VLS
>> > > > then the chances are that it's also wrong for VLA.
>> > >
>> > > Sure, feel free to undo this change together with the fix for the
>> > > VLA case.
>> > Hi,
>> > The attached patch reverts the workaround, and fixes the issue.
>> > Bootstrapped+tested on aarch64-linux-gnu with and without SVE, and
>> > x64_64-linux-gnu.
>> > OK to commit ?
>>
>> OK.
> Thanks, committed to trunk in 649388462e9a3c2de0b90ce525de8044704cc521

Thanks for the patch.  Please remember to close the PR too.

Richard


Re: [PATCH] c: Add support for [[__extension__ ...]]

2023-08-18 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Joseph Myers  writes:
>> On Wed, 16 Aug 2023, Richard Sandiford via Gcc-patches wrote:
>>
>>> Would it be OK to add support for:
>>> 
>>>   [[__extension__ ...]]
>>> 
>>> to suppress the pedwarn about using [[]] prior to C2X?  Then we can
>>
>> That seems like a plausible feature to add.
>
> Thanks.  Of course, once I actually tried it, I hit a snag:
> :: isn't a single lexing token prior to C2X, and so something like:
>
>   [[__extension__ arm::streaming]]
>
> would not be interpreted as a scoped attribute in C11.  The patch
> gets around that by allowing two colons in place of :: when
> __extension__ is used.  I realise that's pushing the bounds of
> acceptability though...
>
> I wondered about trying to require the two colons to be immediately
> adjacent.  But:
>
> (a) There didn't appear to be an existing API to check that, which seemed
> like a red flag.  The closest I could find was get_source_text_between.
>
> Similarly to that, it would in principle be possible to compare
> two expanded locations.  But...
>
> (b) I had a vague impression that locations were allowed to drop column
> information for very large inputs (maybe I'm wrong).
>
> (c) It wouldn't cope with token pasting.
>
> So in the end I just used a simple two-token test, like for [[ and ]].
>
> Bootstrapped & regression-tested on aarch64-linux-gnu.

Gah, as mentioned yesterday, the patch was peeking the wrong token.
I've fixed that, and added corresponding tests.  Sorry for missing
it first time.

Richard

-

[[]] attributes are a recent addition to C, but as a GNU extension,
GCC allows them to be used in C11 and earlier.  Normally this use
would trigger a pedwarn (for -pedantic, -Wc11-c2x-compat, etc.).

This patch allows the pedwarn to be suppressed by starting the
attribute-list with __extension__.

Also, :: is not a single lexing token prior to C2X, so it wasn't
possible to use scoped attributes in C11, even as a GNU extension.
The patch allows two colons to be used in place of :: when
__extension__ is used.  No attempt is made to check whether the
two colons are immediately adjacent.
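
For example, in the style of the new tests (type names here are just
illustrative), both of the following are accepted without a pedwarn
when compiling as C11:

  typedef int [[__extension__ gnu::vector_size (4)]] t1;
  typedef int [[__extension__ gnu : : vector_size (4)]] t2;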

gcc/
* doc/extend.texi: Document the C [[__extension__ ...]] construct.

gcc/c/
* c-parser.cc (c_parser_std_attribute): Conditionally allow
two colons to be used in place of ::.
(c_parser_std_attribute_list): New function, split out from...
(c_parser_std_attribute_specifier): ...here.  Allow the attribute-list
to start with __extension__.  When it does, also allow two colons
to be used in place of ::.

gcc/testsuite/
* gcc.dg/c2x-attr-syntax-6.c: New test.
* gcc.dg/c2x-attr-syntax-7.c: Likewise.
---
 gcc/c/c-parser.cc| 64 ++--
 gcc/doc/extend.texi  | 27 --
 gcc/testsuite/gcc.dg/c2x-attr-syntax-6.c | 62 +++
 gcc/testsuite/gcc.dg/c2x-attr-syntax-7.c | 60 ++
 4 files changed, 193 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/c2x-attr-syntax-6.c
 create mode 100644 gcc/testsuite/gcc.dg/c2x-attr-syntax-7.c

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 33fe7b115ff..ca60c51ddb2 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -5390,10 +5390,18 @@ c_parser_balanced_token_sequence (c_parser *parser)
  ( balanced-token-sequence[opt] )
 
Keywords are accepted as identifiers for this purpose.
-*/
+
+   As an extension, we permit an attribute-specifier to be:
+
+ [ [ __extension__ attribute-list ] ]
+
+   Two colons are then accepted as a synonym for ::.  No attempt is made
+   to check whether the colons are immediately adjacent.  LOOSE_SCOPE_P
+   indicates whether this relaxation is in effect.  */
 
 static tree
-c_parser_std_attribute (c_parser *parser, bool for_tm)
+c_parser_std_attribute (c_parser *parser, bool for_tm,
+   bool loose_scope_p = false)
 {
   c_token *token = c_parser_peek_token (parser);
   tree ns, name, attribute;
@@ -5406,9 +5414,14 @@ c_parser_std_attribute (c_parser *parser, bool for_tm)
 }
   name = canonicalize_attr_name (token->value);
   c_parser_consume_token (parser);
-  if (c_parser_next_token_is (parser, CPP_SCOPE))
+  if (c_parser_next_token_is (parser, CPP_SCOPE)
+  || (loose_scope_p
+ && c_parser_next_token_is (parser, CPP_COLON)
+ && c_parser_peek_2nd_token (parser)->type == CPP_COLON))
 {
   ns = name;
+  if (c_parser_next_token_is (parser, CPP_COLON))
+   c_parser_consume_token (parser);
   c_parser_consume_token (parser);
   token = c_parser_peek_token (parser);
   if (token->type != CPP_NAME && token->type != CPP_KEYWORD)
@@ -5481,19 +5494,9 @@ c_

Re: [PATCH] tree-optimization/111048 - avoid flawed logic in fold_vec_perm

2023-08-18 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> The following avoids running into somehow flawed logic in fold_vec_perm
> for non-VLA vectors.
>
> Bootstrap & regtest running on x86_64-unknown-linux-gnu.
>
> Richard.
>
>   PR tree-optimization/111048
>   * fold-const.cc (fold_vec_perm_cst): Check for non-VLA
>   vectors first.
>
>   * gcc.dg/torture/pr111048.c: New testcase.

Please don't do this as a permanent thing.  It was a deliberate choice
to have the is_constant be the fallback, so that the "generic" (VLA+VLS)
logic gets more coverage.  Like you say, if something is wrong for VLS
then the chances are that it's also wrong for VLA.

Thanks,
Richard


> ---
>  gcc/fold-const.cc   | 12 ++--
>  gcc/testsuite/gcc.dg/torture/pr111048.c | 24 
>  2 files changed, 30 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr111048.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 5c51c9d91be..144fd7481b3 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10625,6 +10625,11 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, 
> const vec_perm_indices ,
>unsigned res_npatterns, res_nelts_per_pattern;
>unsigned HOST_WIDE_INT res_nelts;
>  
> +  if (TYPE_VECTOR_SUBPARTS (type).is_constant (_nelts))
> +{
> +  res_npatterns = res_nelts;
> +  res_nelts_per_pattern = 1;
> +}
>/* (1) If SEL is a suitable mask as determined by
>   valid_mask_for_fold_vec_perm_cst_p, then:
>   res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> @@ -10634,7 +10639,7 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, 
> const vec_perm_indices ,
>   res_npatterns = nelts in result vector.
>   res_nelts_per_pattern = 1.
>   This exception is made so that VLS ARG0, ARG1 and SEL work as before.  
> */
> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +  else if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
>  {
>res_npatterns
>   = std::max (VECTOR_CST_NPATTERNS (arg0),
> @@ -10648,11 +10653,6 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, 
> const vec_perm_indices ,
>  
>res_nelts = res_npatterns * res_nelts_per_pattern;
>  }
> -  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (_nelts))
> -{
> -  res_npatterns = res_nelts;
> -  res_nelts_per_pattern = 1;
> -}
>else
>  return NULL_TREE;
>  
> diff --git a/gcc/testsuite/gcc.dg/torture/pr111048.c 
> b/gcc/testsuite/gcc.dg/torture/pr111048.c
> new file mode 100644
> index 000..475978aae2b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr111048.c
> @@ -0,0 +1,24 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "-mavx2" { target avx2_runtime } } */
> +
> +typedef unsigned char u8;
> +
> +__attribute__((noipa))
> +static void check(const u8 * v) {
> +if (*v != 15) __builtin_trap();
> +}
> +
> +__attribute__((noipa))
> +static void bug(void) {
> +u8 in_lanes[32];
> +for (unsigned i = 0; i < 32; i += 2) {
> +  in_lanes[i + 0] = 0;
> +  in_lanes[i + 1] = ((u8)0xff) >> (i & 7);
> +}
> +
> +check(_lanes[13]);
> +  }
> +
> +int main() {
> +bug();
> +}


Re: [PATCH] c: Add support for [[__extension__ ...]]

2023-08-17 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
>> On 17.08.2023 at 13:25, Richard Sandiford via Gcc-patches wrote:
>> 
>> Joseph Myers  writes:
>>>> On Wed, 16 Aug 2023, Richard Sandiford via Gcc-patches wrote:
>>>> 
>>>> Would it be OK to add support for:
>>>> 
>>>>  [[__extension__ ...]]
>>>> 
>>>> to suppress the pedwarn about using [[]] prior to C2X?  Then we can
>>> 
>>> That seems like a plausible feature to add.
>> 
>> Thanks.  Of course, once I actually tried it, I hit a snag:
>> :: isn't a single lexing token prior to C2X, and so something like:
>> 
>>  [[__extension__ arm::streaming]]
>> 
>> would not be interpreted as a scoped attribute in C11.  The patch
>> gets around that by allowing two colons in place of :: when
>> __extension__ is used.  I realise that's pushing the bounds of
>> acceptability though...
>> 
>> I wondered about trying to require the two colons to be immediately
>> adjacent.  But:
>> 
>> (a) There didn't appear to be an existing API to check that, which seemed
>>like a red flag.  The closest I could find was get_source_text_between.
>
> ISTR a cpp token has ->prev_white or so

Ah, thanks.

  if (c_parser_next_token_is (parser, CPP_SCOPE)
  || (loose_scope_p
  && c_parser_next_token_is (parser, CPP_COLON)
  && c_parser_peek_2nd_token (parser)->type == CPP_COLON
  && !(c_parser_peek_2nd_token (parser)->flags & PREV_WHITE)))

seems to work for (i.e. reject):

typedef int [[__extension__ gnu : : vector_size (4)]] g3;
typedef int [[__extension__ gnu :/**/: vector_size (4)]] g13;

but not:

#define BAR :
typedef int [[__extension__ gnu BAR BAR vector_size (4)]] g5;

#define JOIN(A, B) A/**/B
typedef int [[__extension__ gnu JOIN(:,:) vector_size (4)]] g14;

I now realise the patch was peeking at the wrong token.  Will fix,
and add more tests.

Richard


[PATCH] c: Add support for [[__extension__ ...]]

2023-08-17 Thread Richard Sandiford via Gcc-patches
Joseph Myers  writes:
> On Wed, 16 Aug 2023, Richard Sandiford via Gcc-patches wrote:
>
>> Would it be OK to add support for:
>> 
>>   [[__extension__ ...]]
>> 
>> to suppress the pedwarn about using [[]] prior to C2X?  Then we can
>
> That seems like a plausible feature to add.

Thanks.  Of course, once I actually tried it, I hit a snag:
:: isn't a single lexing token prior to C2X, and so something like:

  [[__extension__ arm::streaming]]

would not be interpreted as a scoped attribute in C11.  The patch
gets around that by allowing two colons in place of :: when
__extension__ is used.  I realise that's pushing the bounds of
acceptability though...

I wondered about trying to require the two colons to be immediately
adjacent.  But:

(a) There didn't appear to be an existing API to check that, which seemed
like a red flag.  The closest I could find was get_source_text_between.

Similarly to that, it would in principle be possible to compare
two expanded locations.  But...

(b) I had a vague impression that locations were allowed to drop column
information for very large inputs (maybe I'm wrong).

(c) It wouldn't cope with token pasting.

So in the end I just used a simple two-token test, like for [[ and ]].

Bootstrapped & regression-tested on aarch64-linux-gnu.

Richard



[[]] attributes are a recent addition to C, but as a GNU extension,
GCC allows them to be used in C11 and earlier.  Normally this use
would trigger a pedwarn (for -pedantic, -Wc11-c2x-compat, etc.).

This patch allows the pedwarn to be suppressed by starting the
attribute-list with __extension__.

Also, :: is not a single lexing token prior to C2X, so it wasn't
possible to use scoped attributes in C11, even as a GNU extension.
The patch allows two colons to be used in place of :: when
__extension__ is used.  No attempt is made to check whether the
two colons are immediately adjacent.

gcc/
* doc/extend.texi: Document the C [[__extension__ ...]] construct.

gcc/c/
* c-parser.cc (c_parser_std_attribute): Conditionally allow
two colons to be used in place of ::.
(c_parser_std_attribute_list): New function, split out from...
(c_parser_std_attribute_specifier): ...here.  Allow the attribute-list
to start with __extension__.  When it does, also allow two colons
to be used in place of ::.

gcc/testsuite/
* gcc.dg/c2x-attr-syntax-6.c: New test.
* gcc.dg/c2x-attr-syntax-7.c: Likewise.
---
 gcc/c/c-parser.cc| 68 ++--
 gcc/doc/extend.texi  | 27 --
 gcc/testsuite/gcc.dg/c2x-attr-syntax-6.c | 50 +
 gcc/testsuite/gcc.dg/c2x-attr-syntax-7.c | 48 +
 4 files changed, 173 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/c2x-attr-syntax-6.c
 create mode 100644 gcc/testsuite/gcc.dg/c2x-attr-syntax-7.c

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 33fe7b115ff..82e56b28446 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -5390,10 +5390,18 @@ c_parser_balanced_token_sequence (c_parser *parser)
  ( balanced-token-sequence[opt] )
 
Keywords are accepted as identifiers for this purpose.
-*/
+
+   As an extension, we permit an attribute-specifier to be:
+
+ [ [ __extension__ attribute-list ] ]
+
+   Two colons are then accepted as a synonym for ::.  No attempt is made
+   to check whether the colons are immediately adjacent.  LOOSE_SCOPE_P
+   indicates whether this relaxation is in effect.  */
 
 static tree
-c_parser_std_attribute (c_parser *parser, bool for_tm)
+c_parser_std_attribute (c_parser *parser, bool for_tm,
+   bool loose_scope_p = false)
 {
   c_token *token = c_parser_peek_token (parser);
   tree ns, name, attribute;
@@ -5406,9 +5414,18 @@ c_parser_std_attribute (c_parser *parser, bool for_tm)
 }
   name = canonicalize_attr_name (token->value);
   c_parser_consume_token (parser);
-  if (c_parser_next_token_is (parser, CPP_SCOPE))
+  if (c_parser_next_token_is (parser, CPP_SCOPE)
+  || (loose_scope_p
+ && c_parser_next_token_is (parser, CPP_COLON)
+ && c_parser_peek_token (parser)->type == CPP_COLON))
 {
   ns = name;
+  if (c_parser_next_token_is (parser, CPP_COLON))
+   {
+ c_parser_consume_token (parser);
+ if (!c_parser_next_token_is (parser, CPP_COLON))
+   gcc_unreachable ();
+   }
   c_parser_consume_token (parser);
   token = c_parser_peek_token (parser);
   if (token->type != CPP_NAME && token->type != CPP_KEYWORD)
@@ -5481,19 +5498,9 @@ c_parser_std_attribute (c_parser *parser, bool for_tm)
 }
 
 static tree
-c_parser_std_attribute_specifier (c_parser *parser, bool for_tm)
+c_parser_std_attribute_list (c_parser *parser, bool for_tm,
+bool loose_scope_p = false)
 {
-  locatio

Re: [PATCH] doc: Fixes to RTL-SSA sample code

2023-08-17 Thread Richard Sandiford via Gcc-patches
Alex Coplan  writes:
> Hi,
>
> This patch fixes up the code examples in the RTL-SSA documentation (the
> sections on making insn changes) to reflect the current API.
>
> The main issues are as follows:
>  - rtl_ssa::recog takes an obstack_watermark & as the first parameter.
>Presumably this is intended to be the change attempt, so I've updated
>the examples to pass this through.
>  - The variants of recog and restrict_movement that take an ignore
>predicate have been renamed with an _ignoring suffix, so I've
>updated callers to use those names.
>  - A couple of minor "obvious" fixes to add a missing address-of
>operator and correct a variable name.
>
> OK for trunk?

OK.  Thanks for doing this.  I'm pretty sure the examples did
compile with one version of the API, but like you say, I forgot
to update it later. :(

Richard

> Thanks,
> Alex
>
> gcc/ChangeLog:
>
>   * doc/rtl.texi: Fix up sample code for RTL-SSA insn changes.
>
> diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
> index 76aeafb8f15..0ed88f58821 100644
> --- a/gcc/doc/rtl.texi
> +++ b/gcc/doc/rtl.texi
> @@ -4964,7 +4964,7 @@ the pass should check whether the new pattern matches a 
> target
>  instruction or satisfies the requirements of an inline asm:
>  
>  @smallexample
> -if (!rtl_ssa::recog (change))
> +if (!rtl_ssa::recog (attempt, change))
>return false;
>  @end smallexample
>  
> @@ -5015,7 +5015,7 @@ if (!rtl_ssa::restrict_movement (change))
>  insn_change_watermark watermark;
>  // Use validate_change etc. to change INSN's pattern.
>  @dots{}
> -if (!rtl_ssa::recog (change)
> +if (!rtl_ssa::recog (attempt, change)
>  || !rtl_ssa::change_is_worthwhile (change))
>return false;
>  
> @@ -5048,7 +5048,7 @@ For example, if a pass is changing exactly two 
> instructions,
>  it might do:
>  
>  @smallexample
> -rtl_ssa::insn_change *changes[] = @{ , change2 @};
> +rtl_ssa::insn_change *changes[] = @{ ,  @};
>  @end smallexample
>  
>  where @code{change1}'s instruction must come before @code{change2}'s.
> @@ -5066,7 +5066,7 @@ in the correct order with respect to each other.
>  The way to do this is:
>  
>  @smallexample
> -if (!rtl_ssa::restrict_movement (change, insn_is_changing (changes)))
> +if (!rtl_ssa::restrict_movement_ignoring (change, insn_is_changing 
> (changes)))
>return false;
>  @end smallexample
>  
> @@ -5078,7 +5078,7 @@ changing instructions (which might, for example, no 
> longer need
>  to clobber the flags register).  The way to do this is:
>  
>  @smallexample
> -if (!rtl_ssa::recog (change, insn_is_changing (changes)))
> +if (!rtl_ssa::recog_ignoring (attempt, change, insn_is_changing (changes)))
>return false;
>  @end smallexample
>  
> @@ -5118,28 +5118,28 @@ Putting all this together, the process for a 
> two-instruction change is:
>  @smallexample
>  auto attempt = crtl->ssa->new_change_attempt ();
>  
> -rtl_ssa::insn_change change (insn1);
> +rtl_ssa::insn_change change1 (insn1);
>  change1.new_defs = @dots{};
>  change1.new_uses = @dots{};
>  change1.move_range = @dots{};
>  
> -rtl_ssa::insn_change change (insn2);
> +rtl_ssa::insn_change change2 (insn2);
>  change2.new_defs = @dots{};
>  change2.new_uses = @dots{};
>  change2.move_range = @dots{};
>  
> -rtl_ssa::insn_change *changes[] = @{ , change2 @};
> +rtl_ssa::insn_change *changes[] = @{ ,  @};
>  
>  auto is_changing = insn_is_changing (changes);
> -if (!rtl_ssa::restrict_movement (change1, is_changing)
> -|| !rtl_ssa::restrict_movement (change2, is_changing))
> +if (!rtl_ssa::restrict_movement_ignoring (change1, is_changing)
> +|| !rtl_ssa::restrict_movement_ignoring (change2, is_changing))
>return false;
>  
>  insn_change_watermark watermark;
>  // Use validate_change etc. to change INSN1's and INSN2's patterns.
>  @dots{}
> -if (!rtl_ssa::recog (change1, is_changing)
> -|| !rtl_ssa::recog (change2, is_changing)
> +if (!rtl_ssa::recog_ignoring (attempt, change1, is_changing)
> +|| !rtl_ssa::recog_ignoring (attempt, change2, is_changing)
>  || !rtl_ssa::changes_are_worthwhile (changes)
>  || !crtl->ssa->verify_insn_changes (changes))
>return false;


Re: [WIP RFC] Add support for keyword-based attributes

2023-08-16 Thread Richard Sandiford via Gcc-patches
Joseph Myers  writes:
> On Mon, 17 Jul 2023, Michael Matz via Gcc-patches wrote:
>
>> So, essentially you want unignorable attributes, right?  Then implement 
>> exactly that: add one new keyword "__known_attribute__" (invent a better 
>> name, maybe :) ), semantics exactly as with __attribute__ (including using 
>> the same underlying lists in our data structures), with only one single 
>> deviation: instead of the warning you give an error for unhandled 
>> attributes.  Done.
>
> Assuming you also want the better-defined standard rules about how [[]] 
> attributes appertain to particular entities, rather than the different 
> __attribute__ rules, that would suggest something like [[!some::attr]] for 
> the case of attributes that can't be ignored but otherwise are handled 
> like standard [[]] attributes.

Yeah, that would work.  But I'd rather not gate the SME work on getting
an extension like that into C and C++.

As it stands, some clang maintainers pushed back against the use of
attributes for important semantics, and preferred keywords instead.
It's clear from these threads that the GCC maintainers prefer attributes
to keywords.  (And it turns out that some other clang maintainers do too,
though not as strongly.)

So I think the easiest way of keeping both constituencies happy(-ish)
is to provide both standard attributes and "keywords", but allow
the "keywords" to be macros that expand to standard attributes.

Would it be OK to add support for:

  [[__extension__ ...]]

to suppress the pedwarn about using [[]] prior to C2X?  Then we can
predefine __arm_streaming to [[__extension__ arm::streaming]], etc.
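
That is, roughly (the exact spelling is still to be settled as part of
the SME series; the function here is just a placeholder):

  #define __arm_streaming [[__extension__ arm::streaming]]

  void f (void) __arm_streaming;   /* usable in C11 without a pedwarn */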

Thanks,
Richard



Re: [PATCH v2][GCC] aarch64: Add support for Cortex-A720 CPU

2023-08-16 Thread Richard Sandiford via Gcc-patches
Richard Ball  writes:
> v2: Add missing PROFILE feature flag.
>
> This patch adds support for the Cortex-A720 CPU to GCC.
>
> No regressions on aarch64-none-elf.
>
> Ok for master?
>
> gcc/ChangeLog:
>
>  * config/aarch64/aarch64-cores.def (AARCH64_CORE): Add Cortex-
>   A720 CPU.
>  * config/aarch64/aarch64-tune.md: Regenerate.
>  * doc/invoke.texi: Document Cortex-A720 CPU.

OK, thanks.

Richard

>
> diff --git a/gcc/config/aarch64/aarch64-cores.def 
> b/gcc/config/aarch64/aarch64-cores.def
> index 
> dbac497ef3aab410eb81db185b2e9532186888bb..73976e9a4c5e4f0b5c04bc7974e2006ddfd02fff
>  100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -176,6 +176,8 @@ AARCH64_CORE("cortex-a710",  cortexa710, cortexa57, V9A,  
> (SVE2_BITPERM, MEMTAG,
>  
>  AARCH64_CORE("cortex-a715",  cortexa715, cortexa57, V9A,  (SVE2_BITPERM, 
> MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4d, -1)
>  
> +AARCH64_CORE("cortex-a720",  cortexa720, cortexa57, V9_2A,  (SVE2_BITPERM, 
> MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1)
> +
>  AARCH64_CORE("cortex-x2",  cortexx2, cortexa57, V9A,  (SVE2_BITPERM, MEMTAG, 
> I8MM, BF16), neoversen2, 0x41, 0xd48, -1)
>  
>  AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,  (SVE2_BITPERM, MEMTAG, 
> I8MM, BF16), neoversen2, 0x41, 0xd4e, -1)
> diff --git a/gcc/config/aarch64/aarch64-tune.md 
> b/gcc/config/aarch64/aarch64-tune.md
> index 
> 2170980dddb0d5d410a49631ad26ff2e346b39dd..12d610f0f6580096eed9cf3de8ad3239efde5e4b
>  100644
> --- a/gcc/config/aarch64/aarch64-tune.md
> +++ b/gcc/config/aarch64/aarch64-tune.md
> @@ -1,5 +1,5 @@
>  ;; -*- buffer-read-only: t -*-
>  ;; Generated automatically by gentune.sh from aarch64-cores.def
>  (define_attr "tune"
> - 
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
> + 
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
>   (const (symbol_ref "((enum attr_tune) aarch64_tune)")))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 
> 2c870d3c34b587ffc721b1f18f99ecd66d4217be..62537d9d09e25f864c27534b7ac2ec467ea24789
>  100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -20517,7 +20517,8 @@ performance of the code.  Permissible values for this 
> option are:
>  @samp{cortex-a75.cortex-a55}, @samp{cortex-a76.cortex-a55},
>  @samp{cortex-r82}, @samp{cortex-x1}, @samp{cortex-x1c}, @samp{cortex-x2},
>  @samp{cortex-x3}, @samp{cortex-a510}, @samp{cortex-a520}, @samp{cortex-a710},
> -@samp{cortex-a715}, @samp{ampere1}, @samp{ampere1a}, and @samp{native}.
> +@samp{cortex-a715}, @samp{cortex-a720}, @samp{ampere1}, @samp{ampere1a},
> +and @samp{native}.
>  
>  The values @samp{cortex-a57.cortex-a53}, @samp{cortex-a72.cortex-a53},
>  @samp{cortex-a73.cortex-a35}, @samp{cortex-a73.cortex-a53},


Re: [PATCH] IFN: Fix vector extraction into promoted subreg.

2023-08-16 Thread Richard Sandiford via Gcc-patches
Robin Dapp  writes:
>> However:
>> 
>> | #define vec_extract_direct { 3, 3, false }
>> 
>> This looks wrong.  The numbers are argument numbers (or -1 for a return
>> value).  vec_extract only takes 2 arguments, so 3 looks to be out-of-range.
>> 
>> | #define direct_vec_extract_optab_supported_p direct_optab_supported_p
>> 
>> I would expect this to be convert_optab_supported_p.
>> 
>> On the promoted subreg thing, I think expand_vec_extract_optab_fn
>> should use expand_fn_using_insn.
>
> Thanks, really easier that way.  Attached a new version that's currently
> bootstrapping.  Does that look better?

LGTM, thanks.  OK if testing passes.

Richard

> Regards
>  Robin
>
> Subject: [PATCH v2] internal-fn: Fix vector extraction into promoted subreg.
>
> This patch fixes the case where vec_extract gets passed a promoted
> subreg (e.g. from a return value).  This is achieved by using
> expand_convert_optab_fn instead of a separate expander function.
>
> gcc/ChangeLog:
>
>   * internal-fn.cc (vec_extract_direct): Change type argument
>   numbers.
>   (expand_vec_extract_optab_fn): Call convert_optab_fn.
>   (direct_vec_extract_optab_supported_p): Use
>   convert_optab_supported_p.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c: New test.
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2u.c: New test.
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3u.c: New test.
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4u.c: New test.
>   * gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-runu.c: New test.
> ---
>  gcc/internal-fn.cc|  44 +-
>  .../rvv/autovec/vls-vlmax/vec_extract-1u.c|  63 
>  .../rvv/autovec/vls-vlmax/vec_extract-2u.c|  69 +
>  .../rvv/autovec/vls-vlmax/vec_extract-3u.c|  69 +
>  .../rvv/autovec/vls-vlmax/vec_extract-4u.c|  70 +
>  .../rvv/autovec/vls-vlmax/vec_extract-runu.c  | 137 ++
>  6 files changed, 413 insertions(+), 39 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-1u.c
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-2u.c
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-3u.c
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-4u.c
>  create mode 100644 
> gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/vec_extract-runu.c
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 4f2b20a79e5..5cce36a789b 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -175,7 +175,7 @@ init_internal_fns ()
>  #define len_store_direct { 3, 3, false }
>  #define mask_len_store_direct { 4, 5, false }
>  #define vec_set_direct { 3, 3, false }
> -#define vec_extract_direct { 3, 3, false }
> +#define vec_extract_direct { 0, -1, false }
>  #define unary_direct { 0, 0, true }
>  #define unary_convert_direct { -1, 0, true }
>  #define binary_direct { 0, 0, true }
> @@ -3127,43 +3127,6 @@ expand_vec_set_optab_fn (internal_fn, gcall *stmt, 
> convert_optab optab)
>gcc_unreachable ();
>  }
>  
> -/* Expand VEC_EXTRACT optab internal function.  */
> -
> -static void
> -expand_vec_extract_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
> -{
> -  tree lhs = gimple_call_lhs (stmt);
> -  tree op0 = gimple_call_arg (stmt, 0);
> -  tree op1 = gimple_call_arg (stmt, 1);
> -
> -  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> -
> -  machine_mode outermode = TYPE_MODE (TREE_TYPE (op0));
> -  machine_mode extract_mode = TYPE_MODE (TREE_TYPE (lhs));
> -
> -  rtx src = expand_normal (op0);
> -  rtx pos = expand_normal (op1);
> -
> -  class expand_operand ops[3];
> -  enum insn_code icode = convert_optab_handler (optab, outermode,
> - extract_mode);
> -
> -  if (icode != CODE_FOR_nothing)
> -{
> -  create_output_operand ([0], target, extract_mode);
> -  create_input_operand ([1], src, outermode);
> -  create_convert_operand_from ([2], pos,
> -TYPE_MODE (TREE_TYPE (op1)), true);
> -  if (maybe_expand_insn (icode, 3, ops))
> - {
> -   if (!rtx_equal_p (target, ops[0].value))
> - emit_move_insn (target, ops[0].value);
> -   return;
> - }
> -}
> -  gcc_unreachable ();
> -}
> -
>  static void
>  expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
>  {
> @@ -3917,6 +3880,9 @@ expand_convert_optab_fn (internal_fn fn, gcall *stmt, 
> convert_optab optab,
>  #define expand_unary_convert_optab_fn(FN, STMT, OPTAB) \
>expand_convert_optab_fn (FN, STMT, OPTAB, 1)
>  
> +#define expand_vec_extract_optab_fn(FN, STMT, OPTAB) \
> +  expand_convert_optab_fn (FN, STMT, OPTAB, 2)
> +
>  /* RETURN_TYPE and ARGS are a return type and argument list that are
> in principle compatible with FN 

Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

2023-08-16 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
>> Unfortunately, the patch regressed following tests on ppc64le and
>> armhf respectively:
>> gcc.target/powerpc/vec-perm-ctor.c scan-tree-dump-not optimized
>> "VIEW_CONVERT_EXPR"
>> gcc.dg/tree-ssa/forwprop-20.c scan-tree-dump-not forwprop1 "VEC_PERM_EXPR"
>>
>> This happens because of the change to vect_cst_ctor_array which
>> removes handling of VECTOR_CST,
>> and thus we return NULL_TREE for cases where VEC_PERM_EXPR has
>> vector_cst, ctor input operands.
>>
>> For eg we fail to fold VEC_PERM_EXPR for the following test taken from
>> forwprop-20.c:
>> void f (double d, vecf* r)
>> {
>>   vecf x = { -d, 5 };
>>   vecf y = {  1, 4 };
>>   veci m = {  2, 0 };
>>   *r = __builtin_shuffle (x, y, m); // { 1, -d }
>> }
>> because vect_cst_ctor_to_array will now return NULL_TREE for vector_cst {1, 
>> 4}.
>>
>> The attached patch thus reverts the changes to vect_cst_ctor_to_array,
>> which makes the tests pass again.
>> I have put the patch for another round of bootstrap+test on the above
>> targets (aarch64, aarch64-sve, x86_64, armhf, ppc64le).
>> OK to commit if it passes ?
> The patch now passes bootstrap+test on all these targets.

OK, thanks.

Richard


Re: [PATCH] IFN: Fix vector extraction into promoted subreg.

2023-08-16 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Hi, Robin, Richard and Richi.
>
> I am wondering whether we can just simply replace the VEC_EXTRACT expander 
> with binary?
>
> Like this :?
>
> DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT, ECF_CONST | ECF_NOTHROW,
> -  vec_extract, vec_extract)
> +  vec_extract, binary)
>
> to fix the sign extend issue.
>
> And remove the vec_extract explicit expander in internal-fn.cc ?

I'm not sure how that would work.  The vec_extract optab takes two
modes whereas binary optabs take one mode.
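
For reference, a rough sketch of the difference (mirroring the lookup that
the existing expander already does; illustrative only, not a proposed change):

  /* A unary/binary optab is keyed on a single mode...  */
  insn_code icode1 = optab_handler (add_optab, TYPE_MODE (vectype));

  /* ...whereas vec_extract is a convert optab, keyed on both the vector
     mode and the mode of the extracted element.  */
  insn_code icode2 = convert_optab_handler (vec_extract_optab,
                                            outermode, extract_mode);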

However:

| #define vec_extract_direct { 3, 3, false }

This looks wrong.  The numbers are argument numbers (or -1 for a return
value).  vec_extract only takes 2 arguments, so 3 looks to be out-of-range.

| #define direct_vec_extract_optab_supported_p direct_optab_supported_p

I would expect this to be convert_optab_supported_p.

On the promoted subreg thing, I think expand_vec_extract_optab_fn
should use expand_fn_using_insn.

Thanks,
Richard


Re: [PATCH] Handle TYPE_OVERFLOW_UNDEFINED vectorized BB reductions

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> The following changes the gate to perform vectorization of BB reductions
> to use needs_fold_left_reduction_p which in turn requires handling
> TYPE_OVERFLOW_UNDEFINED types in the epilogue code generation by
> promoting any operations generated there to use unsigned arithmetic.
>
> The following does this, there's currently only v16qi where x86
> supports a .REDUC_PLUS reduction for integral modes so I had to
> add a x86 specific testcase using GIMPLE IL.
>
> Bootstrap and regtest ongoing on x86_64-unknown-linux-gnu.

LGTM FWIW.

> The next plan is to remove the restriction to .REDUC_PLUS, factoring
> out some of the general non-ifn way of doing a reduction epilog
> from loop reduction handling.  I had a stab at doing in-order
> reductions already but then those are really too similar to
> having general SLP discovery from N scalar defs (and then replacing
> those with extracts), at least since there's no
> fold_left_plus that doesn't add to an existing scalar I can't
> seem to easily just handle that case, possibly discovering
> { x_0, x_1, ..., x_n-1 }, extracting x_0, shifting the vector
> to { x_1, ..., x_n-1,  } and using mask_fold_left_plus
> with accumulating to x_0 and the  element masked would do.
> But I'm not sure that's worth the trouble?

Yeah, I doubt it.  I don't think SVE's FADDA is expected to be an
optimisation in its own right.  It's more of an enabler.

Another reason to use it in loops is that it's VLA-friendly.
But that wouldn't be an issue here.

Thanks,
Richard

> In principle with generic N scalar defs we could do a forward
> discovery from grouped loads and see where that goes (and of
> course handle in-order reductions that way).
>
>   * tree-vect-slp.cc (vect_slp_check_for_roots): Use
>   !needs_fold_left_reduction_p to decide whether we can
>   handle the reduction with association.
>   (vectorize_slp_instance_root_stmt): For TYPE_OVERFLOW_UNDEFINED
>   reductions perform all arithmetic in an unsigned type.
>
>   * gcc.target/i386/vect-reduc-2.c: New testcase.
> ---
>  gcc/testsuite/gcc.target/i386/vect-reduc-2.c | 77 
>  gcc/tree-vect-slp.cc | 44 +++
>  2 files changed, 107 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vect-reduc-2.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/vect-reduc-2.c 
> b/gcc/testsuite/gcc.target/i386/vect-reduc-2.c
> new file mode 100644
> index 000..62559ef8e7b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vect-reduc-2.c
> @@ -0,0 +1,77 @@
> +/* { dg-do compile } */
> +/* { dg-options "-fgimple -O2 -msse2 -fdump-tree-slp2-optimized" } */
> +
> +signed char x[16];
> +
> +signed char __GIMPLE (ssa,guessed_local(1073741824))
> +foo ()
> +{
> +  signed char _1;
> +  signed char _3;
> +  signed char _5;
> +  signed char _6;
> +  signed char _8;
> +  signed char _9;
> +  signed char _11;
> +  signed char _12;
> +  signed char _14;
> +  signed char _15;
> +  signed char _17;
> +  signed char _18;
> +  signed char _20;
> +  signed char _21;
> +  signed char _23;
> +  signed char _24;
> +  signed char _26;
> +  signed char _27;
> +  signed char _29;
> +  signed char _30;
> +  signed char _32;
> +  signed char _33;
> +  signed char _35;
> +  signed char _36;
> +  signed char _38;
> +  signed char _39;
> +  signed char _41;
> +  signed char _42;
> +  signed char _44;
> +  signed char _45;
> +  signed char _47;
> +
> +  __BB(2,guessed_local(1073741824)):
> +  _1 = x[15];
> +  _3 = x[1];
> +  _5 = _1 + _3;
> +  _6 = x[2];
> +  _8 = _5 + _6;
> +  _9 = x[3];
> +  _11 = _8 + _9;
> +  _12 = x[4];
> +  _14 = _11 + _12;
> +  _15 = x[5];
> +  _17 = _14 + _15;
> +  _18 = x[6];
> +  _20 = _17 + _18;
> +  _21 = x[7];
> +  _23 = _20 + _21;
> +  _24 = x[8];
> +  _26 = _23 + _24;
> +  _27 = x[9];
> +  _29 = _26 + _27;
> +  _30 = x[10];
> +  _32 = _29 + _30;
> +  _33 = x[11];
> +  _35 = _32 + _33;
> +  _36 = x[12];
> +  _38 = _35 + _36;
> +  _39 = x[13];
> +  _41 = _38 + _39;
> +  _42 = x[14];
> +  _44 = _41 + _42;
> +  _45 = x[0];
> +  _47 = _44 + _45;
> +  return _47;
> +
> +}
> +
> +/* { dg-final { scan-tree-dump "optimized: basic block part vectorized" 
> "slp2" } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 7020bd9fa0e..07d68f2052b 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -7217,13 +7217,10 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo)
>   }
>else if (!VECTOR_TYPE_P (TREE_TYPE (rhs))
>  && (associative_tree_code (code) || code == MINUS_EXPR)
> -/* ???  The flag_associative_math and TYPE_OVERFLOW_WRAPS
> -   checks pessimize a two-element reduction.  PR54400.
> +/* ???  This pessimizes a two-element reduction.  PR54400.
> ???  In-order reduction could be handled if we only
> traverse one operand chain in vect_slp_linearize_chain.  */
> -&& ((FLOAT_TYPE_P 

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
>> OK, fair enough.  So the idea is: see where we end up and then try to
>> improve/factor the APIs in a less peephole way?
>
> Yeah, I think that's the only good way forward.

OK, no objection from me.  Sorry for holding the patch up.

Richard


Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, 15 Aug 2023, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > On Tue, 15 Aug 2023, Kewen.Lin wrote:
>> >
>> >> Hi Stefan,
>> >> 
>> >> on 2023/8/15 02:51, Stefan Schulze Frielinghaus wrote:
>> >> > Hi everyone,
>> >> > 
>> >> > I have bootstrapped and regtested the patch below on s390.  For the
>> >> > 64-bit target I do not see any changes regarding the testsuite.  For the
>> >> > 31-bit target I see the following failures:
>> >> > 
>> >> > FAIL: gcc.dg/vect/no-scevccp-outer-14.c (internal compiler error: in 
>> >> > require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/no-scevccp-outer-14.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr50451.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr50451.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr53773.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr53773.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr71407.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr71407.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr71416-1.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr71416-1.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (internal 
>> >> > compiler error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr94443.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr94443.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/pr97558.c (internal compiler error: in require, at 
>> >> > machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr97558.c (test for excess errors)
>> >> > FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (internal compiler 
>> >> > error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (test for excess 
>> >> > errors)
>> >> > FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects 
>> >> > (internal compiler error: in require, at machmode.h:313)
>> >> > FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (test 
>> >> > for excess errors)
>> >> > UNRESOLVED: gcc.dg/vect/no-scevccp-outer-14.c compilation failed to 
>> >> > produce executable
>> >> > UNRESOLVED: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects  
>> >> > scan-tree-dump-times optimized "\\* 10" 2
>> >> > UNRESOLVED: gcc.dg/vect/pr53773.c scan-tree-dump-times optimized "\\* 
>> >> > 10" 2
>> >> > UNRESOLVED: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects compilation 
>> >> > failed to produce executable
>> >> > UNRESOLVED: gcc.dg/vect/pr71416-1.c compilation failed to produce 
>> >> > executable
>> >> > UNRESOLVED: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects 
>> >> > compilation failed to produce executable
>> >> > 
>> >> > I've randomely picked pr50451.c and ran gcc against it which results in:
>> >> > 
>> >> > during GIMPLE pass: vect
>> >> > dump file: pr50451.c.174t.vect
>> >> > /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c: In 
>> >> > function 'foo':
>> >> > /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c:5:1:
>> >> >  internal compiler error: in require, at machmode.h:313
>> >> > 0x1265d21 opt_mode::require() const
>> >> > /gcc-verify-workdir/patched/src/gcc/machmode.h:313
>> >> > 0x1d7e4e9 opt_mode::require() const
>> >> > /gcc-verify-workdir/patched/src/gcc/vec.h:955
>> >> > 0x1d7e4e9 vect_verify_loop_lens
>> >> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:1471
>> >> > 0x1da29ab vect_analyze_loop_2
>> >> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:2929
>> >> > 0x1da40c7 vect_analyze_loop_1
>> >> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3330
>> >> > 0x1da499d vect_analyze_loop(loop*, vec_info_shared*)
>> >> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3484
>> >> 

Re: [PATCH][RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
>> On Mon, 7 Aug 2023 at 13:19, Richard Biener  
>> wrote:
>> > It doesn't seem to make a difference for x86.  That said, the "fix" is
>> > probably sticking the correct target on the dump-check, it seems
>> > that vect_fold_extract_last is no longer correct here.
>> Um sorry, I did go thru various checks in target-supports.exp, but not
>> sure which one will be appropriate for this case,
>> and am stuck here :/ Could you please suggest how to proceed ?
>
> Maybe Richard S. knows the magic thing to test, he originally
> implemented the direct conversion support.  I suggest to implement
> such dg-checks if they are not present (I can't find them),
> possibly quite specific to the modes involved (like we have
> other checks with _qi_to_hi suffixes, for float modes maybe
> just _float).

Yeah, can't remember specific selectors for that feature.  TBH I think
most (all?) of the tests were AArch64-specific.

Thanks,
Richard


Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, 15 Aug 2023, Kewen.Lin wrote:
>
>> Hi Stefan,
>> 
>> on 2023/8/15 02:51, Stefan Schulze Frielinghaus wrote:
>> > Hi everyone,
>> > 
>> > I have bootstrapped and regtested the patch below on s390.  For the
>> > 64-bit target I do not see any changes regarding the testsuite.  For the
>> > 31-bit target I see the following failures:
>> > 
>> > FAIL: gcc.dg/vect/no-scevccp-outer-14.c (internal compiler error: in 
>> > require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/no-scevccp-outer-14.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr50451.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr50451.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr53773.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr53773.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr71407.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr71407.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr71416-1.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr71416-1.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr94443.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr94443.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/pr97558.c (internal compiler error: in require, at 
>> > machmode.h:313)
>> > FAIL: gcc.dg/vect/pr97558.c (test for excess errors)
>> > FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (internal compiler 
>> > error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (test for excess 
>> > errors)
>> > FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (internal 
>> > compiler error: in require, at machmode.h:313)
>> > FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (test for 
>> > excess errors)
>> > UNRESOLVED: gcc.dg/vect/no-scevccp-outer-14.c compilation failed to 
>> > produce executable
>> > UNRESOLVED: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects  
>> > scan-tree-dump-times optimized "\\* 10" 2
>> > UNRESOLVED: gcc.dg/vect/pr53773.c scan-tree-dump-times optimized "\\* 10" 2
>> > UNRESOLVED: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects compilation 
>> > failed to produce executable
>> > UNRESOLVED: gcc.dg/vect/pr71416-1.c compilation failed to produce 
>> > executable
>> > UNRESOLVED: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects 
>> > compilation failed to produce executable
>> > 
>> > I've randomely picked pr50451.c and ran gcc against it which results in:
>> > 
>> > during GIMPLE pass: vect
>> > dump file: pr50451.c.174t.vect
>> > /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c: In 
>> > function 'foo':
>> > /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c:5:1: 
>> > internal compiler error: in require, at machmode.h:313
>> > 0x1265d21 opt_mode::require() const
>> > /gcc-verify-workdir/patched/src/gcc/machmode.h:313
>> > 0x1d7e4e9 opt_mode::require() const
>> > /gcc-verify-workdir/patched/src/gcc/vec.h:955
>> > 0x1d7e4e9 vect_verify_loop_lens
>> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:1471
>> > 0x1da29ab vect_analyze_loop_2
>> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:2929
>> > 0x1da40c7 vect_analyze_loop_1
>> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3330
>> > 0x1da499d vect_analyze_loop(loop*, vec_info_shared*)
>> > /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3484
>> > 0x1deed27 try_vectorize_loop_1
>> > /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1064
>> > 0x1deed27 try_vectorize_loop
>> > /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1180
>> > 0x1def5c1 execute
>> > /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1296
>> > Please submit a full bug report, with preprocessed 

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-15 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin  wrote:
>>
>> on 2023/8/14 22:16, Richard Sandiford wrote:
>> > No, it was more that 219-142=77, so it seems like a lot of lines
>> > are being duplicated rather than simply being moved.  (Unlike for
>> > VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
>> > was a clear improvement.)
>> >
>> > So I was just wondering if there was any obvious factoring-out that
>> > could be done to reduce the duplication.
>>
>> ah, thanks for the clarification!
>>
>> I think the main duplication are on the loop body beginning and end,
>> let's take a look at them in details:
>>
>> +  if (memory_access_type == VMAT_GATHER_SCATTER)
>> +{
>> +  gcc_assert (alignment_support_scheme == dr_aligned
>> + || alignment_support_scheme == dr_unaligned_supported);
>> +  gcc_assert (!grouped_load && !slp_perm);
>> +
>> +  unsigned int inside_cost = 0, prologue_cost = 0;
>>
>> // These above are newly added.
>>
>> +  for (j = 0; j < ncopies; j++)
>> +   {
>> + /* 1. Create the vector or array pointer update chain.  */
>> + if (j == 0 && !costing_p)
>> +   {
>> + if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +   vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>> +slp_node, &gs_info, &dataref_ptr,
>> +&vec_offsets);
>> + else
>> +   dataref_ptr
>> + = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>> + at_loop, offset, &dummy, gsi,
>> + &ptr_incr, false, bump);
>> +   }
>> + else if (!costing_p)
>> +   {
>> + gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>> + if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +  gsi, stmt_info, bump);
>> +   }
>>
>> // These are for dataref_ptr, in the final loop nest we deal with more cases
>> on simd_lane_access_p and diff_first_stmt_info, but don't handle
>> STMT_VINFO_GATHER_SCATTER_P any more, very few (one case) can be shared 
>> between,
>> IMHO factoring out it seems like a overkill.
>>
>> +
>> + if (mask && !costing_p)
>> +   vec_mask = vec_masks[j];
>>
>> // It's merged out from j == 0 and j != 0
>>
>> +
>> + gimple *new_stmt = NULL;
>> + for (i = 0; i < vec_num; i++)
>> +   {
>> + tree final_mask = NULL_TREE;
>> + tree final_len = NULL_TREE;
>> + tree bias = NULL_TREE;
>> + if (!costing_p)
>> +   {
>> + if (loop_masks)
>> +   final_mask
>> + = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>> +   vec_num * ncopies, vectype,
>> +   vec_num * j + i);
>> + if (vec_mask)
>> +   final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
>> +  final_mask, vec_mask, 
>> gsi);
>> +
>> + if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, 
>> ptr_incr,
>> +  gsi, stmt_info, bump);
>> +   }
>>
>> // This part is directly copied from the original, the original gets updated 
>> by
>> removing && !STMT_VINFO_GATHER_SCATTER_P.  Due to its size, I didn't consider
>> this before, do you prefer me to factor this part out?
>>
>> + if (gs_info.ifn != IFN_LAST)
>> +   {
>> ...
>> +   }
>> + else
>> +   {
>> + /* Emulated gather-scatter.  */
>> ...
>>
>> // This part is just moved from the original.
>>
>> + vec_dest = vect_create_destination_var (scalar_dest, vectype);
>> + /* DATA_REF is null if we've already built the statement.  */
>> + if (data_ref)
>> +   {
>> + vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
>> + new_stmt = gimple_build_assign (vec_dest, data_ref);
>> +   }
>> + new_temp = make_ssa_name (vec_dest, new_stmt);
>> + gimple_set_lhs (new_stmt, new_temp);
>> + vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
>> +
>> + /* Store vector loads in the corresponding SLP_NODE.  */
>> + if (slp)
>> +   slp_node->push_vec_def (new_stmt);
>> +
>> + if (!slp && !costing_p)
>> +   STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +   }
>> +
>> +  if (!slp && !costing_p)
>> +   

Re: [PATCH] Add support for vector conitional not

2023-08-14 Thread Richard Sandiford via Gcc-patches
Andrew Pinski via Gcc-patches  writes:
> Like the support conditional neg (r12-4470-g20dcda98ed376cb61c74b2c71),
> this just adds conditional not too.
> Also we should be able to turn `(a ? -1 : 0) ^ b` into a conditional
> not.
>
> OK? Bootstrapped and tested on x86_64-linux-gnu and aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>   * internal-fn.def (COND_NOT): New internal function.
>   * match.pd (UNCOND_UNARY, COND_UNARY): Add bit_not/not
>   to the lists.
>   (`vec (a ? -1 : 0) ^ b`): New pattern to convert
>   into conditional not.
>   * optabs.def (cond_one_cmpl): New optab.
>   (cond_len_one_cmpl): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/110986
>   * gcc.target/aarch64/sve/cond_unary_9.c: New test.
> ---
>  gcc/internal-fn.def   |  2 ++
>  gcc/match.pd  | 15 --
>  gcc/optabs.def|  2 ++
>  .../gcc.target/aarch64/sve/cond_unary_9.c | 20 +++
>  4 files changed, 37 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cond_unary_9.c
>
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index b3c410f4b6a..3e8693dfddb 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -69,6 +69,7 @@ along with GCC; see the file COPYING3.  If not see
>   lround2.
>  
> - cond_binary: a conditional binary optab, such as cond_add
> +   - cond_unary: a conditional unary optab, such as cond_neg
> - cond_ternary: a conditional ternary optab, such as cond_fma_rev
>  
> - fold_left: for scalar = FN (scalar, vector), keyed off the vector mode
> @@ -276,6 +277,7 @@ DEF_INTERNAL_COND_FN (FNMA, ECF_CONST, fnma, ternary)
>  DEF_INTERNAL_COND_FN (FNMS, ECF_CONST, fnms, ternary)
>  
>  DEF_INTERNAL_COND_FN (NEG, ECF_CONST, neg, unary)
> +DEF_INTERNAL_COND_FN (NOT, ECF_CONST, one_cmpl, unary)
>  
>  DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
>  
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 6791060891d..2ee6d24ccee 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -84,9 +84,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  
>  /* Unary operations and their associated IFN_COND_* function.  */
>  (define_operator_list UNCOND_UNARY
> -  negate)
> +  negate bit_not)
>  (define_operator_list COND_UNARY
> -  IFN_COND_NEG)
> +  IFN_COND_NEG IFN_COND_NOT)
>  
>  /* Binary operations and their associated IFN_COND_* function.  */
>  (define_operator_list UNCOND_BINARY
> @@ -8482,6 +8482,17 @@ and,
>  && is_truth_type_for (op_type, TREE_TYPE (@0)))
>   (cond_op (bit_not @0) @2 @1)
>  
> +/* `(a ? -1 : 0) ^ b` can be converted into a conditional not.  */
> +(simplify
> + (bit_xor:c (vec_cond @0 uniform_integer_cst_p@1 uniform_integer_cst_p@2) @3)
> + (if (canonicalize_math_after_vectorization_p ()
> +  && vectorized_internal_fn_supported_p (IFN_COND_NOT, type)
> +  && is_truth_type_for (type, TREE_TYPE (@0)))
> + (if (integer_all_onesp (@1) && integer_zerop (@2))
> +  (IFN_COND_NOT @0 @3 @3))
> +  (if (integer_all_onesp (@2) && integer_zerop (@1))
> +   (vec_cond (bit_not @0) @3 @3

Looks like this should be IFN_COND_NOT rather than vec_cond.

LGTM otherwise, but please give Richi 24hrs to comment.

Thanks,
Richard

> +
>  /* Simplify:
>  
>   a = a1 op a2
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 1ea1947b3b5..a58819bc665 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -254,6 +254,7 @@ OPTAB_D (cond_fms_optab, "cond_fms$a")
>  OPTAB_D (cond_fnma_optab, "cond_fnma$a")
>  OPTAB_D (cond_fnms_optab, "cond_fnms$a")
>  OPTAB_D (cond_neg_optab, "cond_neg$a")
> +OPTAB_D (cond_one_cmpl_optab, "cond_one_cmpl$a")
>  OPTAB_D (cond_len_add_optab, "cond_len_add$a")
>  OPTAB_D (cond_len_sub_optab, "cond_len_sub$a")
>  OPTAB_D (cond_len_smul_optab, "cond_len_mul$a")
> @@ -278,6 +279,7 @@ OPTAB_D (cond_len_fms_optab, "cond_len_fms$a")
>  OPTAB_D (cond_len_fnma_optab, "cond_len_fnma$a")
>  OPTAB_D (cond_len_fnms_optab, "cond_len_fnms$a")
>  OPTAB_D (cond_len_neg_optab, "cond_len_neg$a")
> +OPTAB_D (cond_len_one_cmpl_optab, "cond_len_one_cmpl$a")
>  OPTAB_D (cmov_optab, "cmov$a6")
>  OPTAB_D (cstore_optab, "cstore$a4")
>  OPTAB_D (ctrap_optab, "ctrap$a4")
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_9.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_9.c
> new file mode 100644
> index 000..d6bc0409630
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_9.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -moverride=sve_width=256 
> -fdump-tree-optimized" } */
> +
> +/* This is a reduced version of cond_unary_5.c */
> +
> +void __attribute__ ((noipa))
> +f (short *__restrict r,
> +   short *__restrict a,
> +   short *__restrict pred)
> +{
> +  for (int i = 0; i < 1024; ++i)
> +r[i] = pred[i] != 0 ? ~(a[i]) : a[i];
> +}
> +
> +/* { dg-final { scan-assembler-times 

Re: [RFC] GCC Security policy

2023-08-14 Thread Richard Sandiford via Gcc-patches
I think it would help to clarify what the aim of the security policy is.
Specifically:

(1) What service do we want to provide to users by classifying one thing
as a security bug and another thing as not a security bug?

(2) What service do we want to provide to the GNU community by the same
classification?

I think it will be easier to agree on the classification if we first
agree on that.

Siddhesh Poyarekar  writes:
> Hi,
>
> Here's the updated draft of the top part of the security policy with all 
> of the recommendations incorporated.
>
> Thanks,
> Sid
>
>
> What is a GCC security bug?
> ===
>
>  A security bug is one that threatens the security of a system or
>  network, or might compromise the security of data stored on it.
>  In the context of GCC there are multiple ways in which this might
>  happen and they're detailed below.
>
> Compiler drivers, programs, libgccjit and support libraries
> ---
>
>  The compiler driver processes source code, invokes other programs
>  such as the assembler and linker and generates the output result,
>  which may be assembly code or machine code.  It is necessary that
>  all source code inputs to the compiler are trusted, since it is
>  impossible for the driver to validate input source code beyond
>  conformance to a programming language standard.
>
>  The GCC JIT implementation, libgccjit, is intended to be plugged
>  into applications to translate input source code in the application
>  context.  Limitations that apply to the compiler
>  driver, apply here too in terms of sanitizing inputs, so it is
>  recommended that inputs are either sanitized by an external program
>  to allow only trusted, safe execution in the context of the
>  application or the JIT execution context is appropriately sandboxed
>  to contain the effects of any bugs in the JIT or its generated code
>  to the sandboxed environment.
>
>  Support libraries such as libiberty, libcc1 libvtv and libcpp have
>  been developed separately to share code with other tools such as
>  binutils and gdb.  These libraries again have similar challenges to
>  compiler drivers.  While they are expected to be robust against
>  arbitrary input, they should only be used with trusted inputs.
>
>  Libraries such as zlib that are bundled into GCC to build it will be
>  treated the same as the compiler drivers and programs as far as
>  security coverage is concerned.  However if you find an issue in
>  these libraries independent of their use in GCC, you should reach
>  out to their upstream projects to report them.
>
>  As a result, the only case for a potential security issue in all
>  these cases is when it ends up generating vulnerable output for
>  valid input source code.
>
>  As a result, the only case for a potential security issue in the
>  compiler is when it generates vulnerable application code for
>  trusted input source code that is conforming to the relevant
>  programming standard or extensions documented as supported by GCC
>  and the algorithm expressed in the source code does not have the
>  vulnerability.  The output application code could be considered
>  vulnerable if it produces an actual vulnerability in the target
>  application, specifically in the following cases:
>
>  - The application dereferences an invalid memory location despite
>the application sources being valid.
>  - The application reads from or writes to a valid but incorrect
>memory location, resulting in an information integrity issue or an
>information leak.
>  - The application ends up running in an infinite loop or with
>severe degradation in performance despite the input sources having
>no such issue, resulting in a Denial of Service.  Note that
>correct but non-performant code is not a security issue candidate,
>this only applies to incorrect code that may result in performance
>degradation severe enough to amount to a denial of service.
>  - The application crashes due to the generated incorrect code,
>resulting in a Denial of Service.

One difficulty is that wrong-code bugs are rarely confined to
a particular source code structure.  Something that causes a
miscompilation of a bounds check could later be discovered to cause a
miscompilation of something that is less obviously security-sensitive.
Or the same thing could happen in reverse.  And it's common for the
same bug to be reported multiple times, against different testcases.

The proposal says that certain kinds of wrong code could be a security
bug.  But what will be the criteria for deciding whether a wrong code
bug that *could* be classified as a security bug is in fact a security
bug?  Does someone have to show that at least one 

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-14 Thread Richard Sandiford via Gcc-patches
"Kewen.Lin"  writes:
> Hi Richard,
>
> on 2023/8/14 20:20, Richard Sandiford wrote:
>> Thanks for the clean-ups.  But...
>> 
>> "Kewen.Lin"  writes:
>>> Hi,
>>>
>>> Following Richi's suggestion [1], this patch is to move the
>>> handlings on VMAT_GATHER_SCATTER in the final loop nest
>>> of function vectorizable_load to its own loop.  Basically
>>> it duplicates the final loop nest, clean up some useless
>>> set up code for the case of VMAT_GATHER_SCATTER, remove some
>>> unreachable code.  Also remove the corresponding handlings
>>> in the final loop nest.
>>>
>>> Bootstrapped and regtested on x86_64-redhat-linux,
>>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>>
>>> Is it ok for trunk?
>>>
>>> BR,
>>> Kewen
>>> -
>>>
>>> gcc/ChangeLog:
>>>
>>> * tree-vect-stmts.cc (vectorizable_load): Move the handlings on
>>> VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>>> and update the final nest accordingly.
>>> ---
>>>  gcc/tree-vect-stmts.cc | 361 +
>>>  1 file changed, 219 insertions(+), 142 deletions(-)
>> 
>> ...that seems like quite a lot of +s.  Is there nothing we can do to
>> avoid the cut-&-paste?
>
> Thanks for the comments!  I'm not sure if I get your question, if we
> want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
> inevitable?  Your concern is mainly about git blame history?

No, it was more that 219-142=77, so it seems like a lot of lines
are being duplicated rather than simply being moved.  (Unlike for
VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
was a clear improvement.)

So I was just wondering if there was any obvious factoring-out that
could be done to reduce the duplication.

Thanks,
Richard




Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

2023-08-14 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Thu, 10 Aug 2023 at 21:27, Richard Sandiford
>  wrote:
>>
>> Prathamesh Kulkarni  writes:
>> >> static bool
>> >> is_simple_vla_size (poly_uint64 size)
>> >> {
>> >>   if (size.is_constant ())
>> >> return false;
>> >>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
>> >> if (size[i] != (i <= 1 ? size[0] : 0))
>> > Just wondering is this should be (i == 1 ? size[0] : 0) since i is
>> > initialized to 1 ?
>>
>> Both work.  I prefer <= 1 because it doesn't depend on the micro
>> optimisation to start at coefficient 1.  In a theoretical 3-indeterminate
>> poly_int, we want the first 2 coefficients to be nonzero and the rest to
>> be zero.
>>
>> > IIUC, is_simple_vla_size should return true for polynomials of first
>> > degree and having same coeff like 4 + 4x ?
>>
>> FWIW, poly_int only supports first-degree polynomials at the moment.
>> coeffs>2 means there is more than one indeterminate, rather than a
>> higher power.
> Oh OK, thanks for the clarification.
>>
>> >>   return false;
>> >>   return true;
>> >> }
>> >>
>> >>
>> >>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
>> >> {
>> >>   auto nunits = GET_MODE_NUNITS (mode);
>> >>   if (!is_simple_vla_size (nunits))
>> >> continue;
>> >>   if (nunits[0] ...)
>> >> test_... (mode);
>> >>   ...
>> >>
>> >> }
>> >>
>> >> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
>> >> loop structure above, I think we can apply the test_vnx4si and
>> >> test_vnx16qi to more cases.  So the classification isn't the
>> >> exact number of elements, but instead a limit.
>> >>
>> >> I think the nunits[0] conditions for test_vnx4si are as follows
>> >> (inspection only, so could be wrong):
>> >>
>> >> > +/* Test cases where result and input vectors are VNx4SI  */
>> >> > +
>> >> > +static void
>> >> > +test_vnx4si (machine_mode vmode)
>> >> > +{
>> >> > +  /* Case 1: mask = {0, ...} */
>> >> > +  {
>> >> > +tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
>> >> > +tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
>> >> > +poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> >> > +
>> >> > +vec_perm_builder builder (len, 1, 1);
>> >> > +builder.quick_push (0);
>> >> > +vec_perm_indices sel (builder, 2, len);
>> >> > +tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
>> >> > +
>> >> > +tree expected_res[] = { vector_cst_elt (res, 0) };
>> > This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
>> >> > +validate_res (1, 1, res, expected_res);
>> >> > +  }
>> >>
>> >> nunits[0] >= 2 (could be all nunits if the inputs had 
>> >> nelts_per_pattern==1,
>> >> which I think would be better)
>> > IIUC, the vectors that can be used for a particular test should have
>> > nunits[0] >= res_npatterns,
>> > where res_npatterns is as computed in fold_vec_perm_cst without the
>> > canonicalization ?
>> > For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
>> > require nunits[0] >= 2 ?
>> > Which implies we can use above test for vectors with length 2 + 2x, 4 + 
>> > 4x, etc.
>>
>> Right, that's what I meant.  With the inputs as they stand it has to be
>> nunits[0] >= 2.  We need that to form the inputs correctly.  But if the
>> inputs instead had nelts_per_pattern == 1, the test would work for all
>> nunits.
> In the attached patch, I have reordered the tests based on min or max limit.
> For tests where sel_npatterns < 3 (ie dup sequence), I have kept input
> npatterns = 1,
> so we can test more vector modes, and also input npatterns matter only
> for stepped sequence in sel
> (Since for a dup pattern we don't enforce the constraint of selecting
> elements from same input pattern).
> Does it look OK ?
>
> For the following tests with input vectors having shape (1, 3)
> sel = {0, 1, 2, ...}  // (1, 3)
> res = { arg0[0], arg0[1], arg0[2], ... } // (1, 3)
>
> and sel = {len, len + 1, len + 2, ... }  // (1, 3)
> res = { arg1[0], arg1[1], arg1[2], ... } // (1, 3)
>
> Altho res_npatterns = 1, I suppose these will need to be tested with
> vectors with length >= 4 + 4x,
> since index 2 can be ambiguous for length 2 + 2x  ?
> (In the patch, these are cases 2 and 3 in test_nunits_min_4)

Ah, yeah, fair point.  I guess that means:

+  /* Case 3: mask = {len, 0, 1, ...} // (1, 3)
+Test that stepped sequence of the pattern selects from arg0.
+res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
+  {
+   tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+   tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+   poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+   vec_perm_builder builder (len, 1, 3);
+   poly_uint64 mask_elems[] = { len, 0, 1 };
+   builder_push_elems (builder, mask_elems);
+
+   vec_perm_indices sel (builder, 2, len);
+   tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+   tree expected_res[] = { ARG1(0), 

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-14 Thread Richard Sandiford via Gcc-patches
Thanks for the clean-ups.  But...

"Kewen.Lin"  writes:
> Hi,
>
> Following Richi's suggestion [1], this patch is to move the
> handlings on VMAT_GATHER_SCATTER in the final loop nest
> of function vectorizable_load to its own loop.  Basically
> it duplicates the final loop nest, clean up some useless
> set up code for the case of VMAT_GATHER_SCATTER, remove some
> unreachable code.  Also remove the corresponding handlings
> in the final loop nest.
>
> Bootstrapped and regtested on x86_64-redhat-linux,
> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>
> Is it ok for trunk?
>
> BR,
> Kewen
> -
>
> gcc/ChangeLog:
>
>   * tree-vect-stmts.cc (vectorizable_load): Move the handlings on
>   VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>   and update the final nest accordingly.
> ---
>  gcc/tree-vect-stmts.cc | 361 +
>  1 file changed, 219 insertions(+), 142 deletions(-)

...that seems like quite a lot of +s.  Is there nothing we can do to
avoid the cut-&-paste?

Richard

>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index c361e16cb7b..5e514eca19b 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -10455,6 +10455,218 @@ vectorizable_load (vec_info *vinfo,
>return true;
>  }
>
> +  if (memory_access_type == VMAT_GATHER_SCATTER)
> +{
> +  gcc_assert (alignment_support_scheme == dr_aligned
> +   || alignment_support_scheme == dr_unaligned_supported);
> +  gcc_assert (!grouped_load && !slp_perm);
> +
> +  unsigned int inside_cost = 0, prologue_cost = 0;
> +  for (j = 0; j < ncopies; j++)
> + {
> +   /* 1. Create the vector or array pointer update chain.  */
> +   if (j == 0 && !costing_p)
> + {
> +   if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> + vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
> +  slp_node, &gs_info, &dataref_ptr,
> +  &vec_offsets);
> +   else
> + dataref_ptr
> +   = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> +   at_loop, offset, &dummy, gsi,
> +   &ptr_incr, false, bump);
> + }
> +   else if (!costing_p)
> + {
> +   gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
> +   if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> + dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
> +gsi, stmt_info, bump);
> + }
> +
> +   if (mask && !costing_p)
> + vec_mask = vec_masks[j];
> +
> +   gimple *new_stmt = NULL;
> +   for (i = 0; i < vec_num; i++)
> + {
> +   tree final_mask = NULL_TREE;
> +   tree final_len = NULL_TREE;
> +   tree bias = NULL_TREE;
> +   if (!costing_p)
> + {
> +   if (loop_masks)
> + final_mask
> +   = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
> + vec_num * ncopies, vectype,
> + vec_num * j + i);
> +   if (vec_mask)
> + final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
> +final_mask, vec_mask, gsi);
> +
> +   if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> + dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
> +gsi, stmt_info, bump);
> + }
> +
> +   /* 2. Create the vector-load in the loop.  */
> +   unsigned HOST_WIDE_INT align;
> +   if (gs_info.ifn != IFN_LAST)
> + {
> +   if (costing_p)
> + {
> +   unsigned int cnunits = vect_nunits_for_cost (vectype);
> +   inside_cost
> + = record_stmt_cost (cost_vec, cnunits, scalar_load,
> + stmt_info, 0, vect_body);
> +   continue;
> + }
> +   if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> + vec_offset = vec_offsets[vec_num * j + i];
> +   tree zero = build_zero_cst (vectype);
> +   tree scale = size_int (gs_info.scale);
> +
> +   if (gs_info.ifn == IFN_MASK_LEN_GATHER_LOAD)
> + {
> +   if (loop_lens)
> + final_len
> +   = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
> +vec_num * ncopies, vectype,
> +vec_num * j + i, 1);
> +   else
> + final_len
> 

Re: [PATCH] genrecog: Add SUBREG_BYTE.to_constant check to the genrecog

2023-08-14 Thread Richard Sandiford via Gcc-patches
Juzhe-Zhong  writes:
> Hi, there is genrecog issue happens in RISC-V backend.
>
> This is the ICE info:
>
> 0xfa3ba4 poly_int_pod<2u, unsigned short>::to_constant() const
> ../../../riscv-gcc/gcc/poly-int.h:504
> 0x28eaa91 recog_5
> ../../../riscv-gcc/gcc/config/riscv/bitmanip.md:314
> 0x28ec5b4 recog_7
> ../../../riscv-gcc/gcc/config/riscv/iterators.md:81
> 0x2a2e740 recog_436
> ../../../riscv-gcc/gcc/config/riscv/thead.md:265
> 0x2a729ef recog_475
> ../../../riscv-gcc/gcc/config/riscv/sync.md:509
> 0x2a75aec recog(rtx_def*, rtx_insn*, int*)
> ../../../riscv-gcc/gcc/config/riscv/iterators.md:55
> 0x2b3e39e recog_for_combine_1
> ../../../riscv-gcc/gcc/combine.cc:11382
> 0x2b3f457 recog_for_combine
> ../../../riscv-gcc/gcc/combine.cc:11652
> 0x2b25a15 try_combine
> ../../../riscv-gcc/gcc/combine.cc:4054
> 0x2b1d3f1 combine_instructions
> ../../../riscv-gcc/gcc/combine.cc:1266
> 0x2b48cfc rest_of_handle_combine
> ../../../riscv-gcc/gcc/combine.cc:15063
> 0x2b48db8 execute
> ../../../riscv-gcc/gcc/combine.cc:15107
>
> This is because the genrecog code here cause ICE for scalable vector in 
> RISC-V:
>
> Before this patch:
>
> static int
> recog_5 (rtx x1 ATTRIBUTE_UNUSED,
> rtx_insn *insn ATTRIBUTE_UNUSED,
> int *pnum_clobbers ATTRIBUTE_UNUSED)
> {
>   rtx * const operands ATTRIBUTE_UNUSED = _data.operand[0];
>   rtx x2, x3, x4;
>   int res ATTRIBUTE_UNUSED;
>   if (pnum_clobbers == NULL)
> return -1;
>   x2 = XEXP (x1, 1);
>   x3 = XEXP (x2, 0);
>   if (maybe_ne (SUBREG_BYTE (x3).to_constant (), 0) ---> this code cause ICE.
>   || GET_MODE (x3) != E_SImode
>   || !register_operand (operands[0], E_DImode)
>   || GET_MODE (x2) != E_DImode)
> return -1;
> ...
>
> This ICE happens since we have following RTL IR:
>
> (insn 27 26 29 4 (set (reg:RVVM1HI 155 [ vect__12.23 ])
> (sign_extend:RVVM1HI (subreg:RVVMF2QI (reg:RVVMF2x2QI 146 [ 
> vect_array.19 ]) [8, 8]))) "auto.c":29:1 discrim 2 12570 
> {extendrvvmf2qirvvm1hi2}
>  (expr_list:REG_DEAD (reg:RVVMF2x2QI 146 [ vect_array.19 ])
> (nil)))
>
> This is the scalable vector with SUBREG_BYTE = poly (8, 8)
>
> After this patch:
>
> static int
> recog_5 (rtx x1 ATTRIBUTE_UNUSED,
> rtx_insn *insn ATTRIBUTE_UNUSED,
> int *pnum_clobbers ATTRIBUTE_UNUSED)
> {
>   rtx * const operands ATTRIBUTE_UNUSED = _data.operand[0];
>   rtx x2, x3, x4;
>   int res ATTRIBUTE_UNUSED;
>   if (pnum_clobbers == NULL)
> return -1;
>   x2 = XEXP (x1, 1);
>   x3 = XEXP (x2, 0);
>   if ((SUBREG_BYTE (x3).is_constant () && maybe_ne (SUBREG_BYTE 
> (x3).to_constant (), 0))   > change here and fix ICE.
>   || GET_MODE (x3) != E_SImode
>   || !register_operand (operands[0], E_DImode)
>   || GET_MODE (x2) != E_DImode)
> return -1;
>
> Does it reasonable ?
>
> Thanks.
>
> gcc/ChangeLog:
>
> * genrecog.cc (print_test): Add SUBREG_BYTE.to_constant () check.

I think instead we should revert the addition of to_constant.
See: https://inbox.sourceware.org/gcc-patches/mptedn4bwf7@arm.com/
(and earlier messages in that thread).
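
With the to_constant call reverted, genrecog would presumably go back to
emitting the poly-int-aware comparison directly, something like this sketch
(the condition is copied from the generated code quoted above):

  if (maybe_ne (SUBREG_BYTE (x3), 0)
      || GET_MODE (x3) != E_SImode
      || !register_operand (operands[0], E_DImode)
      || GET_MODE (x2) != E_DImode)
    return -1;

maybe_ne copes with a non-constant SUBREG_BYTE, so the ICE goes away without
needing an is_constant () guard.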

Thanks,
Richard

>
> ---
>  gcc/genrecog.cc | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/genrecog.cc b/gcc/genrecog.cc
> index 04a5533ca4b..28884ab3985 100644
> --- a/gcc/genrecog.cc
> +++ b/gcc/genrecog.cc
> @@ -4705,11 +4705,14 @@ print_test (output_state *os, const rtx_test , 
> bool is_param,
>break;
>  
>  case rtx_test::SUBREG_FIELD:
> +  printf ("(SUBREG_BYTE (");
> +  print_test_rtx (os, test);
> +  printf (").is_constant () && ");
>printf ("%s (", invert_p ? "maybe_ne" : "known_eq");
>print_nonbool_test (os, test);
>printf (", ");
>print_label_value (test, is_param, value);
> -  printf (")");
> +  printf ("))");
>break;
>  
>  case rtx_test::SAVED_CONST_INT:


Re: [PATCH V3] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-11 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Fri, 11 Aug 2023, juzhe.zh...@rivai.ai wrote:
>
>> Hi, Richi.
>> 
>> > 1. Target is using loop MASK as the partial vector loop control.
>> >> I don't think it checks for this?
>> 
>> I am not sure whether I understand EXTRACT_LAST correctly.
>> But if target doesn't use loop MASK for partial vector loop control, how 
>> does target use EXTRACT_LAST?
>> Since EXTRACT_LAST is always extracting the last element of the vector 
>> according to MASK operand.
>> 
>> > But we don't really know this at this point?  The only thing we know
>> > is that nothing set LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P to false.
>> 
>> Yes. So I am try to use 'get_len_load_store' to check whether target support 
>> LEN loop control.
>> 
>> Well, I admit it's not a good idea.
>> 
>> 
>> > I think it should work to change the direct_internal_fn_supported_p
>> > check for IFN_EXTRACT_LAST to a "poitive" one guarding
>> 
>> >   gcc_assert (ncopies == 1 && !slp_node);
>> >   vect_record_loop_mask (loop_vinfo,
>>  &LOOP_VINFO_MASKS (loop_vinfo),
>> >  1, vectype, NULL);
>> 
>> > and in the else branch check for VEC_EXTRACT support and if present
>> > record a loop len.  Just in this case this particular order would
>> > be important.
>> 
>> Do you mean change the codes as follows :?
>> 
>> - if (!direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
>> -  OPTIMIZE_FOR_SPEED))
>> -   {
>> - if (dump_enabled_p ())
>> -   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> -"can't operate on partial vectors "
>> -"because the target doesn't support extract 
>> "
>> -"last reduction.\n");
>> - LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> -   }
>> - else if (slp_node)
>>   if (slp_node)
>> {
>>   if (dump_enabled_p ())
>> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>  "can't operate on partial vectors "
>>  "because an SLP statement is live after "
>>  "the loop.\n");
>>   LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> }
>>   else if (ncopies > 1)
>> {
>>   if (dump_enabled_p ())
>> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>  "can't operate on partial vectors "
>>  "because ncopies is greater than 1.\n");
>>   LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>> }
>>   else
>> {
>>   gcc_assert (ncopies == 1 && !slp_node);
>>   if (direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
>>   OPTIMIZE_FOR_SPEED))
>> vect_record_loop_mask (loop_vinfo,
>> &LOOP_VINFO_MASKS (loop_vinfo),
>>1, vectype, NULL);
>>   else
>
> check here the target supports VEC_EXTRACT
>
>> vect_record_loop_len (loop_vinfo,
>>   &LOOP_VINFO_LENS (loop_vinfo),
>>   1, vectype, 1);
>
> else set LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P to false with a
> diagnostic.

I agree with all this FWIW.  That is, the check should be based on
.VEC_EXTRACT alone, but .EXTRACT_LAST should take priority (not least
because SVE provides both .VEC_EXTRACT and .EXTRACT_LAST).
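
For reference, a rough sketch of the structure being agreed on here, based on
the code quoted above (the .VEC_EXTRACT support query is a placeholder name,
not an existing helper, and the diagnostic wording is reused from the code
being replaced):

      else
        {
          gcc_assert (ncopies == 1 && !slp_node);
          if (direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
                                              OPTIMIZE_FOR_SPEED))
            vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
                                   1, vectype, NULL);
          else if (vec_extract_supported_p (vectype))  /* placeholder check */
            vect_record_loop_len (loop_vinfo, &LOOP_VINFO_LENS (loop_vinfo),
                                  1, vectype, 1);
          else
            {
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "can't operate on partial vectors "
                                 "because the target doesn't support extract "
                                 "last reduction.\n");
              LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
            }
        }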

Thanks,
Richard


Re: [PATCH] tree-optimization/110979 - fold-left reduction and partial vectors

2023-08-11 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> When we vectorize fold-left reductions with partial vectors but
> no target operation available we use a vector conditional to force
> excess elements to zero.  But that doesn't correctly preserve
> the sign of zero.  The following patch disables partial vector
> support in that case.
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> Does this look OK?

LGTM.

> With -frounding-math -fno-signed-zeros we are
> happily using the masking again, but that's OK, right?  An additional
> + 0.0 shouldn't do anything here.

Yeah, I would hope so.

Thanks,
Richard

>
> Thanks,
> Richard.
>
>   PR tree-optimization/110979
>   * tree-vect-loop.cc (vectorizable_reduction): For
>   FOLD_LEFT_REDUCTION without target support make sure
>   we don't need to honor signed zeros.
>
>   * gcc.dg/torture/pr110979.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/torture/pr110979.c | 25 +
>  gcc/tree-vect-loop.cc   | 11 +++
>  2 files changed, 36 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr110979.c
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr110979.c 
> b/gcc/testsuite/gcc.dg/torture/pr110979.c
> new file mode 100644
> index 000..c25ad7a8a31
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr110979.c
> @@ -0,0 +1,25 @@
> +/* { dg-do run } */
> +/* { dg-additional-options "--param vect-partial-vector-usage=2" } */
> +
> +#define FLT double
> +#define N 20
> +
> +__attribute__((noipa))
> +FLT
> +foo3 (FLT *a)
> +{
> +  FLT sum = -0.0;
> +  for (int i = 0; i != N; i++)
> +sum += a[i];
> +  return sum;
> +}
> +
> +int main()
> +{
> +  FLT a[N];
> +  for (int i = 0; i != N; i++)
> +a[i] = -0.0;
> +  if (!__builtin_signbit(foo3(a)))
> +__builtin_abort();
> +  return 0;
> +}
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index bf8d677b584..741b5c20389 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -8037,6 +8037,17 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>" no conditional operation is available.\n");
> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>   }
> +  else if (reduction_type == FOLD_LEFT_REDUCTION
> +&& reduc_fn == IFN_LAST
> +&& FLOAT_TYPE_P (vectype_in)
> +&& HONOR_SIGNED_ZEROS (vectype_in))
> + {
> +   if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +  "can't operate on partial vectors because"
> +  " signed zeros need to be handled.\n");
> +   LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> + }
>else
>   {
> internal_fn mask_reduc_fn


Re: [RFC] GCC Security policy

2023-08-10 Thread Richard Sandiford via Gcc-patches
Siddhesh Poyarekar  writes:
> On 2023-08-08 10:30, Siddhesh Poyarekar wrote:
>>> Do you have a suggestion for the language to address libgcc, 
>>> libstdc++, etc. and libiberty, libbacktrace, etc.?
>> 
>> I'll work on this a bit and share a draft.
>
> Hi David,
>
> Here's what I came up with for different parts of GCC, including the 
> runtime libraries.  Over time we may find that specific parts of runtime 
> libraries simply cannot be used safely in some contexts and flag that.
>
> Sid
>
> """
> What is a GCC security bug?
> ===
>
>  A security bug is one that threatens the security of a system or
>  network, or might compromise the security of data stored on it.
>  In the context of GCC there are multiple ways in which this might
>  happen and they're detailed below.
>
> Compiler drivers, programs, libgccjit and support libraries
> ---
>
>  The compiler driver processes source code, invokes other programs
>  such as the assembler and linker and generates the output result,
>  which may be assembly code or machine code.  It is necessary that
>  all source code inputs to the compiler are trusted, since it is
>  impossible for the driver to validate input source code beyond
>  conformance to a programming language standard.
>
>  The GCC JIT implementation, libgccjit, is intended to be plugged
>  into applications to translate input source code in the application
>  context.  Limitations that apply to the compiler
>  driver, apply here too in terms of sanitizing inputs, so it is
>  recommended that inputs are either sanitized by an external program
>  to allow only trusted, safe execution in the context of the
>  application or the JIT execution context is appropriately sandboxed
>  to contain the effects of any bugs in the JIT or its generated code
>  to the sandboxed environment.
>
>  Support libraries such as libiberty, libcc1 libvtv and libcpp have
>  been developed separately to share code with other tools such as
>  binutils and gdb.  These libraries again have similar challenges to
>  compiler drivers.  While they are expected to be robust against
>  arbitrary input, they should only be used with trusted inputs.
>
>  Libraries such as zlib and libffi that are bundled into GCC to build it
>  will be treated the same as the compiler drivers and programs as far
>  as security coverage is concerned.
>
>  As a result, the only case for a potential security issue in all
>  these cases is when it ends up generating vulnerable output for
>  valid input source code.

I think this leaves open the interpretation "every wrong code bug
is potentially a security bug".  I suppose that's true in a trite sense,
but not in a useful sense.  As others said earlier in the thread,
whether a wrong code bug in GCC leads to a security bug in the object
code is too application-dependent to be a useful classification for GCC.

I think we should explicitly say that we don't generally consider wrong
code bugs to be security bugs.  Leaving it implicit is bound to lead
to misunderstanding.

There's another case that I think should be highlighted explicitly:
GCC provides various security-hardening features.  I think any failure
of those features to act as documented is potentially a security bug.
Failure to follow reasonable expectations (even if not documented)
might sometimes be a security bug too.

Thanks,
Richard
>
> Language runtime libraries
> --
>
>  GCC also builds and distributes libraries that are intended to be
>  used widely to implement runtime support for various programming
>  languages.  These include the following:
>
>  * libada
>  * libatomic
>  * libbacktrace
>  * libcc1
>  * libcody
>  * libcpp
>  * libdecnumber
>  * libgcc
>  * libgfortran
>  * libgm2
>  * libgo
>  * libgomp
>  * libiberty
>  * libitm
>  * libobjc
>  * libphobos
>  * libquadmath
>  * libssp
>  * libstdc++
>
>  These libraries are intended to be used in arbitrary contexts and as
>  a result, bugs in these libraries may be evaluated for security
>  impact.  However, some of these libraries, e.g. libgo, libphobos,
>  etc.  are not maintained in the GCC project, due to which the GCC
>  project may not be the correct point of contact for them.  You are
>  encouraged to look at README files within those library directories
>  to locate the canonical security contact point for those projects.
>
> Diagnostic libraries
> 
>
>  The sanitizer library bundled in GCC is intended to be used in
>  diagnostic cases and not intended for use in sensitive environments.
>  As a result, bugs in the sanitizer will not be considered security
>  sensitive.
>
> GCC plugins
> ---
>
>  It should 

Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

2023-08-10 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
>> static bool
>> is_simple_vla_size (poly_uint64 size)
>> {
>>   if (size.is_constant ())
>> return false;
>>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
>> if (size[i] != (i <= 1 ? size[0] : 0))
> Just wondering is this should be (i == 1 ? size[0] : 0) since i is
> initialized to 1 ?

Both work.  I prefer <= 1 because it doesn't depend on the micro
optimisation to start at coefficient 1.  In a theoretical 3-indeterminate
poly_int, we want the first 2 coefficients to be nonzero and the rest to
be zero.

> IIUC, is_simple_vla_size should return true for polynomials of first
> degree and having same coeff like 4 + 4x ?

FWIW, poly_int only supports first-degree polynomials at the moment.
coeffs>2 means there is more than one indeterminate, rather than a
higher power.
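
As a tiny illustration (assuming a target where NUM_POLY_INT_COEFFS == 2,
such as aarch64; the variable names are made up):

  /* "4 + 4x" -- e.g. the element count of VNx4SI -- has coefficients
     {4, 4}; a fixed element count of 4 is {4, 0} and is_constant ().  */
  poly_uint64 vla_nunits (4, 4);
  poly_uint64 vls_nunits (4, 0);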

>>   return false;
>>   return true;
>> }
>>
>>
>>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
>> {
>>   auto nunits = GET_MODE_NUNITS (mode);
>>   if (!is_simple_vla_size (nunits))
>> continue;
>>   if (nunits[0] ...)
>> test_... (mode);
>>   ...
>>
>> }
>>
>> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
>> loop structure above, I think we can apply the test_vnx4si and
>> test_vnx16qi to more cases.  So the classification isn't the
>> exact number of elements, but instead a limit.
>>
>> I think the nunits[0] conditions for test_vnx4si are as follows
>> (inspection only, so could be wrong):
>>
>> > +/* Test cases where result and input vectors are VNx4SI  */
>> > +
>> > +static void
>> > +test_vnx4si (machine_mode vmode)
>> > +{
>> > +  /* Case 1: mask = {0, ...} */
>> > +  {
>> > +tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
>> > +tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
>> > +poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +
>> > +vec_perm_builder builder (len, 1, 1);
>> > +builder.quick_push (0);
>> > +vec_perm_indices sel (builder, 2, len);
>> > +tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
>> > +
>> > +tree expected_res[] = { vector_cst_elt (res, 0) };
> This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
>> > +validate_res (1, 1, res, expected_res);
>> > +  }
>>
>> nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
>> which I think would be better)
> IIUC, the vectors that can be used for a particular test should have
> nunits[0] >= res_npatterns,
> where res_npatterns is as computed in fold_vec_perm_cst without the
> canonicalization ?
> For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
> require nunits[0] >= 2 ?
> Which implies we can use above test for vectors with length 2 + 2x, 4 + 4x, 
> etc.

Right, that's what I meant.  With the inputs as they stand it has to be
nunits[0] >= 2.  We need that to form the inputs correctly.  But if the
inputs instead had nelts_per_pattern == 1, the test would work for all
nunits.

> Sorry if this sounds like a silly question -- Won't nunits[0] >= 2
> cover all nunits,
> since a vector, at a minimum, will contain 2 elements ?

Not necessarily.  VNx1TI makes conceptual sense.  We just don't use it
currently (although that'll change with SME).  And we do have single-element
VLS vectors like V1DI and V1DF.

Thanks,
Richard


Re: [PATCH] VR-VALUES: Simplify comparison using range pairs

2023-08-10 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Thu, Aug 10, 2023 at 3:44 PM Richard Sandiford
>  wrote:
>>
>> Richard Biener via Gcc-patches  writes:
>> > On Wed, Aug 9, 2023 at 6:16 PM Andrew Pinski via Gcc-patches
>> >  wrote:
>> >>
>> >> If `A` has a range of `[0,0][100,INF]` and the comparison
>> >> is `A < 50`, it should be optimized to `A <= 0` (which then
>> >> will be optimized to just `A == 0`).
>> >> This patch implements this via a new function which sees if
>> >> the constant of a comparison is in the middle of 2 range pairs
>> >> and changes the constant to either the upper bound of the first pair
>> >> or the lower bound of the second pair, depending on the comparison.
>> >>
>> >> This is the first step in fixing the following PRS:
>> >> PR 110131, PR 108360, and PR 108397.
>> >>
>> >> OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
>> >
>> >
>> >
>> >> gcc/ChangeLog:
>> >>
>> >> * vr-values.cc (simplify_compare_using_range_pairs): New function.
>> >> (simplify_using_ranges::simplify_compare_using_ranges_1): Call
>> >> it.
>> >>
>> >> gcc/testsuite/ChangeLog:
>> >>
>> >> * gcc.dg/tree-ssa/vrp124.c: New test.
>> >> * gcc.dg/pr21643.c: Disable VRP.
>> >> ---
>> >>  gcc/testsuite/gcc.dg/pr21643.c |  6 ++-
>> >>  gcc/testsuite/gcc.dg/tree-ssa/vrp124.c | 44 +
>> >>  gcc/vr-values.cc   | 65 ++
>> >>  3 files changed, 114 insertions(+), 1 deletion(-)
>> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vrp124.c
>> >>
>> >> diff --git a/gcc/testsuite/gcc.dg/pr21643.c 
>> >> b/gcc/testsuite/gcc.dg/pr21643.c
>> >> index 4e7f93d351a..7f121d7006f 100644
>> >> --- a/gcc/testsuite/gcc.dg/pr21643.c
>> >> +++ b/gcc/testsuite/gcc.dg/pr21643.c
>> >> @@ -1,6 +1,10 @@
>> >>  /* PR tree-optimization/21643 */
>> >>  /* { dg-do compile } */
>> >> -/* { dg-options "-O2 -fdump-tree-reassoc1-details --param 
>> >> logical-op-non-short-circuit=1" } */
>> >> +/* Note VRP is able to transform `c >= 0x20` in f7
>> >> +   to `c >= 0x21` since we want to test
>> >> +   reassociation and not VRP, turn it off. */
>> >> +
>> >> +/* { dg-options "-O2 -fdump-tree-reassoc1-details --param 
>> >> logical-op-non-short-circuit=1 -fno-tree-vrp" } */
>> >>
>> >>  int
>> >>  f1 (unsigned char c)
>> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vrp124.c 
>> >> b/gcc/testsuite/gcc.dg/tree-ssa/vrp124.c
>> >> new file mode 100644
>> >> index 000..6ccbda35d1b
>> >> --- /dev/null
>> >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/vrp124.c
>> >> @@ -0,0 +1,44 @@
>> >> +/* { dg-do compile } */
>> >> +/* { dg-options "-O2 -fdump-tree-optimized" } */
>> >> +
>> >> +/* Should be optimized to a == -100 */
>> >> +int g(int a)
>> >> +{
>> >> +  if (a == -100 || a >= 0)
>> >> +;
>> >> +  else
>> >> +return 0;
>> >> +  return a < 0;
>> >> +}
>> >> +
>> >> +/* Should optimize to a == 0 */
>> >> +int f(int a)
>> >> +{
>> >> +  if (a == 0 || a > 100)
>> >> +;
>> >> +  else
>> >> +return 0;
>> >> +  return a < 50;
>> >> +}
>> >> +
>> >> +/* Should be optimized to a == 0. */
>> >> +int f2(int a)
>> >> +{
>> >> +  if (a == 0 || a > 100)
>> >> +;
>> >> +  else
>> >> +return 0;
>> >> +  return a < 100;
>> >> +}
>> >> +
>> >> +/* Should optimize to a == 100 */
>> >> +int f1(int a)
>> >> +{
>> >> +  if (a < 0 || a == 100)
>> >> +;
>> >> +  else
>> >> +return 0;
>> >> +  return a > 50;
>> >> +}
>> >> +
>> >> +/* { dg-final { scan-tree-dump-not "goto " "optimized" } } */
>> >> diff --git a/gcc/vr-values.cc b/gcc/vr-values.cc
>> >> index a4fddd62841..1262e7cf9f0 100644
>> >> --- a/gcc/vr-values.cc
>> >> +++ b/gcc/vr-values.cc
>> >> @@ -968,9 +968,72 @@ test_for_singularity (enum tree_code cond_code, tree 
>> >> op0,
>> >>if (operand_equal_p (min, max, 0) && is_gimple_min_invariant (min))
>> >> return min;
>> >>  }
>> >> +
>> >>return NULL;
>> >>  }
>> >>
>> >> +/* Simplify integer comparisons such that the constant is one of the 
>> >> range pairs.
>> >> +   For an example,
>> >> +   A has a range of [0,0][100,INF]
>> >> +   and the comparison of `A < 50`.
>> >> +   This should be optimized to `A <= 0`
>> >> +   and then test_for_singularity can optimize it to `A == 0`.   */
>> >> +
>> >> +static bool
>> >> +simplify_compare_using_range_pairs (tree_code &cond_code, tree &op0, tree &op1,
>> >> +   const value_range *vr)
>> >> +{
>> >> +  if (TREE_CODE (op1) != INTEGER_CST
>> >> +  || vr->num_pairs () < 2)
>> >> +return false;
>> >> +  auto val_op1 = wi::to_wide (op1);
>> >> +  tree type = TREE_TYPE (op0);
>> >> +  auto sign = TYPE_SIGN (type);
>> >> +  auto p = vr->num_pairs ();
>> >> +  /* Find the value range pair where op1
>> >> + is in the middle of if one exist. */
>> >> +  for (unsigned i = 1; i < p; i++)
>> >> +{
>> >> +  auto lower = vr->upper_bound (i - 1);
>> >> +  auto upper = vr->lower_bound (i);
>> >> +  if (wi::lt_p (val_op1, lower, 

Re: [PATCH] VR-VALUES: Simplify comparison using range pairs

2023-08-10 Thread Richard Sandiford via Gcc-patches
Richard Biener via Gcc-patches  writes:
> On Wed, Aug 9, 2023 at 6:16 PM Andrew Pinski via Gcc-patches
>  wrote:
>>
>> If `A` has a range of `[0,0][100,INF]` and the comparison
>> is `A < 50`, it should be optimized to `A <= 0` (which then
>> will be optimized to just `A == 0`).
>> This patch implements this via a new function which sees if
>> the constant of a comparison is in the middle of 2 range pairs
>> and changes the constant to either the upper bound of the first pair
>> or the lower bound of the second pair, depending on the comparison.
>>
>> This is the first step in fixing the following PRS:
>> PR 110131, PR 108360, and PR 108397.
>>
>> OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
>
>
>
>> gcc/ChangeLog:
>>
>> * vr-values.cc (simplify_compare_using_range_pairs): New function.
>> (simplify_using_ranges::simplify_compare_using_ranges_1): Call
>> it.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.dg/tree-ssa/vrp124.c: New test.
>> * gcc.dg/pr21643.c: Disable VRP.
>> ---
>>  gcc/testsuite/gcc.dg/pr21643.c |  6 ++-
>>  gcc/testsuite/gcc.dg/tree-ssa/vrp124.c | 44 +
>>  gcc/vr-values.cc   | 65 ++
>>  3 files changed, 114 insertions(+), 1 deletion(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vrp124.c
>>
>> diff --git a/gcc/testsuite/gcc.dg/pr21643.c b/gcc/testsuite/gcc.dg/pr21643.c
>> index 4e7f93d351a..7f121d7006f 100644
>> --- a/gcc/testsuite/gcc.dg/pr21643.c
>> +++ b/gcc/testsuite/gcc.dg/pr21643.c
>> @@ -1,6 +1,10 @@
>>  /* PR tree-optimization/21643 */
>>  /* { dg-do compile } */
>> -/* { dg-options "-O2 -fdump-tree-reassoc1-details --param 
>> logical-op-non-short-circuit=1" } */
>> +/* Note VRP is able to transform `c >= 0x20` in f7
>> +   to `c >= 0x21` since we want to test
>> +   reassociation and not VRP, turn it off. */
>> +
>> +/* { dg-options "-O2 -fdump-tree-reassoc1-details --param 
>> logical-op-non-short-circuit=1 -fno-tree-vrp" } */
>>
>>  int
>>  f1 (unsigned char c)
>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vrp124.c 
>> b/gcc/testsuite/gcc.dg/tree-ssa/vrp124.c
>> new file mode 100644
>> index 000..6ccbda35d1b
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/vrp124.c
>> @@ -0,0 +1,44 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -fdump-tree-optimized" } */
>> +
>> +/* Should be optimized to a == -100 */
>> +int g(int a)
>> +{
>> +  if (a == -100 || a >= 0)
>> +;
>> +  else
>> +return 0;
>> +  return a < 0;
>> +}
>> +
>> +/* Should optimize to a == 0 */
>> +int f(int a)
>> +{
>> +  if (a == 0 || a > 100)
>> +;
>> +  else
>> +return 0;
>> +  return a < 50;
>> +}
>> +
>> +/* Should be optimized to a == 0. */
>> +int f2(int a)
>> +{
>> +  if (a == 0 || a > 100)
>> +;
>> +  else
>> +return 0;
>> +  return a < 100;
>> +}
>> +
>> +/* Should optimize to a == 100 */
>> +int f1(int a)
>> +{
>> +  if (a < 0 || a == 100)
>> +;
>> +  else
>> +return 0;
>> +  return a > 50;
>> +}
>> +
>> +/* { dg-final { scan-tree-dump-not "goto " "optimized" } } */
>> diff --git a/gcc/vr-values.cc b/gcc/vr-values.cc
>> index a4fddd62841..1262e7cf9f0 100644
>> --- a/gcc/vr-values.cc
>> +++ b/gcc/vr-values.cc
>> @@ -968,9 +968,72 @@ test_for_singularity (enum tree_code cond_code, tree 
>> op0,
>>if (operand_equal_p (min, max, 0) && is_gimple_min_invariant (min))
>> return min;
>>  }
>> +
>>return NULL;
>>  }
>>
>> +/* Simplify integer comparisons such that the constant is one of the range 
>> pairs.
>> +   For an example,
>> +   A has a range of [0,0][100,INF]
>> +   and the comparison of `A < 50`.
>> +   This should be optimized to `A <= 0`
>> +   and then test_for_singularity can optimize it to `A == 0`.   */
>> +
>> +static bool
>> +simplify_compare_using_range_pairs (tree_code &cond_code, tree &op0, tree &op1,
>> +   const value_range *vr)
>> +{
>> +  if (TREE_CODE (op1) != INTEGER_CST
>> +  || vr->num_pairs () < 2)
>> +return false;
>> +  auto val_op1 = wi::to_wide (op1);
>> +  tree type = TREE_TYPE (op0);
>> +  auto sign = TYPE_SIGN (type);
>> +  auto p = vr->num_pairs ();
>> +  /* Find the value range pair where op1
>> + is in the middle of if one exist. */
>> +  for (unsigned i = 1; i < p; i++)
>> +{
>> +  auto lower = vr->upper_bound (i - 1);
>> +  auto upper = vr->lower_bound (i);
>> +  if (wi::lt_p (val_op1, lower, sign))
>> +   continue;
>> +  if (wi::gt_p (val_op1, upper, sign))
>> +   continue;
>
> That looks like a linear search - it looks like m_base[] is
> a sorted array of values so we should be able to
> binary search here?  array_slice::bsearch could be
> used if it existed (simply port it over from vec<> and
> use array_slice from that)?

Better to use std::lower_bound IMO, rather than implement our
own custom bsearch.
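
For illustration, the sort of thing std::lower_bound gives us, written here
against a plain sorted array of bounds rather than the value_range API (the
helper name and array are made up):

  /* Return the first bound that is not less than VAL under SIGN.  */
  static const wide_int *
  first_bound_ge (const wide_int *bounds, unsigned n,
                  const wide_int &val, signop sign)
  {
    return std::lower_bound (bounds, bounds + n, val,
                             [sign] (const wide_int &a, const wide_int &b)
                             { return wi::lt_p (a, b, sign); });
  }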

Thanks,
Richard


Re: [PATCH] aarch64: enable mixed-types for aarch64 simdclones

2023-08-10 Thread Richard Sandiford via Gcc-patches
Jakub Jelinek  writes:
> On Wed, Aug 09, 2023 at 06:27:20PM +0100, Richard Sandiford wrote:
>> Jakub Jelinek  writes:
>> > On Wed, Aug 09, 2023 at 05:55:28PM +0100, Richard Sandiford wrote:
>> >> Jakub: do you remember what the reason was?  I don't mind dropping
>> >> "function", but it feels weird to drop the quotes around "simd".
>> >> Seems like, if we do that, there'll one day be a patch to add
>> >> them back. :)
>> >
> Because in OpenMP there are %<declare simd%> functions, not %<simd%>
> %<functions%>, but we also have the %<simd%>/%<__simd__%> attribute as
>> > extension.
>> 
>> Yeah, I can understand dropping the "function" bit.  But why
>> s/unsupported ... for %<simd%>/unsupported ... for simd/?
>> Even if it's only a partial syntax quote, it is still a syntax quote.
>
> %<simd%> in OpenMP is something very different though, so I think it is
> better to use it as a generic term which covers the different syntax cases.

OK, I won't press it further.

Richard


Re: [PATCH] aarch64: enable mixed-types for aarch64 simdclones

2023-08-09 Thread Richard Sandiford via Gcc-patches
Jakub Jelinek  writes:
> On Wed, Aug 09, 2023 at 05:55:28PM +0100, Richard Sandiford wrote:
>> Jakub: do you remember what the reason was?  I don't mind dropping
>> "function", but it feels weird to drop the quotes around "simd".
>> Seems like, if we do that, there'll one day be a patch to add
>> them back. :)
>
> Because in OpenMP there are %<declare simd%> functions, not %<simd%>
> %<functions%>, but we also have the %<simd%>/%<__simd__%> attribute as
> extension.

Yeah, I can understand dropping the "function" bit.  But why
>> s/unsupported ... for %<simd%>/unsupported ... for simd/?
Even if it's only a partial syntax quote, it is still a syntax quote.

Thanks,
Richard


Re: [PATCH] aarch64: enable mixed-types for aarch64 simdclones

2023-08-09 Thread Richard Sandiford via Gcc-patches
"Andre Vieira (lists)"  writes:
> Here is my new version, see inline response to your comments.
>
> New cover letter:
>
> This patch enables the use of mixed-types for simd clones for AArch64, 
> adds aarch64 as a target_vect_simd_clones and corrects the way the 
> simdlen is chosen for non-specified simdlen clauses according to the 
> 'Vector Function Application Binary Interface Specification for AArch64'.
>
> gcc/ChangeLog:
>
>  * config/aarch64/aarch64.cc (currently_supported_simd_type): 
> Remove.
>  (aarch64_simd_clone_compute_vecsize_and_simdlen): Determine 
> simdlen according to NDS rule.
>  (lane_size): New function.
>
> gcc/testsuite/ChangeLog:
>
>  * lib/target-supports.exp: Add aarch64 targets to vect_simd_clones.
>  * c-c++-common/gomp/declare-variant-14.c: Add aarch64 checks 
> and remove warning check.
>  * g++.dg/gomp/attrs-10.C: Likewise.
>  * g++.dg/gomp/declare-simd-1.C: Likewise.
>  * g++.dg/gomp/declare-simd-3.C: Likewise.
>  * g++.dg/gomp/declare-simd-4.C: Likewise.
>  * gcc.dg/gomp/declare-simd-3.c: Likewise.
>  * gcc.dg/gomp/simd-clones-2.c: Likewise.
>  * gfortran.dg/gomp/declare-variant-14.f90: Likewise.
>  * c-c++-common/gomp/pr60823-1.c: Remove warning check.
>  * c-c++-common/gomp/pr60823-3.c: Likewise.
>  * g++.dg/gomp/declare-simd-7.C: Likewise.
>  * g++.dg/gomp/declare-simd-8.C: Likewise.
>  * g++.dg/gomp/pr88182.C: Likewise.
>  * gcc.dg/declare-simd.c: Likewise.
>  * gcc.dg/gomp/declare-simd-1.c: Likewise.
>  * gcc.dg/gomp/pr87895-1.c: Likewise.
>  * gfortran.dg/gomp/declare-simd-2.f90: Likewise.
>  * gfortran.dg/gomp/declare-simd-coarray-lib.f90: Likewise.
>  * gfortran.dg/gomp/pr79154-1.f90: Likewise.
>  * gfortran.dg/gomp/pr83977.f90: Likewise.
>  * gcc.dg/gomp/pr87887-1.c: Add warning test.
>  * gcc.dg/gomp/pr89246-1.c: Likewise.
>  * gcc.dg/gomp/pr99542.c: Update warning test.
>
>
>
> On 08/08/2023 11:51, Richard Sandiford wrote:
>> "Andre Vieira (lists)"  writes:
>
>>> warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
>>> -   "unsupported return type %qT for % functions",
>>> +   "unsupported return type %qT for simd",
>>> ret_type);
>> 
>> What's the reason for s/%<simd%> functions/simd/, in particular for
>> dropping the quotes around simd?
>
> It's to align with i386's error message, this helps with testing as then 
> I can avoid having different tests for the same error.
>
> I asked Jakub which one he preferred, and he gave me an explanation why 
> the i386's one was preferable, ... but I didn't write it down unfortunately.

Jakub: do you remember what the reason was?  I don't mind dropping
"function", but it feels weird to drop the quotes around "simd".
Seems like, if we do that, there'll one day be a patch to add
them back. :)

Thanks,
Richard


Re: [PATCH] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-09 Thread Richard Sandiford via Gcc-patches
"juzhe.zh...@rivai.ai"  writes:
> Hi, Richi.
>
>>> that should be
>
>>>   || (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
>>>   && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
>
>>> I think.  It seems to imply that SLP isn't supported with
>>> masking/lengthing.
>
> Oh, yes.  At first glance, the original code is quite suspicious and your 
> comments make sense to me.
>
>>> Hum, how does CFN_EXTRACT_LAST handle both mask and length transparently?
>>> Don't you need some CFN_LEN_EXTRACT_LAST instead?
>
> I think CFN_EXTRACT_LAST always has either loop mask or loop len.
>
> When both mask and length are not needed, IMHO, I think current BIT_FIELD_REF 
> flow is good enough:
> https://godbolt.org/z/Yr5M9hcc6

I'm a bit behind on email, but why isn't BIT_FIELD_REF enough for
the case that the patch is handling?  It seems that:

  .EXTRACT_LAST (len, vec)

is equivalent to:

  vec[len - 1]

I think eventually there'll be the temptation to lower/fold it like that.

FWIW, I agree a IFN_LEN_EXTRACT_LAST/IFN_EXTRACT_LAST_LEN would be OK,
with a mask, vector, length and bias.  But even then, I think there'll
be a temptation to lower calls with all-1 masks to val[len - 1 - bias].
So I think the function only makes sense if we have a use case where
the mask might not be all-1s.

Thanks,
Richard


Re: [PATCH] aarch64: SVE/NEON Bridging intrinsics

2023-08-09 Thread Richard Sandiford via Gcc-patches
Richard Ball  writes:
> ACLE has added intrinsics to bridge between SVE and Neon.
>
> The NEON_SVE Bridge adds intrinsics that allow conversions between NEON and
> SVE vectors.
>
> This patch adds support to GCC for the following 3 intrinsics:
> svset_neonq, svget_neonq and svdup_neonq
>
> gcc/ChangeLog:
>
>   * config.gcc: Adds new header to config.
>   * config/aarch64/aarch64-builtins.cc (GTY): Externs aarch64_simd_types.
>   * config/aarch64/aarch64-c.cc (aarch64_pragma_aarch64):
>   Defines pragma for arm_neon_sve_bridge.h.
>   * config/aarch64/aarch64-protos.h: New function.
>   * config/aarch64/aarch64-sve-builtins-base.h: New intrinsics.
>   * config/aarch64/aarch64-sve-builtins-base.cc
>   (class svget_neonq_impl): New intrinsic implementation.
>   (class svset_neonq_impl): Likewise.
>   (class svdup_neonq_impl): Likewise.
>   (NEON_SVE_BRIDGE_FUNCTION): New intrinsics.
>   * config/aarch64/aarch64-sve-builtins-functions.h
>   (NEON_SVE_BRIDGE_FUNCTION): Defines macro for NEON_SVE_BRIDGE 
> functions.
>   * config/aarch64/aarch64-sve-builtins-shapes.h: New shapes.
>   * config/aarch64/aarch64-sve-builtins-shapes.cc
>   (parse_neon_type): Parser for NEON types.
>   (parse_element_type): Add NEON element types.
>   (parse_type): Likewise.
>   (NEON_SVE_BRIDGE_SHAPE): Defines macro for NEON_SVE_BRIDGE shapes.
>   (struct get_neonq_def): Defines function shape for get_neonq.
>   (struct set_neonq_def): Defines function shape for set_neonq.
>   (struct dup_neonq_def): Defines function shape for dup_neonq.
>   * config/aarch64/aarch64-sve-builtins.cc (DEF_NEON_SVE_FUNCTION): 
> Defines
>   macro for NEON_SVE_BRIDGE functions.
>   (handle_arm_neon_sve_bridge_h): Handles #pragma arm_neon_sve_bridge.h.
>   * config/aarch64/aarch64-builtins.h: New header file to extern neon 
> types.
>   * config/aarch64/aarch64-neon-sve-bridge-builtins.def: New instrinsics
>   function def file.
>   * config/aarch64/arm_neon_sve_bridge.h: New header file.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.c-torture/execute/neon-sve-bridge.c: New test.
>
> #
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 
> d88071773c9e1280cc5f38e36e09573214323b48..ca55992200dbe58782c3dbf66906339de021ba6b
>  
> 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -334,7 +334,7 @@ m32c*-*-*)
>;;
>aarch64*-*-*)
>   cpu_type=aarch64
> - extra_headers="arm_fp16.h arm_neon.h arm_bf16.h arm_acle.h arm_sve.h"
> + extra_headers="arm_fp16.h arm_neon.h arm_bf16.h arm_acle.h arm_sve.h 
> arm_neon_sve_bridge.h"
>   c_target_objs="aarch64-c.o"
>   cxx_target_objs="aarch64-c.o"
>   d_target_objs="aarch64-d.o"
> diff --git a/gcc/config/aarch64/aarch64-builtins.h 
> b/gcc/config/aarch64/aarch64-builtins.h
> new file mode 100644
> index 
> ..eebde448f92c230c8f88b4da1ca8ebd9670b1536
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-builtins.h
> @@ -0,0 +1,86 @@
> +/* Builtins' description for AArch64 SIMD architecture.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of GCC.
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   .  */
> +#ifndef GCC_AARCH64_BUILTINS_H
> +#define GCC_AARCH64_BUILTINS_H
> +#include "tree.h"

It looks like the include shouldn't be needed.  tree is forward-declared
in coretypes.h, which is included everywhere.

> +enum aarch64_type_qualifiers
> +{
> +  /* T foo.  */
> +  qualifier_none = 0x0,
> +  /* unsigned T foo.  */
> +  qualifier_unsigned = 0x1, /* 1 << 0  */
> +  /* const T foo.  */
> +  qualifier_const = 0x2, /* 1 << 1  */
> +  /* T *foo.  */
> +  qualifier_pointer = 0x4, /* 1 << 2  */
> +  /* Used when expanding arguments if an operand could
> + be an immediate.  */
> +  qualifier_immediate = 0x8, /* 1 << 3  */
> +  qualifier_maybe_immediate = 0x10, /* 1 << 4  */
> +  /* void foo (...).  */
> +  qualifier_void = 0x20, /* 1 << 5  */
> +  /* 1 << 6 is now unused */
> +  /* Some builtins should use the T_*mode* encoded in a simd_builtin_datum
> + rather than using the type of the operand.  */
> +  qualifier_map_mode = 0x80, /* 1 << 7  */
> +  /* qualifier_pointer | qualifier_map_mode  */
> +  

Re: [PATCH][GCC] aarch64: Add support for Cortex-A520 CPU

2023-08-08 Thread Richard Sandiford via Gcc-patches
Richard Ball  writes:
> This patch adds support for the Cortex-A520 CPU to GCC.
>
> No regressions on aarch64-none-elf.
>
> Ok for master?
>
>
> gcc/ChangeLog:
>
>      * config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 
> Cortex-A520 CPU.
>      * config/aarch64/aarch64-tune.md: Regenerate.
>      * doc/invoke.texi: Document Cortex-A520 CPU.

OK, thanks.

Richard

> ###
>
> diff --git a/gcc/config/aarch64/aarch64-cores.def 
> b/gcc/config/aarch64/aarch64-cores.def
> index 
> 2ec88c98400d5a2d7bdb954baca9e2664d2885ac..dbac497ef3aab410eb81db185b2e9532186888bb
>  
> 100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -170,6 +170,8 @@ AARCH64_CORE("cortex-r82", cortexr82, cortexa53, 
> V8R, (), cortexa53, 0x41, 0xd15
>   /* Arm ('A') cores. */
>   AARCH64_CORE("cortex-a510",  cortexa510, cortexa55, V9A, 
> (SVE2_BITPERM, MEMTAG, I8MM, BF16), cortexa53, 0x41, 0xd46, -1)
>
> +AARCH64_CORE("cortex-a520",  cortexa520, cortexa55, V9_2A, 
> (SVE2_BITPERM, MEMTAG), cortexa53, 0x41, 0xd80, -1)
> +
>   AARCH64_CORE("cortex-a710",  cortexa710, cortexa57, V9A, 
> (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd47, -1)
>
>   AARCH64_CORE("cortex-a715",  cortexa715, cortexa57, V9A, 
> (SVE2_BITPERM, MEMTAG, I8MM, BF16), neoversen2, 0x41, 0xd4d, -1)
> diff --git a/gcc/config/aarch64/aarch64-tune.md 
> b/gcc/config/aarch64/aarch64-tune.md
> index 
> 4fd35fa4884617b901b9ae6faea2f39975c4f4b2..2170980dddb0d5d410a49631ad26ff2e346b39dd
>  
> 100644
> --- a/gcc/config/aarch64/aarch64-tune.md
> +++ b/gcc/config/aarch64/aarch64-tune.md
> @@ -1,5 +1,5 @@
>   ;; -*- buffer-read-only: t -*-
>   ;; Generated automatically by gentune.sh from aarch64-cores.def
>   (define_attr "tune"
> - 
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
> + 
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88p1,thunderxt88,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,ares,neoversen1,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,zeus,neoversev1,neoverse512tvb,saphira,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexx2,cortexx3,neoversen2,demeter,neoversev2"
>   (const (symbol_ref "((enum attr_tune) aarch64_tune)")))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 
> 104766f446d118d30e9e2bfd6cd485255f54ab5f..2c870d3c34b587ffc721b1f18f99ecd66d4217be
>  
> 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -20516,8 +20516,8 @@ performance of the code.  Permissible values for 
> this option are:
>   @samp{cortex-a73.cortex-a35}, @samp{cortex-a73.cortex-a53},
>   @samp{cortex-a75.cortex-a55}, @samp{cortex-a76.cortex-a55},
>   @samp{cortex-r82}, @samp{cortex-x1}, @samp{cortex-x1c}, @samp{cortex-x2},
> -@samp{cortex-x3}, @samp{cortex-a510}, @samp{cortex-a710}, 
> @samp{cortex-a715},
> -@samp{ampere1}, @samp{ampere1a}, and @samp{native}.
> +@samp{cortex-x3}, @samp{cortex-a510}, @samp{cortex-a520}, 
> @samp{cortex-a710},
> +@samp{cortex-a715}, @samp{ampere1}, @samp{ampere1a}, and @samp{native}.
>
>   The values @samp{cortex-a57.cortex-a53}, @samp{cortex-a72.cortex-a53},
>   @samp{cortex-a73.cortex-a35}, @samp{cortex-a73.cortex-a53},


Re: [PATCH] aarch64: enable mixed-types for aarch64 simdclones

2023-08-08 Thread Richard Sandiford via Gcc-patches
"Andre Vieira (lists)"  writes:
> Hi,
>
> This patch enables the use of mixed-types for simd clones for AArch64 
> and adds aarch64 as a target_vect_simd_clones.
>
> Bootstrapped and regression tested on aarch64-unknown-linux-gnu
>
> gcc/ChangeLog:
>
>  * config/aarch64/aarch64.cc (currently_supported_simd_type): 
> Remove.
>  (aarch64_simd_clone_compute_vecsize_and_simdlen): Use NFS type 
> to determine simdlen.
>
> gcc/testsuite/ChangeLog:
>
>  * lib/target-supports.exp: Add aarch64 targets to vect_simd_clones.
>  * c-c++-common/gomp/declare-variant-14.c: Add aarch64 checks 
> and remove warning check.
>  * g++.dg/gomp/attrs-10.C: Likewise.
>  * g++.dg/gomp/declare-simd-1.C: Likewise.
>  * g++.dg/gomp/declare-simd-3.C: Likewise.
>  * g++.dg/gomp/declare-simd-4.C: Likewise.
>  * gcc.dg/gomp/declare-simd-3.c: Likewise.
>  * gcc.dg/gomp/simd-clones-2.c: Likewise.
>  * gfortran.dg/gomp/declare-variant-14.f90: Likewise.
>  * c-c++-common/gomp/pr60823-1.c: Remove warning check.
>  * c-c++-common/gomp/pr60823-3.c: Likewise.
>  * g++.dg/gomp/declare-simd-7.C: Likewise.
>  * g++.dg/gomp/declare-simd-8.C: Likewise.
>  * g++.dg/gomp/pr88182.C: Likewise.
>  * gcc.dg/declare-simd.c: Likewise.
>  * gcc.dg/gomp/declare-simd-1.c: Likewise.
>  * gcc.dg/gomp/pr87895-1.c: Likewise.
>  * gfortran.dg/gomp/declare-simd-2.f90: Likewise.
>  * gfortran.dg/gomp/declare-simd-coarray-lib.f90: Likewise.
>  * gfortran.dg/gomp/pr79154-1.f90: Likewise.
>  * gfortran.dg/gomp/pr83977.f90: Likewise.
>  * gcc.dg/gomp/pr87887-1.c: Add warning test.
>  * gcc.dg/gomp/pr89246-1.c: Likewise.
>  * gcc.dg/gomp/pr99542.c: Update warning test.
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 560e5431636ef46c41d56faa0c4e95be78f64b50..ac6350a44481628a947a0f20e034acf92cde63ec
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -27194,21 +27194,6 @@ supported_simd_type (tree t)
>return false;
>  }
>  
> -/* Return true for types that currently are supported as SIMD return
> -   or argument types.  */
> -
> -static bool
> -currently_supported_simd_type (tree t, tree b)
> -{
> -  if (COMPLEX_FLOAT_TYPE_P (t))
> -return false;
> -
> -  if (TYPE_SIZE (t) != TYPE_SIZE (b))
> -return false;
> -
> -  return supported_simd_type (t);
> -}
> -
>  /* Implement TARGET_SIMD_CLONE_COMPUTE_VECSIZE_AND_SIMDLEN.  */
>  
>  static int
> @@ -27217,7 +27202,7 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
> (struct cgraph_node *node,
>   tree base_type, int num,
>   bool explicit_p)
>  {
> -  tree t, ret_type;
> +  tree t, ret_type, nfs_type;
>unsigned int elt_bits, count;
>unsigned HOST_WIDE_INT const_simdlen;
>poly_uint64 vec_bits;
> @@ -27240,55 +27225,61 @@ aarch64_simd_clone_compute_vecsize_and_simdlen 
> (struct cgraph_node *node,
>  }
>  
>ret_type = TREE_TYPE (TREE_TYPE (node->decl));
> +  /* According to AArch64's Vector ABI the type that determines the simdlen 
> is
> + the narrowest of types, so we ignore base_type for AArch64.  */
>if (TREE_CODE (ret_type) != VOID_TYPE
> -  && !currently_supported_simd_type (ret_type, base_type))
> +  && !supported_simd_type (ret_type))
>  {
>if (!explicit_p)
>   ;
> -  else if (TYPE_SIZE (ret_type) != TYPE_SIZE (base_type))
> - warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
> - "GCC does not currently support mixed size types "
> - "for % functions");
> -  else if (supported_simd_type (ret_type))
> +  else if (COMPLEX_FLOAT_TYPE_P (ret_type))
>   warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
>   "GCC does not currently support return type %qT "
> - "for % functions", ret_type);
> + "for simd", ret_type);
>else
>   warning_at (DECL_SOURCE_LOCATION (node->decl), 0,
> - "unsupported return type %qT for % functions",
> + "unsupported return type %qT for simd",
>   ret_type);

What's the reason for s/%<simd%> functions/simd/, in particular for
dropping the quotes around simd?

>return 0;
>  }
>  
> +  nfs_type = ret_type;

Genuine question, but what does nfs stand for in this context?

>int i;
>tree type_arg_types = TYPE_ARG_TYPES (TREE_TYPE (node->decl));
>bool decl_arg_p = (node->definition || type_arg_types == NULL_TREE);
> -
>for (t = (decl_arg_p ? DECL_ARGUMENTS (node->decl) : type_arg_types), i = 
> 0;
> t && t != void_list_node; t = TREE_CHAIN (t), i++)
>  {
>tree arg_type = decl_arg_p ? TREE_TYPE (t) : TREE_VALUE (t);
> -
>if (clonei->args[i].arg_type != SIMD_CLONE_ARG_TYPE_UNIFORM
> -  

Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

2023-08-08 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Fri, 4 Aug 2023 at 20:36, Richard Sandiford
>  wrote:
>>
>> Full review this time, sorry for the skipping the tests earlier.
> Thanks for the detailed review! Please find my responses inline below.
>>
>> Prathamesh Kulkarni  writes:
>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > index 7e5494dfd39..680d0e54fd4 100644
>> > --- a/gcc/fold-const.cc
>> > +++ b/gcc/fold-const.cc
>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "vec-perm-indices.h"
>> >  #include "asan.h"
>> >  #include "gimple-range.h"
>> > +#include <algorithm>
>>
>> This should be included by defining INCLUDE_ALGORITHM instead.
> Done. Just curious, why do we use this macro instead of directly
> including <algorithm>?

AIUI, one of the reasons for having every file start with includes
of config.h and (b)system.h, in that order, is to ensure that a small
and predictable amount of GCC-specific stuff happens before including
the system header files.  That helps to avoid OS-specific clashes between
GCC code and system headers.

But another major reason is that system.h ends by poisoning a lot of
stuff that system headers would be entitled to use.

>> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
>> > +
>> > +  // Fill a0 for each pattern
>> > +  for (unsigned i = 0; i < npatterns; i++)
>> > +builder.quick_push (build_int_cst (inner_type, rand () % 100));
>> > +
>> > +  if (nelts_per_pattern == 1)
>> > +return builder.build ();
>> > +
>> > +  // Fill a1 for each pattern
>> > +  for (unsigned i = 0; i < npatterns; i++)
>> > +builder.quick_push (build_int_cst (inner_type, rand () % 100));
>> > +
>> > +  if (nelts_per_pattern == 2)
>> > +return builder.build ();
>> > +
>> > +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
>> > +{
>> > +  tree prev_elem = builder[i - npatterns];
>> > +  int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
>> > +  int val = prev_elem_val + S;
>> > +  builder.quick_push (build_int_cst (inner_type, val));
>> > +}
>> > +
>> > +  return builder.build ();
>> > +}
>> > +
>> > +static void
>> > +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
>> > +   tree res, tree *expected_res)
>> > +{
>> > +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
>> > +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
>>
>> I don't think this is safe when the inputs are randomised.  E.g. we
>> could by chance end up with a vector of all zeros, which would have
>> a single pattern and a single element per pattern, regardless of the
>> shapes of the inputs.
>>
>> Given the way that vector_builder::finalize
>> canonicalises the encoding, it should be safe to use:
>>
>> * VECTOR_CST_NPATTERNS (res) <= npatterns
>> * vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern
>>
>> If we do that then...
>>
>> > +
>> > +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
>>
>> ...this loop bound should be npatterns * nelts_per_pattern instead.
> Ah indeed. Fixed, thanks.

The patch instead does:

  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) <= nelts_per_pattern);

I think the version I suggested is safer.  It's not the goal of the
canonicalisation algorithm to reduce both npatterns and nelts_per_pattern
individually.  The algorithm can increase nelts_per_pattern in order
to decrease npatterns.
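
Concretely, such a helper might look like this (a sketch using the names
from the patch; untested):

  static void
  validate_res (unsigned npatterns, unsigned nelts_per_pattern,
                tree res, tree *expected_res)
  {
    /* Canonicalisation can trade npatterns for nelts_per_pattern, so only
       these combined bounds are safe to assert.  */
    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
    ASSERT_TRUE (vector_cst_encoded_nelts (res)
                 <= npatterns * nelts_per_pattern);
    for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
                                    expected_res[i], 0));
  }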

>> > +  {
>> > +tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
>> > +tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
>> > +poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +
>> > +vec_perm_builder builder (arg0_len, 1, 3);
>> > +builder.quick_push (arg0_len);
>> > +builder.quick_push (arg0_len + 1);
>> > +builder.quick_push (arg0_len + 2);
>> > +
>> > +vec_perm_indices sel (builder, 2, arg0_len);
>> > +tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, 
>> > NULL, true);
>> > +tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt 
>> > (arg1, 1),
>> > + vector_cst_elt (arg1, 2) };
>> > +validate_res (1, 3, res, expected_res);
>> > +  }
>> > +
>> > +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
>> > + sel = {len, 0, 0, 0, 2, 0, ...}
>> > + npatterns = 2, nelts_per_pattern = 3.
>> > + Use extra pattern {0, ...} to lower number of elements per pattern.  
>> > */
>> > +  {
>> > +tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
>> > +tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
>> > +poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +
>> > +vec_perm_builder builder (arg0_len, 2, 3);
>> > +builder.quick_push (arg0_len);
>> > +int mask_elems[] = { 0, 0, 0, 2, 0 };
>> > +for (int i = 0; i < 5; i++)
>> > +  builder.quick_push 

Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

2023-08-04 Thread Richard Sandiford via Gcc-patches
Full review this time, sorry for skipping the tests earlier.

Prathamesh Kulkarni  writes:
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 7e5494dfd39..680d0e54fd4 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "vec-perm-indices.h"
>  #include "asan.h"
>  #include "gimple-range.h"
> +#include <algorithm>

This should be included by defining INCLUDE_ALGORITHM instead.

> +#include "tree-pretty-print.h"
> +#include "gimple-pretty-print.h"
> +#include "print-tree.h"

Are these still needed, or were they for debugging?

>  
>  /* Nonzero if we are folding constants inside an initializer or a C++
> manifestly-constant-evaluated context; zero otherwise.
> @@ -10494,15 +10498,9 @@ fold_mult_zconjz (location_t loc, tree type, tree 
> expr)
>  static bool
>  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>  {
> -  unsigned HOST_WIDE_INT i, nunits;
> +  unsigned HOST_WIDE_INT i;
>  
> -  if (TREE_CODE (arg) == VECTOR_CST
> -  && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> -{
> -  for (i = 0; i < nunits; ++i)
> - elts[i] = VECTOR_CST_ELT (arg, i);
> -}
> -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> +  if (TREE_CODE (arg) == CONSTRUCTOR)
>  {
>constructor_elt *elt;
>  
> @@ -10520,6 +10518,192 @@ vec_cst_ctor_to_array (tree arg, unsigned int 
> nelts, tree *elts)
>return true;
>  }
>  
> +/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
> +   mask for VLA vec_perm folding.
> +   REASON if specified, will contain the reason why SEL is not suitable.
> +   Used only for debugging and unit-testing.
> +   VERBOSE if enabled is used for debugging output.  */
> +
> +static bool
> +valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
> + const vec_perm_indices &sel,
> + const char **reason = NULL,
> + ATTRIBUTE_UNUSED bool verbose = false)

Since verbose is no longer needed (good!), I think we should just remove it.

> +{
> +  unsigned sel_npatterns = sel.encoding ().npatterns ();
> +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +
> +  if (!(pow2p_hwi (sel_npatterns)
> + && pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
> + && pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
> +{
> +  if (reason)
> + *reason = "npatterns is not power of 2";
> +  return false;
> +}
> +
> +  /* We want to avoid cases where sel.length is not a multiple of npatterns.
> + For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
> +  poly_uint64 esel;
> +  if (!multiple_p (sel.length (), sel_npatterns, &esel))
> +{
> +  if (reason)
> + *reason = "sel.length is not multiple of sel_npatterns";
> +  return false;
> +}
> +
> +  if (sel_nelts_per_pattern < 3)
> +return true;
> +
> +  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
> +{
> +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> +  poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> +  HOST_WIDE_INT S; 

Trailing whitespace.  The convention is to use lowercase variable
names, so please call this "step".

> +  if (!poly_int64 (a2 - a1).is_constant (&S))
> + {
> +   if (reason)
> + *reason = "step is not constant";
> +   return false;
> + }
> +  // FIXME: Punt on S < 0 for now, revisit later.
> +  if (S < 0)
> + return false;
> +  if (S == 0)
> + continue;
> +
> +  if (!pow2p_hwi (S))
> + {
> +   if (reason)
> + *reason = "step is not power of 2";
> +   return false;
> + }
> +
> +  /* Ensure that stepped sequence of the pattern selects elements
> +  only from the same input vector if it's VLA.  */

s/ if it's VLA//

> +  uint64_t q1, qe;
> +  poly_uint64 r1, re;
> +  poly_uint64 ae = a1 + (esel - 2) * S;
> +  poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +  if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
> + && can_div_trunc_p (ae, arg_len, &qe, &re)
> + && q1 == qe))
> + {
> +   if (reason)
> + *reason = "crossed input vectors";
> +   return false;
> + }
> +

Probably worth a comment above the following code too:

  /* Ensure that the stepped sequence always selects from the same
 input pattern.  */

> +  unsigned arg_npatterns
> + = ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
> +   : VECTOR_CST_NPATTERNS (arg1);
> +
> +  if (!multiple_p (S, arg_npatterns))
> + {
> +   if (reason)
> + *reason = "S is not multiple of npatterns";
> +   return false;
> + }
> +}
> +
> +  return true;
> +}
> +
> +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> +   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
> +   REASON and VERBOSE have same purpose as described in
> +   

Re: [PATCH] tree-optimization/110838 - vectorization of widened right shifts

2023-08-04 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> The following fixes a problem with my last attempt of avoiding
> out-of-bound shift values for vectorized right shifts of widened
> operands.  Instead of truncating the shift amount with a bitwise
> and we actually need to saturate it to the target precision.
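
For illustration (a standalone example, not taken from the patch): once the
shift is narrowed to 16 bits, the testcase's b[0] = -8 shows the difference:

  short b0 = -8;
  int widened   = b0 >> 16;        /* original semantics: -1 */
  int masked    = b0 >> (16 & 15); /* amount becomes 0, result -8: wrong */
  int saturated = b0 >> 15;        /* MIN (16, 15) == 15, result -1: right,
                                      given arithmetic right shifts */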
>
> The following does that and adds test coverage for the constant
> and invariant but variable case that would previously have failed.
>
> Bootstrap & regtest on x86_64-unknown-linux-gnu in progress, I plan
> to push this soon, just in case you have any comments here.

LGTM FWIW.

Richard

> Richard.
>
>   PR tree-optimization/110838
>   * tree-vect-patterns.cc (vect_recog_over_widening_pattern):
>   Fix right-shift value sanitizing.  Properly emit external
>   def mangling in the preheader rather than in the pattern
>   def sequence where it will fail vectorizing.
>
>   * gcc.dg/vect/pr110838.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/pr110838.c | 31 
>  gcc/tree-vect-patterns.cc| 22 +++-
>  2 files changed, 48 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/pr110838.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/pr110838.c 
> b/gcc/testsuite/gcc.dg/vect/pr110838.c
> new file mode 100644
> index 000..cf8765be603
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr110838.c
> @@ -0,0 +1,31 @@
> +/* { dg-do run } */
> +
> +#include "tree-vect.h"
> +
> +short a[32], b[32];
> +
> +void __attribute__((noipa)) foo ()
> +{
> +  for (int i = 0; i < 32; ++i)
> +a[i] = b[i] >> 16;
> +}
> +
> +void __attribute__((noipa)) bar (int n)
> +{
> +  int np = n & 31;
> +  for (int i = 0; i < 32; ++i)
> +a[i] = b[i] >> np;
> +}
> +
> +int main ()
> +{
> +  check_vect ();
> +  b[0] = -8;
> +  foo ();
> +  if (a[0] != -1)
> +abort ();
> +  bar (16);
> +  if (a[0] != -1)
> +abort ();
> +  return 0;
> +}
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index e4ab8c2d65b..2cedf238450 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3109,8 +3109,8 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
>wide_int min_value, max_value;
>if (TREE_CODE (ops[1]) == INTEGER_CST)
>   ops[1] = wide_int_to_tree (op_type,
> -wi::bit_and (wi::to_wide (ops[1]),
> - new_precision - 1));
> +wi::umin (wi::to_wide (ops[1]),
> +  new_precision - 1));
>else if (!vect_get_range_info (ops[1], &min_value, &max_value)
>  || wi::ge_p (max_value, new_precision, TYPE_SIGN (op_type)))
>   {
> @@ -3118,11 +3118,23 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
>same argument widened shifts and it un-CSEs same arguments.  */
> tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
> gimple *pattern_stmt
> - = gimple_build_assign (new_var, BIT_AND_EXPR, ops[1],
> + = gimple_build_assign (new_var, MIN_EXPR, ops[1],
>  build_int_cst (op_type, new_precision - 1));
> -   ops[1] = new_var;
> gimple_set_location (pattern_stmt, gimple_location (last_stmt));
> -   append_pattern_def_seq (vinfo, last_stmt_info, pattern_stmt);
> +   if (unprom[1].dt == vect_external_def)
> + {
> +   if (edge e = vect_get_external_def_edge (vinfo, ops[1]))
> + {
> +   basic_block new_bb
> + = gsi_insert_on_edge_immediate (e, pattern_stmt);
> +   gcc_assert (!new_bb);
> + }
> +   else
> + return NULL;
> + }
> +   else
> + append_pattern_def_seq (vinfo, last_stmt_info, pattern_stmt);
> +   ops[1] = new_var;
>   }
>  }


Re: [RFC] Combine zero_extract and sign_extend for TARGET_TRULY_NOOP_TRUNCATION

2023-08-04 Thread Richard Sandiford via Gcc-patches
YunQiang Su  writes:
> PR #104914
>
> On TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode)) == true platforms,
> zero_extract (SI, SI) can be sign-extended.  So, if a zero_extract (DI,
> DI) following with an sign_extend(SI, DI) can be merged to a single
> zero_extract (SI, SI).
>
> gcc/ChangeLog:
>   PR: 104914.
>   * combine.cc (try_combine): Combine zero_extract (DI, DI) and
> following sign_extend (DI, SI) for
> TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode)) == true.
> (subst): Allow replacing reg(DI) with subreg(SI (reg DI))
> if to is SImode and from is DImode for
> TRULY_NOOP_TRUNCATION_MODES_P (DImode, SImode)) == true.
>
> gcc/testsuite/ChangeLog:
>   PR: 104914.
>   * gcc.target/mips/pr104914.c: New testcase.
> ---
>  gcc/combine.cc   | 88 
>  gcc/testsuite/gcc.target/mips/pr104914.c | 17 +
>  2 files changed, 90 insertions(+), 15 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/mips/pr104914.c
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index e46d202d0a7..701b7c33b17 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -3294,15 +3294,64 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn 
> *i1, rtx_insn *i0,
>n_occurrences = 0; /* `subst' counts here */
>subst_low_luid = DF_INSN_LUID (i2);
>  
> -  /* If I1 feeds into I2 and I1DEST is in I1SRC, we need to make a unique
> -  copy of I2SRC each time we substitute it, in order to avoid creating
> -  self-referential RTL when we will be substituting I1SRC for I1DEST
> -  later.  Likewise if I0 feeds into I2, either directly or indirectly
> -  through I1, and I0DEST is in I0SRC.  */
> -  newpat = subst (PATTERN (i3), i2dest, i2src, false, false,
> -   (i1_feeds_i2_n && i1dest_in_i1src)
> -   || ((i0_feeds_i2_n || (i0_feeds_i1_n && i1_feeds_i2_n))
> -   && i0dest_in_i0src));
> +  /* Try to combine zero_extract (DImode) and sign_extend (SImode to 
> DImode)
> +  for TARGET_TRULY_NOOP_TRUNCATION.  The RTL may look like:
> +
> +  (insn 10 49 11 2 (set (zero_extract:DI (reg/v:DI 200 [ val ])
> + (const_int 8 [0x8])
> + (const_int 0 [0]))
> +  (subreg:DI (reg:QI 202 [ *buf_8(D) ]) 0)) "xx.c":4:29 278 {*insvdi}
> +  (expr_list:REG_DEAD (reg:QI 202 [ *buf_8(D) ]) (nil)))
> +  (insn 11 10 12 2 (set (reg/v:DI 200 [ val ])
> +
> +  (sign_extend:DI (subreg:SI (reg/v:DI 200 [ val ]) 0))) 238 
> {extendsidi2}
> +  (nil))

Like I mentioned in the other thread, I think things went wrong when
we generated the subreg in this sign_extend.  The operation should
have been a truncate of (reg/v:DI 200) followed by a sign extension
of the result.

What piece of code is generating the subreg?

Thanks,
Richard


Re: [PATCH]AArch64 update costing for MLA by invariant

2023-08-03 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> >> Do you see vect_constant_defs in practice, or is this just for 
>> >> completeness?
>> >> I would expect any constants to appear as direct operands.  I don't
>> >> mind keeping it if it's just a belt-and-braces thing though.
>> >
>> > In the latency case, where I had allow_constants, the early rejection
>> > based on the operand itself doesn't trigger, so in that case I still
>> > needed to reject constants, but after the multiply check.  While they do
>> > appear as direct operands as well, they also have their own nodes, in
>> > particular for SLP, where the constants are handled as a group.
>> 
>> Ah, OK, thanks.
>> 
>> > But can also check CONSTANT_CLASS_P (rhs) if that's preferrable.
>> 
>> No, what you did is more correct.  I just wasn't sure at first which case it 
>> was
>> handling.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_multiply_add_p): Update handling
>   of constants. 
>   (aarch64_adjust_stmt_cost): Use it.
>   (aarch64_vector_costs::count_ops): Likewise.
>   (aarch64_vector_costs::add_stmt_cost): Pass vinfo to
>   aarch64_adjust_stmt_cost.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> d4d7602554592b9042b8eaf389eff1ec80c2090e..7cc5916ce06b2635346c807da9306738b939ebc6
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16410,10 +16410,6 @@ aarch64_multiply_add_p (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>if (code != PLUS_EXPR && code != MINUS_EXPR)
>  return false;
>  
> -  if (CONSTANT_CLASS_P (gimple_assign_rhs1 (assign))
> -  || CONSTANT_CLASS_P (gimple_assign_rhs2 (assign)))
> -return false;
> -
>for (int i = 1; i < 3; ++i)
>  {
>tree rhs = gimple_op (assign, i);
> @@ -16441,7 +16437,8 @@ aarch64_multiply_add_p (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>   return false;
> def_stmt_info = vinfo->lookup_def (rhs);
> if (!def_stmt_info
> -   || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
> +   || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def
> +   || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_constant_def)
>   return false;
>   }
>  
> @@ -16721,8 +16718,9 @@ aarch64_sve_adjust_stmt_cost (class vec_info *vinfo, 
> vect_cost_for_stmt kind,
> and which when vectorized would operate on vector type VECTYPE.  Add the
> cost of any embedded operations.  */
>  static fractional_cost
> -aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
> -   tree vectype, fractional_cost stmt_cost)
> +aarch64_adjust_stmt_cost (vec_info *vinfo, vect_cost_for_stmt kind,
> +   stmt_vec_info stmt_info, tree vectype,
> +   unsigned vec_flags, fractional_cost stmt_cost)
>  {
>if (vectype)
>  {
> @@ -16745,6 +16743,14 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
> stmt_vec_info stmt_info,
> break;
>   }
>  
> +  gassign *assign = dyn_cast (STMT_VINFO_STMT (stmt_info));
> +  if (assign && !vect_is_reduction (stmt_info))
> + {
> +   /* For MLA we need to reduce the cost since MLA is 1 instruction.  */
> +   if (aarch64_multiply_add_p (vinfo, stmt_info, vec_flags))
> + return 0;
> + }
> +
>if (kind == vector_stmt || kind == vec_to_scalar)
>   if (tree cmp_type = vect_embedded_comparison_type (stmt_info))
> {
> @@ -16814,7 +16820,8 @@ aarch64_vector_costs::count_ops (unsigned int count, 
> vect_cost_for_stmt kind,
>  }
>  
>/* Assume that multiply-adds will become a single operation.  */
> -  if (stmt_info && aarch64_multiply_add_p (m_vinfo, stmt_info, m_vec_flags))
> +  if (stmt_info
> +  && aarch64_multiply_add_p (m_vinfo, stmt_info, m_vec_flags))
>  return;
>  
>/* Count the basic operation cost associated with KIND.  */

There's no need for this change now that there's no extra parameter.

OK with that change, thanks.

Richard

> @@ -17060,8 +17067,8 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>  {
>/* Account for any extra "embedded" costs that apply additively
>to the base cost calculated above.  */
> -  stmt_cost = aarch64_adjust_stmt_cost (kind, stmt_info, vectype,
> - stmt_cost);
> +  stmt_cost = aarch64_adjust_stmt_cost (m_vinfo, kind, stmt_info,
> + vectype, m_vec_flags, stmt_cost);
>  
>/* If we're recording a nonzero vector loop body cost for the
>innermost loop, also estimate the operations that would need


Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

2023-08-03 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Prathamesh Kulkarni  writes:
>> On Tue, 25 Jul 2023 at 18:25, Richard Sandiford
>>  wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the rework and sorry for the slow review.
>> Hi Richard,
>> Thanks for the suggestions!  Please find my responses inline below.
>>>
>>> Prathamesh Kulkarni  writes:
>>> > Hi Richard,
>>> > This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
>>> > The attached patch unifies handling of VLS and VLA vector_csts, while
>>> > using fallback code
>>> > for ctors.
>>> >
>>> > For VLS vector, the patch ignores underlying encoding, and
>>> > uses npatterns = nelts, and nelts_per_pattern = 1.
>>> >
>>> > For VLA patterns, if sel has a stepped sequence, then it
>>> > only chooses elements from a particular pattern of a particular
>>> > input vector.
>>> >
>>> > To make things simpler, the patch imposes following constraints:
>>> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
>>> > (b) The step size for a stepped sequence is a power of 2, and
>>> >   multiple of npatterns of chosen input vector.
>>> > (c) Runtime vector length of sel is a multiple of sel_npatterns.
>>> >  So, we don't handle sel.length = 2 + 2x and npatterns = 4.
>>> >
>>> > Eg:
>>> > op0, op1: npatterns = 2, nelts_per_pattern = 3
>>> > op0_len = op1_len = 16 + 16x.
>>> > sel = { 0, 0, 2, 0, 4, 0, ... }
>>> > npatterns = 2, nelts_per_pattern = 3.
>>> >
>>> > For pattern {0, 2, 4, ...}
>>> > Let,
>>> > a1 = 2
>>> > S = step size = 2
>>> >
>>> > Let Esel denote number of elements per pattern in sel at runtime.
>>> > Esel = (16 + 16x) / npatterns_sel
>>> > = (16 + 16x) / 2
>>> > = (8 + 8x)
>>> >
>>> > So, last element of pattern:
>>> > ae = a1 + (Esel - 2) * S
>>> >  = 2 + (8 + 8x - 2) * 2
>>> >  = 14 + 16x
>>> >
>>> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0
>>> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
>>> > Since both are equal with quotient = 0, we select elements from op0.
>>> >
>>> > Since step size (S) is a multiple of npatterns(op0), we select
>>> > all elements from same pattern of op0.
>>> >
>>> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
>>> >= max (2, max (2, 2)
>>> >= 2
>>> >
>>> > res_nelts_per_pattern = max (op0_nelts_per_pattern,
>>> > max 
>>> > (op1_nelts_per_pattern,
>>> >  
>>> > sel_nelts_per_pattern))
>>> > = max (3, max (3, 3))
>>> > = 3
>>> >
>>> > So res has encoding with npatterns = 2, nelts_per_pattern = 3.
>>> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
>>> >
>>> > Unfortunately, this results in an issue for poly_int_cst index:
>>> > For example,
>>> > op0, op1: npatterns = 1, nelts_per_pattern = 3
>>> > op0_len = op1_len = 4 + 4x
>>> >
>>> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>>> >
>>> > In this case,
>>> > a1 = 5 + 4x
>>> > S = (6 + 4x) - (5 + 4x) = 1
>>> > Esel = 4 + 4x
>>> >
>>> > ae = a1 + (esel - 2) * S
>>> >  = (5 + 4x) + (4 + 4x - 2) * 1
>>> >  = 7 + 8x
>>> >
>>> > IIUC, 7 + 8x will always be index for last element of op1 ?
>>> > if x = 0, len = 4, 7 + 8x = 7
>>> > if x = 1, len = 8, 7 + 8x = 15, etc.
>>> > So the stepped sequence will always choose elements
>>> > from op1 regardless of vector length for above case ?
>>> >
>>> > However,
>>> > ae /trunc op0_len
>>> > = (7 + 8x) / (4 + 4x)
>>> > which is not defined because 7/4 != 8/4
>>> > and we return NULL_TREE, but I suppose the expected result would be:
>>> > res: { op1[0], op1[1], op1[2], ... } ?
>>> >
>>> > The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
>>> > and on x86_64-unknown-linux-gnu.
>>> > I would be grateful for suggestions on how to proceed.
>>> >
>>> > Thanks,
>>> > Prathamesh
>>> >
>>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>>> > index a02ede79fed..8028b3e8e9a 100644
>>> > --- a/gcc/fold-const.cc
>>> > +++ b/gcc/fold-const.cc
>>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>>> >  #include "vec-perm-indices.h"
>>> >  #include "asan.h"
>>> >  #include "gimple-range.h"
>>> > +#include 
>>> > +#include "tree-pretty-print.h"
>>> > +#include "gimple-pretty-print.h"
>>> > +#include "print-tree.h"
>>> >
>>> >  /* Nonzero if we are folding constants inside an initializer or a C++
>>> > manifestly-constant-evaluated context; zero otherwise.
>>> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, 
>>> > tree expr)
>>> >  static bool
>>> >  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>>> >  {
>>> > -  unsigned HOST_WIDE_INT i, nunits;
>>> > +  unsigned HOST_WIDE_INT i;
>>> >
>>> > -  if (TREE_CODE (arg) == VECTOR_CST
>>> > -  && VECTOR_CST_NELTS (arg).is_constant ())
>>> > -{
>>> > -  for (i = 0; i < nunits; 

Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

2023-08-03 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Tue, 25 Jul 2023 at 18:25, Richard Sandiford
>  wrote:
>>
>> Hi,
>>
>> Thanks for the rework and sorry for the slow review.
> Hi Richard,
> Thanks for the suggestions!  Please find my responses inline below.
>>
>> Prathamesh Kulkarni  writes:
>> > Hi Richard,
>> > This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
>> > The attached patch unifies handling of VLS and VLA vector_csts, while
>> > using fallback code
>> > for ctors.
>> >
>> > For VLS vector, the patch ignores underlying encoding, and
>> > uses npatterns = nelts, and nelts_per_pattern = 1.
>> >
>> > For VLA patterns, if sel has a stepped sequence, then it
>> > only chooses elements from a particular pattern of a particular
>> > input vector.
>> >
>> > To make things simpler, the patch imposes following constraints:
>> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
>> > (b) The step size for a stepped sequence is a power of 2, and
>> >   multiple of npatterns of chosen input vector.
>> > (c) Runtime vector length of sel is a multiple of sel_npatterns.
>> >  So, we don't handle sel.length = 2 + 2x and npatterns = 4.
>> >
>> > Eg:
>> > op0, op1: npatterns = 2, nelts_per_pattern = 3
>> > op0_len = op1_len = 16 + 16x.
>> > sel = { 0, 0, 2, 0, 4, 0, ... }
>> > npatterns = 2, nelts_per_pattern = 3.
>> >
>> > For pattern {0, 2, 4, ...}
>> > Let,
>> > a1 = 2
>> > S = step size = 2
>> >
>> > Let Esel denote number of elements per pattern in sel at runtime.
>> > Esel = (16 + 16x) / npatterns_sel
>> > = (16 + 16x) / 2
>> > = (8 + 8x)
>> >
>> > So, last element of pattern:
>> > ae = a1 + (Esel - 2) * S
>> >  = 2 + (8 + 8x - 2) * 2
>> >  = 14 + 16x
>> >
>> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0
>> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
>> > Since both are equal with quotient = 0, we select elements from op0.
>> >
>> > Since step size (S) is a multiple of npatterns(op0), we select
>> > all elements from same pattern of op0.
>> >
>> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
>> >= max (2, max (2, 2)
>> >= 2
>> >
>> > res_nelts_per_pattern = max (op0_nelts_per_pattern,
>> > max (op1_nelts_per_pattern,
>> >  
>> > sel_nelts_per_pattern))
>> > = max (3, max (3, 3))
>> > = 3
>> >
>> > So res has encoding with npatterns = 2, nelts_per_pattern = 3.
>> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
>> >
>> > Unfortunately, this results in an issue for poly_int_cst index:
>> > For example,
>> > op0, op1: npatterns = 1, nelts_per_pattern = 3
>> > op0_len = op1_len = 4 + 4x
>> >
>> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>> >
>> > In this case,
>> > a1 = 5 + 4x
>> > S = (6 + 4x) - (5 + 4x) = 1
>> > Esel = 4 + 4x
>> >
>> > ae = a1 + (esel - 2) * S
>> >  = (5 + 4x) + (4 + 4x - 2) * 1
>> >  = 7 + 8x
>> >
>> > IIUC, 7 + 8x will always be index for last element of op1 ?
>> > if x = 0, len = 4, 7 + 8x = 7
>> > if x = 1, len = 8, 7 + 8x = 15, etc.
>> > So the stepped sequence will always choose elements
>> > from op1 regardless of vector length for above case ?
>> >
>> > However,
>> > ae /trunc op0_len
>> > = (7 + 8x) / (4 + 4x)
>> > which is not defined because 7/4 != 8/4
>> > and we return NULL_TREE, but I suppose the expected result would be:
>> > res: { op1[0], op1[1], op1[2], ... } ?
>> >
>> > The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
>> > and on x86_64-unknown-linux-gnu.
>> > I would be grateful for suggestions on how to proceed.
>> >
>> > Thanks,
>> > Prathamesh
>> >
>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > index a02ede79fed..8028b3e8e9a 100644
>> > --- a/gcc/fold-const.cc
>> > +++ b/gcc/fold-const.cc
>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "vec-perm-indices.h"
>> >  #include "asan.h"
>> >  #include "gimple-range.h"
>> > +#include 
>> > +#include "tree-pretty-print.h"
>> > +#include "gimple-pretty-print.h"
>> > +#include "print-tree.h"
>> >
>> >  /* Nonzero if we are folding constants inside an initializer or a C++
>> > manifestly-constant-evaluated context; zero otherwise.
>> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree 
>> > expr)
>> >  static bool
>> >  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>> >  {
>> > -  unsigned HOST_WIDE_INT i, nunits;
>> > +  unsigned HOST_WIDE_INT i;
>> >
>> > -  if (TREE_CODE (arg) == VECTOR_CST
>> > -  && VECTOR_CST_NELTS (arg).is_constant ())
>> > -{
>> > -  for (i = 0; i < nunits; ++i)
>> > - elts[i] = VECTOR_CST_ELT (arg, i);
>> > -}
>> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
>> > +  if (TREE_CODE (arg) == CONSTRUCTOR)
>> >  {
>> 

Re: [PATCH]AArch64 Undo vec_widen_shiftl optabs [PR106346]

2023-08-03 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> > +
>> > +(define_constraint "D3"
>> > +  "@internal
>> > + A constraint that matches vector of immediates that is with 0 to
>> > +(bits(mode)/2)-1."
>> > + (and (match_code "const,const_vector")
>> > +  (match_test "aarch64_const_vec_all_same_in_range_p (op, 0,
>> > +  (GET_MODE_UNIT_BITSIZE (mode) / 2) - 1)")))
>> 
>> Having this mapping for D2 and D3, with D2 corresponded to prec/2, kind-of
>> makes D3 a false mnemonic.  How about DL instead?  (L for "left-shift long" 
>> or
>> "low-part", take your pick)
>> 
>> Looks good otherwise.
>> 
>
> Wasn't sure if this was an ok with changes or not, so here's the final patch 

I was hoping to have another look before it went in.  But...

> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?

...yeah, LGTM, thanks.

Richard

> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   PR target/106346
>   * config/aarch64/aarch64-simd.md (vec_widen_shiftl_lo_,
>   vec_widen_shiftl_hi_): Remove.
>   (aarch64_shll_internal): Renamed to...
>   (aarch64_shll): .. This.
>   (aarch64_shll2_internal): Renamed to...
>   (aarch64_shll2): .. This.
>   (aarch64_shll_n, aarch64_shll2_n): Re-use new
>   optabs.
>   * config/aarch64/constraints.md (D2, DL): New.
>   * config/aarch64/predicates.md (aarch64_simd_shll_imm_vec): New.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/106346
>   * gcc.target/aarch64/pr98772.c: Adjust assembly.
>   * gcc.target/aarch64/vect-widen-shift.c: New test.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> d95394101470446e55f25a2397dd112239b6a54d..f67eb70577d0c2d9911d8c867d38a4d0b390337c
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -6387,105 +6387,67 @@ (define_insn 
> "aarch64_qshl"
>[(set_attr "type" "neon_sat_shift_reg")]
>  )
>  
> -(define_expand "vec_widen_shiftl_lo_"
> -  [(set (match_operand: 0 "register_operand" "=w")
> - (unspec: [(match_operand:VQW 1 "register_operand" "w")
> -  (match_operand:SI 2
> -"aarch64_simd_shift_imm_bitsize_" "i")]
> -  VSHLL))]
> -  "TARGET_SIMD"
> -  {
> -rtx p = aarch64_simd_vect_par_cnst_half (mode, , false);
> -emit_insn (gen_aarch64_shll_internal (operands[0], 
> operands[1],
> -  p, operands[2]));
> -DONE;
> -  }
> -)
> -
> -(define_expand "vec_widen_shiftl_hi_"
> -   [(set (match_operand: 0 "register_operand")
> - (unspec: [(match_operand:VQW 1 "register_operand" "w")
> -  (match_operand:SI 2
> -"immediate_operand" "i")]
> -   VSHLL))]
> -   "TARGET_SIMD"
> -   {
> -rtx p = aarch64_simd_vect_par_cnst_half (mode, , true);
> -emit_insn (gen_aarch64_shll2_internal (operands[0], 
> operands[1],
> -   p, operands[2]));
> -DONE;
> -   }
> -)
> -
>  ;; vshll_n
>  
> -(define_insn "aarch64_shll_internal"
> -  [(set (match_operand: 0 "register_operand" "=w")
> - (unspec: [(vec_select:
> - (match_operand:VQW 1 "register_operand" "w")
> - (match_operand:VQW 2 "vect_par_cnst_lo_half" ""))
> -  (match_operand:SI 3
> -"aarch64_simd_shift_imm_bitsize_" "i")]
> -  VSHLL))]
> +(define_insn "aarch64_shll"
> +  [(set (match_operand: 0 "register_operand")
> + (ashift: (ANY_EXTEND:
> + (match_operand:VD_BHSI 1 "register_operand"))
> +  (match_operand: 2
> +"aarch64_simd_shll_imm_vec")))]
>"TARGET_SIMD"
> -  {
> -if (INTVAL (operands[3]) == GET_MODE_UNIT_BITSIZE (mode))
> -  return "shll\\t%0., %1., %3";
> -else
> -  return "shll\\t%0., %1., %3";
> +  {@ [cons: =0, 1, 2]
> + [w, w, D2] shll\t%0., %1., %I2
> + [w, w, DL] shll\t%0., %1., %I2
>}
>[(set_attr "type" "neon_shift_imm_long")]
>  )
>  
> -(define_insn "aarch64_shll2_internal"
> -  [(set (match_operand: 0 "register_operand" "=w")
> - (unspec: [(vec_select:
> - (match_operand:VQW 1 "register_operand" "w")
> - (match_operand:VQW 2 "vect_par_cnst_hi_half" ""))
> -  (match_operand:SI 3
> -"aarch64_simd_shift_imm_bitsize_" "i")]
> +(define_expand "aarch64_shll_n"
> +  [(set (match_operand: 0 "register_operand")
> + (unspec: [(match_operand:VD_BHSI 1 "register_operand")
> +  (match_operand:SI 2
> +"aarch64_simd_shift_imm_bitsize_")]
>VSHLL))]
>"TARGET_SIMD"
>{
> -if (INTVAL (operands[3]) == GET_MODE_UNIT_BITSIZE (mode))
> -  return "shll2\\t%0., %1., %3";
> -

[PATCH] poly_int: Handle more can_div_trunc_p cases

2023-08-03 Thread Richard Sandiford via Gcc-patches
can_div_trunc_p (a, b, &Q, &r) tries to compute a Q and r that
satisfy the usual conditions for truncating division:

 (1) a = b * Q + r
 (2) |b * Q| <= |a|
 (3) |r| < |b|

We can compute Q using the constant component (the case when
all indeterminates are zero).  Since |r| < |b| for the constant
case, the requirements for indeterminate xi with coefficients
ai (for a) and bi (for b) are:

 (2') |bi * Q| <= |ai|
 (3') |ai - bi * Q| <= |bi|

(See the big comment for more details, restrictions, and reasoning).

However, the function works on abstract arithmetic types, and so
it has to be careful not to introduce new overflow.  The code
therefore only handled the extreme for (3'), that is:

 |ai - bi * Q| = |bi|

for the case where Q is zero.

Looking at it again, the overflow issue is a bit easier to handle than
I'd originally thought (or so I hope).  This patch therefore extends the
code to handle |ai - bi * Q| = |bi| for all Q, with Q = 0 no longer
being a separate case.

The net effect is to allow the function to succeed for things like:

 (a0 + b1 (Q+1) x) / (b0 + b1 x)

where Q = a0 / b0, with various sign conditions.  E.g. we now handle:

 (7 + 8x) / (4 + 4x)

with Q = 1 and r = 3 + 4x.
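
To make that concrete, here is a rough usage sketch (not part of the patch,
and assuming a two-coefficient configuration, i.e. NUM_POLY_INT_COEFFS == 2,
so that poly_int64 (7, 8) denotes 7 + 8x):

  /* Check the (7 + 8x) / (4 + 4x) case from above.  */
  poly_int64 a (7, 8);          /* 7 + 8x */
  poly_int64 b (4, 4);          /* 4 + 4x */
  HOST_WIDE_INT q;
  poly_int64 r;
  if (can_div_trunc_p (a, b, &q, &r))
    {
      /* The constant component gives Q = 7 / 4 = 1.  For the indeterminate,
         qi = 8 / 4 = 2 and ri = 0, so the new qi == Q + 1 boundary case
         applies, where the old code required qi == Q and gave up.  */
      gcc_assert (q == 1);
      gcc_assert (known_eq (r, poly_int64 (3, 4)));   /* r = 3 + 4x */
    }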

Tested on aarch64-linux-gnu.  OK to install?

Richard


gcc/
* poly-int.h (can_div_trunc_p): Succeed for more boundary conditions.

gcc/testsuite/
* gcc.dg/plugin/poly-int-tests.h (test_can_div_trunc_p_const)
(test_can_div_trunc_p_const): Add more tests.
---
 gcc/poly-int.h   | 45 ++-
 gcc/testsuite/gcc.dg/plugin/poly-int-tests.h | 85 +---
 2 files changed, 98 insertions(+), 32 deletions(-)

diff --git a/gcc/poly-int.h b/gcc/poly-int.h
index 12571455081..7bff5e5ad26 100644
--- a/gcc/poly-int.h
+++ b/gcc/poly-int.h
@@ -2355,28 +2355,31 @@ can_div_trunc_p (const poly_int_pod<N, Ca> &a,
}
   else
{
- if (q == 0)
-   {
- /* For Q == 0 we simply need: (3') |ai| <= |bi|.  */
- if (a.coeffs[i] != ICa (0))
-   {
- /* Use negative absolute to avoid overflow, i.e.
--|ai| >= -|bi|.  */
- C neg_abs_a = (a.coeffs[i] < 0 ? a.coeffs[i] : -a.coeffs[i]);
- C neg_abs_b = (b.coeffs[i] < 0 ? b.coeffs[i] : -b.coeffs[i]);
- if (neg_abs_a < neg_abs_b)
-   return false;
- rem_p = true;
-   }
-   }
+ /* The only unconditional arithmetic that we can do on ai,
+bi and Q is ai / bi and ai % bi.  (ai == minimum int and
+bi == -1 would be UB in the caller.)  Anything else runs
+the risk of overflow.  */
+ auto qi = NCa (a.coeffs[i]) / NCb (b.coeffs[i]);
+ auto ri = NCa (a.coeffs[i]) % NCb (b.coeffs[i]);
+ /* (2') and (3') are satisfied when ai /[trunc] bi == q.
+So is the stricter condition |ai - bi * Q| < |bi|.  */
+ if (qi == q)
+   rem_p |= (ri != 0);
+ /* The only other case is when:
+
+|bi * Q| + |bi| = |ai| (for (2'))
+and |ai - bi * Q|   = |bi| (for (3'))
+
+The first is equivalent to |bi|(|Q| + 1) == |ai|.
+The second requires ai == bi * (Q + 1) or ai == bi * (Q - 1).  */
+ else if (ri != 0)
+   return false;
+ else if (q <= 0 && qi < q && qi + 1 == q)
+   ;
+ else if (q >= 0 && qi > q && qi - 1 == q)
+   ;
  else
-   {
- /* Otherwise just check for the case in which ai / bi == Q.  */
- if (NCa (a.coeffs[i]) / NCb (b.coeffs[i]) != q)
-   return false;
- if (NCa (a.coeffs[i]) % NCb (b.coeffs[i]) != 0)
-   rem_p = true;
-   }
+   return false;
}
 }
 
diff --git a/gcc/testsuite/gcc.dg/plugin/poly-int-tests.h 
b/gcc/testsuite/gcc.dg/plugin/poly-int-tests.h
index 0b89acd91cd..7af98595a5e 100644
--- a/gcc/testsuite/gcc.dg/plugin/poly-int-tests.h
+++ b/gcc/testsuite/gcc.dg/plugin/poly-int-tests.h
@@ -1899,14 +1899,19 @@ test_can_div_trunc_p_const ()
ph::make (4, 8, 12),
&const_quot));
   ASSERT_EQ (const_quot, C (2));
-  ASSERT_EQ (can_div_trunc_p (ph::make (15, 25, 40),
+  ASSERT_TRUE (can_div_trunc_p (ph::make (15, 25, 40),
+   ph::make (4, 8, 10),
+   &const_quot));
+  ASSERT_EQ (const_quot, C (3));
+  const_quot = 0;
+  ASSERT_EQ (can_div_trunc_p (ph::make (15, 25, 41),
  ph::make (4, 8, 10),
  &const_quot), N <= 2);
-  ASSERT_EQ (const_quot, C (N <= 2 ? 3 : 2));
+  ASSERT_EQ (const_quot, C (N <= 2 ? 3 : 0));
   ASSERT_EQ (can_div_trunc_p (ph::make (43, 79, 80),
  ph::make (4, 8, 10),
  &const_quot), N == 1);
-  ASSERT_EQ (const_quot, C (N 

Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

2023-08-03 Thread Richard Sandiford via Gcc-patches
Hao Liu OS  writes:
> Hi Richard,
>
> Updated the patch with a simple case (see the case and comments below).  It
> shows that a live stmt may not have a reduction def, which introduces the ICE.
>
> Is it OK for trunk?

OK, thanks.

Richard

> 
> Fix the assertion failure on an empty reduction definition in info_for_reduction.
> Even if a stmt is live, it may still have an empty reduction definition.  Check
> the reduction definition instead of the live info before calling info_for_reduction.
>
> gcc/ChangeLog:
>
> PR target/110625
> * config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
> STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/pr110625_3.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc |  2 +-
>  gcc/testsuite/gcc.target/aarch64/pr110625_3.c | 34 +++
>  2 files changed, 35 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_3.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d4d76025545..5b8d8fa8e2d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
> stmt_vec_info stmt_info,
>  static bool
>  aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
>  {
> -  if (!STMT_VINFO_LIVE_P (stmt_info))
> +  if (!STMT_VINFO_REDUC_DEF (stmt_info))
>  return false;
>
>auto reduc_info = info_for_reduction (vinfo, stmt_info);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_3.c 
> b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
> new file mode 100644
> index 000..35a50290cb0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mcpu=neoverse-n2" } */
> +
> +/* Avoid ICE on empty reduction def in single_defuse_cycle.
> +
> +   E.g.
> +  [local count: 858993456]:
> + # sum_18 = PHI 
> + sum.0_5 = (unsigned int) sum_18;
> + _6 = _4 + sum.0_5; <-- it is "live" but doesn't have reduction def
> + sum_15 = (int) _6;
> + ...
> + if (ivtmp_29 != 0)
> +   goto ; [75.00%]
> + else
> +   goto ; [25.00%]
> +
> +  [local count: 644245086]:
> + goto ; [100.00%]
> +
> +  [local count: 214748368]:
> + # _31 = PHI <_6(3)>
> + _8 = _31 >> 1;
> +*/
> +
> +int
> +f (unsigned int *tmp)
> +{
> +  int sum = 0;
> +  for (int i = 0; i < 4; i++)
> +sum += tmp[i];
> +
> +  return (unsigned int) sum >> 1;
> +}
> --
> 2.34.1
>
> 
> From: Hao Liu OS 
> Sent: Tuesday, August 1, 2023 17:43
> To: Richard Sandiford
> Cc: Richard Biener; GCC-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
> multiplying count [PR110625]
>
> Hi Richard,
>
> This is a quick fix for the several ICEs.  It seems that even when
> STMT_VINFO_LIVE_P is true, some reduction stmts still don't have a REDUC_DEF.
> So I changed the check to STMT_VINFO_REDUC_DEF.
>
> Is it OK for trunk?
>
> ---
> Fix the ICEs on an empty reduction definition.  Even when STMT_VINFO_LIVE_P is
> true, some reduction stmts still don't have a definition.
>
> gcc/ChangeLog:
>
> PR target/110625
> * config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
> STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction
> ---
>  gcc/config/aarch64/aarch64.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d4d76025545..5b8d8fa8e2d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, 
> stmt_vec_info stmt_info,
>  static bool
>  aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
>  {
> -  if (!STMT_VINFO_LIVE_P (stmt_info))
> +  if (!STMT_VINFO_REDUC_DEF (stmt_info))
>  return false;
>
>auto reduc_info = info_for_reduction (vinfo, stmt_info);
> --
> 2.40.0
>
>
> 
> From: Richard Sandiford 
> Sent: Monday, July 31, 2023 17:11
> To: Hao Liu OS
> Cc: Richard Biener; GCC-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by 
> multiplying count [PR110625]
>
> Hao Liu OS  writes:
>>> Which test case do you see this for?  The two tests in the patch still
>>> seem to report correct latencies for me if I make the change above.
>>
>> Not the newly added tests.  It is still the existing case causing the 
>> previous ICE (i.e. assertion problem): 
>> gcc.target/aarch64/sve/cost_model_13.c.
>>
>> It's not that the test case itself fails, but the vect dump message says the
>> "reduction latency" is 0:
>>
>> Before the change:
>> cost_model_13.c:7:21: note:  Original vector body cost = 6
>> cost_model_13.c:7:21: note:  Scalar 

Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-02 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> [...]
>> >> in vect_determine_precisions_from_range.  Maybe we should drop
>> >> the shift handling from there and instead rely on
>> >> vect_determine_precisions_from_users, extending:
>> >> 
>> >>   if (TREE_CODE (shift) != INTEGER_CST
>> >>   || !wi::ltu_p (wi::to_widest (shift), precision))
>> >> return;
>> >> 
>> >> to handle ranges where the max is known to be < precision.
>> >> 
>> >> There again, if masking is enough for right shifts and right rotates,
>> >> maybe we should keep the current handling for then (with your fix)
>> >> and skip the types_compatible_p check for those cases.
>> >
>> > I think it should be enough for left-shifts as well?  If we lshift
>> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
>> > op0 will still make us use HImode.  I think we only ever get overly
>> > conservative answers for left-shifts from this function?
>> 
>> But if we have:
>> 
>>   short x, y;
>>   int z = (int) x << (int) y;
>> 
>> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
>> whereas x << y would invoke UB and x << (y & 15) would be 1.
>
> True, but we start with the range of the LHS which in this case
> would be of type 'int' and thus 1 << 16 and not zero.  You
> might call that a failure of vect_determine_precisions_from_range
> of course, since it makes it not exactly a forward propagation ...

Ah, right, sorry.  I should have done more checking.

> [...]
>> > Originally I completely disabled shift support but that regressed
>> > the over-widen testcases a lot which at least have widened shifts
>> > by constants a lot.
>> >
>> > x86 has vector rotates only for AMD XOP (which is dead) plus
>> > some for V1TImode AFAICS, but I think we pattern-match rotates
>> > to shifts, so maybe the precision stuff is interesting for the
>> > case where we match the pattern rotate sequence for widenings?
>> >
>> > So for the types_compatible_p issue something along
>> > the following?  We could also exempt the shift operand from
>> > being covered by min_precision so the consumer would have
>> > to make sure it can be represented (I think that's never going
>> > to be an issue in practice until we get 256bit integers vectorized).
>> > It will have to fixup the shift operands anyway.
>> >
>> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> > index e4ab8c2d65b..cdeeaf98a47 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range 
>> > (stmt_vec_info stmt_info, gassign *stmt)
>> >   }
>> > else if (TREE_CODE (op) == SSA_NAME)
>> >   {
>> > -   /* Ignore codes that don't take uniform arguments.  */
>> > -   if (!types_compatible_p (TREE_TYPE (op), type))
>> > +   /* Ignore codes that don't take uniform arguments.  For shifts
>> > +  the shift amount is known to be in-range.  */
>> 
>> I guess it's more "we can assume that the amount is in range"?
>
> Yes.
>
>> > +   if (code == LSHIFT_EXPR
>> > +   || code == RSHIFT_EXPR
>> > +   || code == LROTATE_EXPR
>> > +   || code == RROTATE_EXPR)
>> > + {
>> > +   min_value = wi::min (min_value, 0, sign);
>> > +   max_value = wi::max (max_value, TYPE_PRECISION (type), 
>> > sign);
>> 
>> LGTM for shifts right.  Because of the above lshift thing, I think we
>> need something like:
>> 
>>   if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
>> {
>>   wide_int op_min_value, op_max_value;
>>       if (!vect_get_range_info (op, &op_min_value, &op_max_value))
>> return;
>> 
>>   /* We can ignore left shifts by negative amounts, which are UB.  */
>>   min_value = wi::min (min_value, 0, sign);
>> 
>>   /* Make sure the highest non-UB shift amount doesn't become UB.  */
>>   op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
>>   auto mask = wi::mask (TYPE_PRECISION (type), false,
>>  op_max_value.to_uhwi ());
>>   max_value = wi::max (max_value, mask, sign);
>> }
>> 
>> Does that look right?
>
> As said it looks overly conservative to me?  For example with my patch
> for
>
> void foo (signed char *v, int s)
> {
>   if (s < 1 || s > 7)
> return;
>   for (int i = 0; i < 1024; ++i)
> v[i] = v[i] << s;
> }
>
> I get
>
> t.c:5:21: note:   _7 has range [0xc000, 0x3f80]
> t.c:5:21: note:   can narrow to signed:15 without loss of precision: _7 = 
> _6 << s_12(D);
> t.c:5:21: note:   only the low 15 bits of _6 are significant
> t.c:5:21: note:   _6 has range [0xff80, 0x7f]
> ...
> t.c:5:21: note:   vect_recog_over_widening_pattern: detected: _7 = _6 << 
> s_12(D);
> t.c:5:21: note:   demoting int to signed short
> t.c:5:21: note:   Splitting statement: _6 = (int) _5;
> t.c:5:21: note:   into pattern statements: patt_24 = (signed short) _5;
> t.c:5:21: note:   and: patt_23 = (int) 

Re: [PATCH][gensupport]: Don't segfault on empty attrs list

2023-08-02 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> Currently we segfault when len == 0 for an attribute list.
>
> essentially [cons: =0, 1, 2, 3; attrs: ] segfaults but should be equivalent to
> [cons: =0, 1, 2, 3] and [cons: =0, 1, 2, 3; attrs:].  This fixes it by just
> returning early and leaving it to the validators whether this should error out
> or not.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * gensupport.cc (conlist): Support length 0 attribute.
>
> --- inline copy of patch -- 
> diff --git a/gcc/gensupport.cc b/gcc/gensupport.cc
> index 
> 959d1d9c83cf397fcb344e8d3db0f339a967587f..5c5f1cf4781551d3db95103c19cd1b70d98f4f73
>  100644
> --- a/gcc/gensupport.cc
> +++ b/gcc/gensupport.cc
> @@ -619,6 +619,9 @@ public:
>   [ns..ns + len) should equal XSTR (rtx, 0).  */
>conlist (const char *ns, unsigned int len, bool numeric)
>{
> +if (len == 0)
> +  return;
> +
>  /* Trim leading whitespaces.  */
>  while (ISBLANK (*ns))
>{

I think instead we should add some "len" guards to the while loops:

/* Trim leading whitespaces.  */
while (len > 0 && ISBLANK (*ns))
  {
ns++;
len--;
  }

...

/* Parse off any modifiers.  */
while (len > 0 && !ISALNUM (*ns))
  {
con += *(ns++);
len--;
  }

Otherwise we could crash for a string that only contains whitespace,
or that only contains non-alphnumeric characters.

OK like that if it works.

Thanks,
Richard


Re: [PATCH]AArch64 Undo vec_widen_shiftl optabs [PR106346]

2023-08-02 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> In GCC 11 we implemented the vectorizer optab for widening left shifts,
> however this optab is only supported for uniform shift constants.
>
> At the moment GCC still has two loop vectorization strategies (classical loop
> and SLP-based loop vec) and the optab is implemented as a scalar pattern.
>
> This means that when we apply it to a non-uniform constant inside a loop we 
> only
> find out during SLP build that the constants aren't uniform.  At this point 
> it's
> too late and we lose SLP entirely.
>
> Over the years I've tried various options but none of them works well:
>
> 1. Dissolving patterns during SLP build (problematic, also dissolves them for
> non-slp).
> 2. Optionally ignoring patterns for SLP build (problematic, ends up
> interfering with relevancy detection).
> 3. Relaxing the constraint on SLP build to allow non-constant values and
> dissolving them after SLP build using an SLP pattern.  (problematic, ends up
> breaking shift reassociation).
>
> As a result we've concluded that for now this pattern should just be removed
> and formed during RTL.
>
> The plan is to move this to an SLP only pattern once we remove classical loop
> vectorization support from GCC, at which time we can also properly support 
> SVE's
> Top and Bottom variants.
>
> This removes the optab and reworks the RTL to recognize both the vector 
> variant
> and the intrinsics variant.  Also just simplifies all these patterns.
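
For reference, the kind of loop the optab was matching is a widening shift by
a uniform immediate, along the lines of (an illustrative example, not one of
the testsuite cases):

  void
  widen_shift (unsigned char *a, unsigned short *b, int n)
  {
    /* Each element is widened and shifted left by a uniform constant,
       the shape that vec_widen_shiftl (and the SHLL instructions)
       handled.  */
    for (int i = 0; i < n; i++)
      b[i] = (unsigned short) a[i] << 7;
  }

The problem described above only shows up when the shift amounts are not all
the same constant, and that is only discovered once SLP build has started.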
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   PR target/106346
>   * config/aarch64/aarch64-simd.md (vec_widen_shiftl_lo_,
>   vec_widen_shiftl_hi_): Remove.
>   (aarch64_shll_internal): Renamed to...
>   (aarch64_shll): .. This.
>   (aarch64_shll2_internal): Renamed to...
>   (aarch64_shll2): .. This.
>   (aarch64_shll_n, aarch64_shll2_n): Re-use new
>   optabs.
>   * config/aarch64/constraints.md (D2, D3): New.
>   * config/aarch64/predicates.md (aarch64_simd_shift_imm_vec): New.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/106346
>   * gcc.target/aarch64/pr98772.c: Adjust assembly.
>   * gcc.target/aarch64/vect-widen-shift.c: New test.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> d95394101470446e55f25a2397dd112239b6a54d..afd5b8632afbcddf8dad14495c3446c560eb085d
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -6387,105 +6387,66 @@ (define_insn 
> "aarch64_qshl"
>[(set_attr "type" "neon_sat_shift_reg")]
>  )
>  
> -(define_expand "vec_widen_shiftl_lo_"
> -  [(set (match_operand: 0 "register_operand" "=w")
> - (unspec: [(match_operand:VQW 1 "register_operand" "w")
> -  (match_operand:SI 2
> -"aarch64_simd_shift_imm_bitsize_" "i")]
> -  VSHLL))]
> -  "TARGET_SIMD"
> -  {
> -rtx p = aarch64_simd_vect_par_cnst_half (mode, , false);
> -emit_insn (gen_aarch64_shll_internal (operands[0], 
> operands[1],
> -  p, operands[2]));
> -DONE;
> -  }
> -)
> -
> -(define_expand "vec_widen_shiftl_hi_"
> -   [(set (match_operand: 0 "register_operand")
> - (unspec: [(match_operand:VQW 1 "register_operand" "w")
> -  (match_operand:SI 2
> -"immediate_operand" "i")]
> -   VSHLL))]
> -   "TARGET_SIMD"
> -   {
> -rtx p = aarch64_simd_vect_par_cnst_half (mode, , true);
> -emit_insn (gen_aarch64_shll2_internal (operands[0], 
> operands[1],
> -   p, operands[2]));
> -DONE;
> -   }
> -)
> -
>  ;; vshll_n
>  
> -(define_insn "aarch64_shll_internal"
> -  [(set (match_operand: 0 "register_operand" "=w")
> - (unspec: [(vec_select:
> - (match_operand:VQW 1 "register_operand" "w")
> - (match_operand:VQW 2 "vect_par_cnst_lo_half" ""))
> -  (match_operand:SI 3
> -"aarch64_simd_shift_imm_bitsize_" "i")]
> -  VSHLL))]
> +(define_insn "aarch64_shll"
> +  [(set (match_operand: 0 "register_operand")
> + (ashift: (ANY_EXTEND:
> + (match_operand:VD_BHSI 1 "register_operand"))
> +  (match_operand: 2
> +"aarch64_simd_shift_imm_vec")))]

The name of this predicate seems more general than its meaning.
How about naming it aarch64_simd_shift_imm_vec_half_bitsize, to follow:

;; Predicates used by the various SIMD shift operations.  These
;; fall in to 3 categories.
;;   Shifts with a range 0-(bit_size - 1) (aarch64_simd_shift_imm)
;;   Shifts with a range 1-bit_size (aarch64_simd_shift_imm_offset)
;;   Shifts with a range 0-bit_size (aarch64_simd_shift_imm_bitsize)

Or 

Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-02 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, 1 Aug 2023, Richard Sandiford wrote:
>
>> Richard Sandiford  writes:
>> > Richard Biener via Gcc-patches  writes:
>> >> The following makes sure to limit the shift operand when vectorizing
>> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
>> >> operand otherwise invokes undefined behavior.  When we determine
>> >> whether we can demote the operand we know we at most shift in the
>> >> sign bit so we can adjust the shift amount.
>> >>
>> >> Note this has the possibility of un-CSEing common shift operands
>> >> as there's no good way to share pattern stmts between patterns.
>> >> We'd have to separately pattern recognize the definition.
>> >>
>> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>> >>
>> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
>> >> the fallback optimistic zero for out-of-range shifts is at least
>> >> "corrrect".  Not sure we ever try to demote rotates (probably not).
>> >
>> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
>> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
>
> I meant "correct" for the constant folding that evaluates out-of-bound
> shifts as zero.
>
>> Sorry for the multiple messages.  Wanted to get something out quickly
>> because I wasn't sure how long it would take me to write this...
>> 
>> On rotates, for:
>> 
>> void
>> foo (unsigned short *restrict ptr)
>> {
>>   for (int i = 0; i < 200; ++i)
>> {
>>   unsigned int x = ptr[i] & 0xff0;
>>   ptr[i] = (x << 1) | (x >> 31);
>> }
>> }
>> 
>> we do get:
>> 
>> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
>> 
>> although aarch64 doesn't provide rrotate patterns, so nothing actually
>> comes of it.
>
> I think it's still correct that we only need unsigned:13 for the input,
> we know other bits are zero.  But of course when actually applying
> this as documented
>
> /* Record that STMT_INFO could be changed from operating on TYPE to
>operating on a type with the precision and sign given by PRECISION
>and SIGN respectively.
>
> the operation itself has to be altered (the above doesn't suggest
> promoting/demoting the operands to TYPE is the only thing to do).
>
> So it seems to be the burden is on the consumers of the information?

Yeah, textually that seems fair.  Not sure I was thinking of it in
those terms at the time though. :)

>> I think the handling of variable shifts is flawed for other reasons.  Given:
>> 
>> void
>> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>> ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> us (unsigned short *restrict ptr1, short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>> ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> su (short *restrict ptr1, unsigned short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>> ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> ss (short *restrict ptr1, short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>> ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> we only narrow uu and ss, due to:
>> 
>>  /* Ignore codes that don't take uniform arguments.  */
>>  if (!types_compatible_p (TREE_TYPE (op), type))
>>return;
>
> I suppose that's because we care about the shift operand at all here.
> We could possibly use [0 .. precision-1] as known range for it
> and only if that doesn't fit 'type' give up (and otherwise simply
> ignore the input range of the shift operands here).
>
>> in vect_determine_precisions_from_range.  Maybe we should drop
>> the shift handling from there and instead rely on
>> vect_determine_precisions_from_users, extending:
>> 
>>  if (TREE_CODE (shift) != INTEGER_CST
>>  || !wi::ltu_p (wi::to_widest (shift), precision))
>>return;
>> 
>> to handle ranges where the max is known to be < precision.
>> 
>> There again, if masking is enough for right shifts and right rotates,
>> maybe we should keep the current handling for then (with your fix)
>> and skip the types_compatible_p check for those cases.
>
> I think it should be enough for left-shifts as well?  If we lshift
> out like 0x100 << 9 so the lhs range is [0,0] the input range from
> op0 will still make us use HImode.  I think we only ever get overly
> conservative answers for left-shifts from this function?

But if we have:

  short x, y;
  int z = (int) x << (int) y;

and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
whereas x << y would invoke UB and x << (y & 15) would be 1.

> Whatever works for RROTATE should also work for LROTATE.

I think the same problem affects LROTATE.

>> So:
>> 
>> - restrict shift handling in vect_determine_precisions_from_range to
>>   RSHIFT_EXPR and RROTATE_EXPR
>> 
>> - remove types_compatible_p restriction for those cases
>> 
>> - extend vect_determine_precisions_from_users shift handling to check
>>   for ranges on the 

Re: [PATCH]AArch64 update costing for MLA by invariant

2023-08-02 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> Tamar Christina  writes:
>> > Hi All,
>> >
>> > When determining issue rates we currently discount non-constant MLA
>> > accumulators for Advanced SIMD but don't do it for the latency.
>> >
>> > This means the costs for Advanced SIMD with a constant accumulator are
>> > wrong and results in us costing SVE and Advanced SIMD the same.  This
>> > can cause us to vectorize with Advanced SIMD instead of SVE in some cases.
>> >
>> > This patch adds the same discount for SVE and Scalar as we do for issue 
>> > rate.
>> >
>> > My assumption was that on issue rate we reject all scalar constants
>> > early because we take into account the extra instruction to create the
>> constant?
>> > Though I'd have expected this to be in prologue costs.  For this
>> > reason I added an extra parameter to allow me to force the check to at
>> > least look for the multiplication.
>> 
>> I'm not sure that was it.  I wish I'd added a comment to say what it was
>> though :(  I suspect different parts of this function were written at 
>> different
>> times, hence the inconsistency.
>> 
>> > This gives a 5% improvement in fotonik3d_r in SPECCPU 2017 on large
>> > Neoverse cores.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> >* config/aarch64/aarch64.cc (aarch64_multiply_add_p): Add param
>> >allow_constants.
>> >(aarch64_adjust_stmt_cost): Use it.
>> >(aarch64_vector_costs::count_ops): Likewise.
>> >(aarch64_vector_costs::add_stmt_cost): Pass vinfo to
>> >aarch64_adjust_stmt_cost.
>> >
>> > --- inline copy of patch --
>> > diff --git a/gcc/config/aarch64/aarch64.cc
>> > b/gcc/config/aarch64/aarch64.cc index
>> >
>> 560e5431636ef46c41d56faa0c4e95be78f64b50..76b74b77b3f122a3c9725
>> 57e2f83
>> > b63ba365fea9 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -16398,10 +16398,11 @@ aarch64_advsimd_ldp_stp_p (enum
>> vect_cost_for_stmt kind,
>> > or multiply-subtract sequence that might be suitable for fusing into a
>> > single instruction.  If VEC_FLAGS is zero, analyze the operation as
>> > a scalar one, otherwise analyze it as an operation on vectors with 
>> > those
>> > -   VEC_* flags.  */
>> > +   VEC_* flags.  When ALLOW_CONSTANTS we'll recognize all accumulators
>> including
>> > +   constant ones.  */
>> >  static bool
>> >  aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
>> > -  unsigned int vec_flags)
>> > +  unsigned int vec_flags, bool allow_constants)
>> >  {
>> >gassign *assign = dyn_cast (stmt_info->stmt);
>> >if (!assign)
>> > @@ -16410,8 +16411,9 @@ aarch64_multiply_add_p (vec_info *vinfo,
>> stmt_vec_info stmt_info,
>> >if (code != PLUS_EXPR && code != MINUS_EXPR)
>> >  return false;
>> >
>> > -  if (CONSTANT_CLASS_P (gimple_assign_rhs1 (assign))
>> > -  || CONSTANT_CLASS_P (gimple_assign_rhs2 (assign)))
>> > +  if (!allow_constants
>> > +  && (CONSTANT_CLASS_P (gimple_assign_rhs1 (assign))
>> > +|| CONSTANT_CLASS_P (gimple_assign_rhs2 (assign
>> >  return false;
>> >
>> >for (int i = 1; i < 3; ++i)
>> > @@ -16429,7 +16431,7 @@ aarch64_multiply_add_p (vec_info *vinfo,
>> stmt_vec_info stmt_info,
>> >if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>> >continue;
>> >
>> > -  if (vec_flags & VEC_ADVSIMD)
>> > +  if (!allow_constants && (vec_flags & VEC_ADVSIMD))
>> >{
>> >  /* Scalar and SVE code can tie the result to any FMLA input (or none,
>> > although that requires a MOVPRFX for SVE).  However, Advanced
>> > SIMD @@ -16441,7 +16443,8 @@ aarch64_multiply_add_p (vec_info
>> *vinfo, stmt_vec_info stmt_info,
>> >return false;
>> >  def_stmt_info = vinfo->lookup_def (rhs);
>> >  if (!def_stmt_info
>> > -|| STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
>> > +|| STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def
>> > +|| STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_constant_def)
>> 
>> Do you see vect_constant_defs in practice, or is this just for completeness?
>> I would expect any constants to appear as direct operands.  I don't mind
>> keeping it if it's just a belt-and-braces thing though.
>
> In the latency case, where I had allow_constants, the early rejection based on
> the operand itself doesn't trigger, so in that case I still needed to reject
> constants, but after the multiply check.  While they do appear as direct
> operands as well, they also have their own nodes, in particular for SLP, where
> the constants are handled as a group.

Ah, OK, thanks.

> But can also check CONSTANT_CLASS_P (rhs) if that's preferrable. 

No, what you did is more correct.  I just wasn't sure at first which case
it was handling.

Thanks,
Richard


Re: [PATCH]AArch64 update costing for combining vector conditionals

2023-08-02 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> boolean comparisons have different cost depending on the mode. e.g.
> a && b when predicated doesn't require an addition instruction, the AND is 
> free

Nit (for the commit msg): additional

Maybe:

  for SVE, a && b doesn't require an additional instruction when a or b
  is predicated, ...

?

> by combining the predicate of the one operation into the second one.  At the
> moment though we only fuse compares so this update requires one of the
> operands to be a comparison.
>
> Scalars also don't require this because the non-ifct variant is a series of

Typo: ifcvt

> branches where following the branch sequences themselves are natural ANDs.
>
> Advanced SIMD however does require an actual AND to combine the boolean 
> values.
>
> As such this patch discounts Scalar and SVE boolean operation latency and
> throughput.
>
> With this patch comparison heavy code prefers SVE as it should, especially in
> cases with SVE VL == Advanced SIMD VL where previously the SVE prologue costs
> would tip it towards Advanced SIMD.
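
As a concrete illustration of the comparison-heavy shape in question (a
hand-written example, not taken from the patch or the testsuite):

  void
  in_range (int *x, int *out, int n)
  {
    for (int i = 0; i < n; i++)
      /* The && of the two compares becomes a boolean AND after
         if-conversion.  With SVE that AND can be folded into the
         predication of the second compare; Advanced SIMD needs an
         explicit AND of the two comparison results.  */
      out[i] = (x[i] > 10 && x[i] < 100) ? 1 : 0;
  }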
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_bool_compound_p): New.
>   (aarch64_adjust_stmt_cost, aarch64_vector_costs::count_ops): Use it.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> b1bacc734b4630257b6ebf8ca7d9afeb34008c10..55963bb28be7ede08b05fb9fddb5a65f6818c63e
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16453,6 +16453,49 @@ aarch64_multiply_add_p (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>return false;
>  }
>  
> +/* Return true if STMT_INFO is the second part of a two-statement boolean AND
> +   expression sequence that might be suitable for fusing into a
> +   single instruction.  If VEC_FLAGS is zero, analyze the operation as
> +   a scalar one, otherwise analyze it as an operation on vectors with those
> +   VEC_* flags.  */
> +
> +static bool
> +aarch64_bool_compound_p (vec_info *vinfo, stmt_vec_info stmt_info,
> +  unsigned int vec_flags)
> +{
> +  gassign *assign = dyn_cast (stmt_info->stmt);
> +  if (!assign
> +  || !STMT_VINFO_VECTYPE (stmt_info)
> +  || !VECTOR_BOOLEAN_TYPE_P (STMT_VINFO_VECTYPE (stmt_info))
> +  || gimple_assign_rhs_code (assign) != BIT_AND_EXPR)

Very minor, sorry, but I think the condition reads more naturally
if the BIT_AND_EXPR test comes immediately after the !assign.

OK with that change, thanks.

Richard

> +return false;
> +
> +  for (int i = 1; i < 3; ++i)
> +{
> +  tree rhs = gimple_op (assign, i);
> +
> +  if (TREE_CODE (rhs) != SSA_NAME)
> + continue;
> +
> +  stmt_vec_info def_stmt_info = vinfo->lookup_def (rhs);
> +  if (!def_stmt_info
> +   || STMT_VINFO_DEF_TYPE (def_stmt_info) != vect_internal_def)
> + continue;
> +
> +  gassign *rhs_assign = dyn_cast (def_stmt_info->stmt);
> +  if (!rhs_assign
> +   || TREE_CODE_CLASS (gimple_assign_rhs_code (rhs_assign))
> + != tcc_comparison)
> + continue;
> +
> +  if (vec_flags & VEC_ADVSIMD)
> + return false;
> +
> +  return true;
> +}
> +  return false;
> +}
> +
>  /* We are considering implementing STMT_INFO using SVE.  If STMT_INFO is an
> in-loop reduction that SVE supports directly, return its latency in 
> cycles,
> otherwise return zero.  SVE_COSTS specifies the latencies of the relevant
> @@ -16750,11 +16793,17 @@ aarch64_adjust_stmt_cost (vec_info *vinfo, 
> vect_cost_for_stmt kind,
>   }
>  
>gassign *assign = dyn_cast (STMT_VINFO_STMT (stmt_info));
> -  if (assign && !vect_is_reduction (stmt_info))
> +  if (assign)
>   {
> bool simd_p = vec_flags & VEC_ADVSIMD;
> /* For MLA we need to reduce the cost since MLA is 1 instruction.  */
> -   if (aarch64_multiply_add_p (vinfo, stmt_info, vec_flags, !simd_p))
> +   if (!vect_is_reduction (stmt_info)
> +   && aarch64_multiply_add_p (vinfo, stmt_info, vec_flags, !simd_p))
> + return 0;
> +
> +   /* For vector boolean ANDs with a compare operand we just need
> +  one insn.  */
> +   if (aarch64_bool_compound_p (vinfo, stmt_info, vec_flags))
>   return 0;
>   }
>  
> @@ -16831,6 +16880,12 @@ aarch64_vector_costs::count_ops (unsigned int count, 
> vect_cost_for_stmt kind,
>&& aarch64_multiply_add_p (m_vinfo, stmt_info, m_vec_flags, false))
>  return;
>  
> +  /* Assume that bool AND with compare operands will become a single
> + operation.  */
> +  if (stmt_info
> +  && aarch64_bool_compound_p (m_vinfo, stmt_info, m_vec_flags))
> +return;
> +
>/* Count the basic operation cost associated with KIND.  */
>switch (kind)
>  {


Re: [PATCH]AArch64 update costing for MLA by invariant

2023-08-02 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> When determining issue rates we currently discount non-constant MLA 
> accumulators
> for Advanced SIMD but don't do it for the latency.
>
> This means the costs for Advanced SIMD with a constant accumulator are wrong 
> and
> results in us costing SVE and Advanced SIMD the same.  This can cauze us to
> vectorize with Advanced SIMD instead of SVE in some cases.
>
> This patch adds the same discount for SVE and Scalar as we do for issue rate.
>
> My assumption was that on issue rate we reject all scalar constants early
> because we take into account the extra instruction to create the constant?
> Though I'd have expected this to be in prologue costs.  For this reason I 
> added
> an extra parameter to allow me to force the check to at least look for the
> multiplication.

I'm not sure that was it.  I wish I'd added a comment to say what
it was though :(  I suspect different parts of this function were
written at different times, hence the inconsistency.

> This gives a 5% improvement in fotonik3d_r in SPECCPU 2017 on large
> Neoverse cores.
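
The shape being costed here is a multiply-add whose accumulator is
loop-invariant, along the lines of (a hand-written illustration, not taken
from the patch):

  void
  mla_invariant (int *a, int *b, int *out, int acc, int n)
  {
    /* The accumulator ACC is invariant.  As the comment in
       aarch64_multiply_add_p explains, scalar and SVE code can tie the
       result to any input, but Advanced SIMD MLA overwrites its
       accumulator operand, so the invariant has to be copied into the
       result register on every iteration, which is the extra move the
       costing should reflect.  */
    for (int i = 0; i < n; i++)
      out[i] = acc + a[i] * b[i];
  }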
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_multiply_add_p): Add param
>   allow_constants. 
>   (aarch64_adjust_stmt_cost): Use it.
>   (aarch64_vector_costs::count_ops): Likewise.
>   (aarch64_vector_costs::add_stmt_cost): Pass vinfo to
>   aarch64_adjust_stmt_cost.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 560e5431636ef46c41d56faa0c4e95be78f64b50..76b74b77b3f122a3c972557e2f83b63ba365fea9
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16398,10 +16398,11 @@ aarch64_advsimd_ldp_stp_p (enum vect_cost_for_stmt 
> kind,
> or multiply-subtract sequence that might be suitable for fusing into a
> single instruction.  If VEC_FLAGS is zero, analyze the operation as
> a scalar one, otherwise analyze it as an operation on vectors with those
> -   VEC_* flags.  */
> +   VEC_* flags.  When ALLOW_CONSTANTS we'll recognize all accumulators 
> including
> +   constant ones.  */
>  static bool
>  aarch64_multiply_add_p (vec_info *vinfo, stmt_vec_info stmt_info,
> - unsigned int vec_flags)
> + unsigned int vec_flags, bool allow_constants)
>  {
>gassign *assign = dyn_cast (stmt_info->stmt);
>if (!assign)
> @@ -16410,8 +16411,9 @@ aarch64_multiply_add_p (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>if (code != PLUS_EXPR && code != MINUS_EXPR)
>  return false;
>  
> -  if (CONSTANT_CLASS_P (gimple_assign_rhs1 (assign))
> -  || CONSTANT_CLASS_P (gimple_assign_rhs2 (assign)))
> +  if (!allow_constants
> +  && (CONSTANT_CLASS_P (gimple_assign_rhs1 (assign))
> +   || CONSTANT_CLASS_P (gimple_assign_rhs2 (assign
>  return false;
>  
>for (int i = 1; i < 3; ++i)
> @@ -16429,7 +16431,7 @@ aarch64_multiply_add_p (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>if (!rhs_assign || gimple_assign_rhs_code (rhs_assign) != MULT_EXPR)
>   continue;
>  
> -  if (vec_flags & VEC_ADVSIMD)
> +  if (!allow_constants && (vec_flags & VEC_ADVSIMD))
>   {
> /* Scalar and SVE code can tie the result to any FMLA input (or none,
>although that requires a MOVPRFX for SVE).  However, Advanced SIMD
> @@ -16441,7 +16443,8 @@ aarch64_multiply_add_p (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>   return false;
> def_stmt_info = vinfo->lookup_def (rhs);
> if (!def_stmt_info
> -   || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def)
> +   || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_external_def
> +   || STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_constant_def)

Do you see vect_constant_defs in practice, or is this just for completeness?
I would expect any constants to appear as direct operands.  I don't mind
keeping it if it's just a belt-and-braces thing though.

But rather than add the allow_constants parameter, I think we should
just try removing:

  if (CONSTANT_CLASS_P (gimple_assign_rhs1 (assign))
  || CONSTANT_CLASS_P (gimple_assign_rhs2 (assign)))
return false;

so that the detection is the same for throughput and latency.  I think:

  if (vec_flags & VEC_ADVSIMD)
{
  /* Scalar and SVE code can tie the result to any FMLA input (or none,
 although that requires a MOVPRFX for SVE).  However, Advanced SIMD
 only supports MLA forms, so will require a move if the result
 cannot be tied to the accumulator.  The most important case in
 which this is true is when the accumulator input is invariant.  */
  rhs = gimple_op (assign, 3 - i);
  if (TREE_CODE (rhs) != SSA_NAME)
return false;
  def_stmt_info = 

Re: [PATCH 2/5] [RISC-V] Generate Zicond instruction for basic semantics

2023-08-02 Thread Richard Sandiford via Gcc-patches
Jeff Law via Gcc-patches  writes:
> On 8/1/23 05:18, Richard Sandiford wrote:
>> 
>> Where were you seeing the requirement for pointer equality?  genrecog.cc
>> at least uses rtx_equal_p, and I think it has to.  E.g. some patterns
>> use (match_dup ...) to match output and input mems, and mem rtxes
>> shouldn't be shared.
> It's a general concern due to the way we handle transforming pseudos 
> into hard registers after allocation is complete.   We can end up with 
> two REG expressions that will compare equal according to rtx_equal_p, 
> but which are not pointer equal.

But isn't that OK?  I don't think there's a requirement for match_dup
pointer equality either before or after RA.  Or at least, there
shouldn't be.  If something happens to rely on pointer equality
for match_dups then I think we should fix it.

So IMO, like you said originally, match_dup would be the right way to
handle this kind of pattern.

The reason I'm interested is that AArch64 makes pretty extensive use
of match_dup for this purpose.  E.g.:

(define_insn "aarch64_abd"
  [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
(minus:VDQ_BHSI
  (USMAX:VDQ_BHSI
(match_operand:VDQ_BHSI 1 "register_operand" "w")
(match_operand:VDQ_BHSI 2 "register_operand" "w"))
  (:VDQ_BHSI
(match_dup 1)
(match_dup 2]

So if this isn't working correctly for subregs (or for anythine else),
then I'd be keen to do something about it :)

I don't want to labour the point though.

Thanks,
Richard

