Re: [V2 PATCH] Handle bitop with INTEGER_CST in analyze_and_compute_bitop_with_inv_effect.

2023-11-12 Thread Hongtao Liu
On Fri, Nov 10, 2023 at 5:12 PM Richard Biener
 wrote:
>
> On Wed, Nov 8, 2023 at 9:22 AM Hongtao Liu  wrote:
> >
> > On Wed, Nov 8, 2023 at 3:53 PM Richard Biener
> >  wrote:
> > >
> > > On Wed, Nov 8, 2023 at 2:18 AM Hongtao Liu  wrote:
> > > >
> > > > On Tue, Nov 7, 2023 at 10:34 PM Richard Biener
> > > >  wrote:
> > > > >
> > > > > On Tue, Nov 7, 2023 at 2:03 PM Hongtao Liu  wrote:
> > > > > >
> > > > > > On Tue, Nov 7, 2023 at 4:10 PM Richard Biener
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, Nov 7, 2023 at 7:08 AM liuhongt  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > analyze_and_compute_bitop_with_inv_effect assumes the first 
> > > > > > > > operand is
> > > > > > > > loop invariant which is not the case when it's INTEGER_CST.
> > > > > > > >
> > > > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > > > > > Ok for trunk?
> > > > > > >
> > > > > > > So this addresses a missed optimization, right?  It seems to me 
> > > > > > > that
> > > > > > > even with two SSA names we are only "lucky" when rhs1 is the 
> > > > > > > invariant
> > > > > > > one.  So instead of swapping this way I'd do
> > > > > > > Yes, it's a missed optimization.
> > > > > > > And I think expr_invariant_in_loop_p (loop, match_op[1]) should be
> > > > > > > enough: if match_op[1] is a loop invariant, it cannot satisfy the
> > > > > > > conditions below (there couldn't be any header_phi from its
> > > > > > > definition).
> > > > >
> > > > > Yes, all I said is that when you now care for op1 being INTEGER_CST
> > > > > it could also be an invariant SSA name and thus only after swapping 
> > > > > op0/op1
> > > > > we could have a successful match, no?
> > > > Sorry, the commit message is a little bit misleading.
> > > > At first, I just wanted to handle the INTEGER_CST case (with TREE_CODE
> > > > (match_op[1]) == INTEGER_CST), but then I realized that this could
> > > > probably be extended to the normal SSA_NAME case as well, so I used
> > > > expr_invariant_in_loop_p, which should theoretically be able to handle
> > > > the SSA_NAME case as well.
> > > >
> > > > if (expr_invariant_in_loop_p (loop, match_op[1])) is true, w/o
> > > > swapping it must return NULL_TREE for below conditions.
> > > > if (expr_invariant_in_loop_p (loop, match_op[1])) is false, w/
> > > > swapping it must return NULL_TREE too.
> > > > So it can cover the both cases you mentioned, no need for a loop to
> > > > iterate 2 match_ops for all conditions.
> > >
> > > Sorry if it appears we're going in circles ;)
> > >
> > > > 3692  if (TREE_CODE (match_op[1]) != SSA_NAME
> > > > 3693  || !expr_invariant_in_loop_p (loop, match_op[0])
> > > > 3694  || !(header_phi = dyn_cast  (SSA_NAME_DEF_STMT 
> > > > (match_op[1])))
> > >
> > > > but this only checks match_op[1] (an SSA name at this point) for being
> > > > defined by the header PHI.  What if
> > > > expr_invariant_in_loop_p (loop, match_op[1])
> > > > and header_phi = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (match_op[0])),
> > > > which I think can happen when both ops are SSA names?
> > The whole condition is like
> >
> > 3692  if (TREE_CODE (match_op[1]) != SSA_NAME
> > 3693  || !expr_invariant_in_loop_p (loop, match_op[0])
> > 3694  || !(header_phi = dyn_cast  (SSA_NAME_DEF_STMT 
> > (match_op[1])))
> > 3695  || gimple_bb (header_phi) != loop->header  - This would
> > be true if match_op[1] is SSA_NAME and expr_invariant_in_loop_p
>
> > But it could be expr_invariant_in_loop_p (match_op[1]) and
> > header_phi = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (match_op[0]))

> > > > > > > > +  if (expr_invariant_in_loop_p (loop, match_op[1]))
> > > > > > > > +std::swap (match_op[0], match_op[1]);
match_op[1] will be swapped into match_op[0], so this case is also handled
by my patch [1] (the v2 patch).
My point is that the code above already handles two SSA names; there is no
need to iterate over all the conditions, expr_invariant_in_loop_p alone is
enough.
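
To make the argument above concrete, here is a minimal sketch in plain C.
It is a toy model, not GCC code: struct toy_op and the toy_* functions are
made-up names, and the only assumption carried over from the thread is that
a loop-invariant operand can never be defined by a header PHI.  Under that
assumption, "swap if op1 is invariant, then check once" gives the same
answer as trying both operand orders.

```c
#include <stdbool.h>

/* Toy stand-in for an operand; this is NOT a GCC tree.  */
struct toy_op
{
  bool invariant;   /* plays the role of expr_invariant_in_loop_p  */
  bool header_phi;  /* operand is defined by a PHI in the loop header  */
};

/* The existing check: op0 is invariant and op1 comes from a header PHI.  */
static bool
toy_match (struct toy_op op0, struct toy_op op1)
{
  return op0.invariant && op1.header_phi;
}

/* Strategy from the patch: swap when op1 is invariant, then check once.  */
static bool
toy_match_with_swap (struct toy_op op0, struct toy_op op1)
{
  if (op1.invariant)
    {
      struct toy_op tmp = op0;
      op0 = op1;
      op1 = tmp;
    }
  return toy_match (op0, op1);
}

/* Alternative strategy: run the check for both operand orders.  */
static bool
toy_match_both_orders (struct toy_op op0, struct toy_op op1)
{
  return toy_match (op0, op1) || toy_match (op1, op0);
}

/* Return true if both strategies agree for every pair of operand kinds:
   invariant, header-PHI-defined, or neither.  */
static bool
toy_strategies_agree (void)
{
  static const struct toy_op kinds[3] = {
    { true, false }, { false, true }, { false, false }
  };
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      if (toy_match_with_swap (kinds[i], kinds[j])
          != toy_match_both_orders (kinds[i], kinds[j]))
        return false;
  return true;
}
```

Calling toy_strategies_agree () enumerates all nine combinations; the model
says nothing about the remaining conditions in the real function.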

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635440.html
>
> all I say is that for two SSA names we could not match the condition
> (aka not fail)
> when we swap op0/op1.  Not only when op1 is INTEGER_CST.

>
> > 3696  || gimple_phi_num_args (header_phi) != 2)
> >
> > > If expr_invariant_in_loop_p (loop, match_op[1]) is true and it's an
> > > SSA_NAME, then according to the code in expr_invariant_in_loop_p the
> > > def_bb of the gphi is either NULL or does not belong to this loop;
> > > either case will make gimple_bb (header_phi) != loop->header true.
> >
> > 1857  if (TREE_CODE (expr) == SSA_NAME)
> > 1858{
> > 1859  def_bb = gimple_bb (SSA_NAME_DEF_STMT (expr));
> > 1860  if (def_bb
> > 1861  && flow_bb_inside_loop_p (loop, def_bb))  -- def_bb is
> > NULL or it doesn't belong to the loop
> > 1862return false;
> > 1863
> > 1864  return true;
> > 1865}
> > 1866
> > 1867  if (!EXPR_P (expr))
> >
> > >
> > > The only canonicalization we have is that constant operands are put 
> > > second so
> > > it would have been more natural to

Re: [PATCH] Fix (fcopysign x, NEGATIVE_CONST) -> (fneg (fabs x)) simplification [PR112483]

2023-11-12 Thread Andrew Pinski
On Sun, Nov 12, 2023, 23:10 Tamar Christina  wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Monday, November 13, 2023 6:55 AM
> > To: Xi Ruoyao 
> > Cc: gcc-patches@gcc.gnu.org; chenglulu ;
> > i...@xen0n.name; xucheng...@loongson.cn; Tamar Christina
> > ; tschwi...@gcc.gnu.org; Roger Sayle
> > 
> > Subject: Re: [PATCH] Fix (fcopysign x, NEGATIVE_CONST) -> (fneg (fabs x))
> > simplification [PR112483]
> >
> > On Sun, Nov 12, 2023 at 9:27 PM Xi Ruoyao  wrote:
> > >
> > > (fcopysign x, NEGATIVE_CONST) can be simplified to (fneg (fabs x)),
> > > but a logic error in the code caused it mistakenly simplified to (fneg
> > > x) instead.
>
> The fix aside, I actually wonder if simplify-rtx.cc should be doing this
> at all.
> The mid-end didn't do it because the target said it had an optab for the
> copysign operation.  Otherwise during expand_COPYSIGN it would have been
> expanded as FNEG (FABS (..)) already.
>
> In the case of e.g. loongarch64 it looks like the target actually has an
> fcopysign instruction, so wouldn't this rewriting by simplify-rtx be a
> de-optimization?
>

Maybe the simplify_gen_unary under the if statement should really
be simplify_unary_operation.
This allows for constant folding but avoids generating a non-canonical form
here.
Also note that canonical RTL forms have a section in the internals document.
The gimple-level canonical forms are not documented yet; I started writing
some of it on the wiki though: https://gcc.gnu.org/wiki/GimpleCanonical .

Thanks,
Andrew Pinski



> Thanks,
> Tamar
> >
> > OK.
> >
> > > gcc/ChangeLog:
> > >
> > > PR rtl-optimization/112483
> > > * simplify-rtx.cc (simplify_binary_operation_1) <case COPYSIGN>:
> > > Fix the simplification of (fcopysign x, NEGATIVE_CONST).
> > > ---
> > >
> > > Bootstrapped and regtested on loongarch64-linux-gnu and
> > > x86_64-linux-gnu.  Ok for trunk?
> > >
> > >  gcc/simplify-rtx.cc | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index
> > > 69d87579d9c..2d2e5a3c1ca 100644
> > > --- a/gcc/simplify-rtx.cc
> > > +++ b/gcc/simplify-rtx.cc
> > > @@ -4392,7 +4392,7 @@ simplify_ashift:
> > >   real_convert (&f1, mode, CONST_DOUBLE_REAL_VALUE (trueop1));
> > >   rtx tmp = simplify_gen_unary (ABS, mode, op0, mode);
> > >   if (REAL_VALUE_NEGATIVE (f1))
> > > -   tmp = simplify_gen_unary (NEG, mode, op0, mode);
> > > +   tmp = simplify_gen_unary (NEG, mode, tmp, mode);
> > >   return tmp;
> > > }
> > >if (GET_CODE (op0) == NEG || GET_CODE (op0) == ABS)
> > > --
> > > 2.42.1
> > >
>
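
For readers following the one-line fix, a minimal sketch in plain C (not
GCC internals; the helper names are made up for illustration) of why the
NEG has to be applied to the ABS result rather than to op0: copysign (x, -1)
equals -(|x|) for every non-NaN x, while -x only agrees when x is already
non-negative.

```c
#include <math.h>
#include <stdbool.h>

/* Intended result of the simplification: negate the absolute value.  */
static double
neg_abs (double x)
{
  return -fabs (x);
}

/* What the code produced before the fix: plain negation of x.  */
static double
neg_only (double x)
{
  return -x;
}

/* Return true if CANDIDATE has the same value and sign bit as
   copysign (x, -1.0).  Not meaningful for NaN inputs.  */
static bool
same_as_copysign_neg (double candidate, double x)
{
  double want = copysign (x, -1.0);
  return candidate == want
	 && (signbit (candidate) != 0) == (signbit (want) != 0);
}
```

With x = -2.5, neg_abs gives -2.5 and matches, while neg_only gives 2.5 and
does not.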


Re: [PATCH] Fix (fcopysign x, NEGATIVE_CONST) -> (fneg (fabs x)) simplification [PR112483]

2023-11-12 Thread Xi Ruoyao
On Mon, 2023-11-13 at 07:09 +, Tamar Christina wrote:
> In the case of e.g. loongarch64 it looks like the target actually has an
> fcopysign instruction, so wouldn't this rewriting by simplify-rtx be a
> de-optimization?

Yes it seems a de-optimization on LoongArch.  For this micro-benchmark:

int main()
{
#pragma GCC unroll(100)
  for (int i = 0; i < 10; i++) {
    float x = -1, a = 1.23456;
    asm volatile ("":"+f"(a));
#ifdef DISALLOW_COPYSIGN_OPTIMIZATION
    asm("":"+f"(x));
#endif
    a = __builtin_copysignf(a, x);
    asm(""::"f"(a));
  }
}

If DISALLOW_COPYSIGN_OPTIMIZATION is defined, the result is 0.23 seconds
faster.

I'll submit another patch to disable this.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [PATCH] Simplify vector ((VCE?(a cmp b ? -1 : 0)) < 0) ? c : d to just VCE:((a cmp b) ? (VCE c) : (VCE d)).

2023-11-12 Thread Hongtao Liu
On Fri, Nov 10, 2023 at 2:14 PM liuhongt  wrote:
>
> While working on PR112443, I noticed some missed optimizations:
> after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend
> fails to combine it back to v{,p}blendv{v,ps,pd} since the pattern is
> too complicated, so I think maybe we should handle it at the gimple
> level.
>
> The dump is like
>
>   _1 = c_3(D) >= { 0, 0, 0, 0 };
>   _2 = VEC_COND_EXPR <_1, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }>;
>   _7 = VIEW_CONVERT_EXPR(_2);
>   _8 = VIEW_CONVERT_EXPR(b_6(D));
>   _9 = VIEW_CONVERT_EXPR(a_5(D));
>   _10 = _7 < { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
>   _11 = VEC_COND_EXPR <_10, _8, _9>;
>
> It can be optimized to
>
>   _1 = c_2(D) >= { 0, 0, 0, 0 };
>   _6 = VEC_COND_EXPR <_1, b_5(D), a_4(D)>;
>
> since _7 is either -1 or 0, the selection of _7 < 0 ? _8 : _9 should
> be equal to _1 ? b : a as long as TYPE_PRECISION of the component type
> of the second VEC_COND_EXPR is less than or equal to that of the first one.
> The patch adds a gimple pattern to handle that.

This is the updated patch according to pinski's comments; I'll reply to them here.
> It looks like the outer vec_cond isn't actually relevant to the 
> simplification?
>

My original pattern was wrong, as pinski mentioned; for the new pattern
the outer vec_cond is needed.
> Actually this is invalid transformation. It is only valid for unsigned types.
> The reason why it is invalid is because the sign bit changes when
> going to a smaller type from a larger one.
> It would be valid for equals but no other type.

>  (lt (view_convert? (vec_cond (cmp @0 @1) integer_all_onesp
> integer_zerop)) integer_zerop)
>
> is the relevant part?  I wonder what canonicalizes the inner vec_cond?
>  Did you ever see
> the (view_convert ... missing?

typedef char v32qi __attribute__((vector_size(16)));

v32qi
foo (v32qi a, v32qi b, v32qi c)
{
v32qi d = ~c < 0 ?
__extension__(v32qi){-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1}
: (v32qi){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
return d < 0 ? a : b;
}
Looks like ccp1 can handle non view_convert case, so I'll remove "?"
in view_convert.
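
As a scalar illustration of why the sign test survives the view-convert to a
narrower element type, here is a minimal sketch in plain C.  It models one
32-bit lane reinterpreted as four signed bytes; memcpy stands in for
VIEW_CONVERT_EXPR, and the function names are made up for illustration.  A
lane that is all-ones or all-zeros gives the same "< 0" answer in every byte
as the original comparison did.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Build a lane mask the way "cmp ? -1 : 0" does.  */
static int32_t
lane_mask (bool cond)
{
  return cond ? -1 : 0;
}

/* Reinterpret the 32-bit mask as four signed bytes and return true if
   the "< 0" test on every byte gives the same answer as COND.  */
static bool
byte_tests_match_cond (bool cond)
{
  int32_t wide = lane_mask (cond);
  int8_t narrow[4];
  memcpy (narrow, &wide, sizeof narrow);
  for (int i = 0; i < 4; i++)
    if ((narrow[i] < 0) != cond)
      return false;
  return true;
}

/* Blend A and B byte by byte using the narrow mask, and return true if
   the result equals selecting the whole lane with COND directly.  */
static bool
blend_agrees (bool cond, int32_t a, int32_t b)
{
  int32_t wide = lane_mask (cond);
  int8_t m[4], ab[4], bb[4], out[4];
  memcpy (m, &wide, sizeof m);
  memcpy (ab, &a, sizeof ab);
  memcpy (bb, &b, sizeof bb);
  for (int i = 0; i < 4; i++)
    out[i] = m[i] < 0 ? ab[i] : bb[i];
  int32_t bytewise;
  memcpy (&bytewise, out, sizeof bytewise);
  return bytewise == (cond ? a : b);
}
```

This only covers the narrowing direction with a signed element type, which
is what the TYPE_PRECISION and !TYPE_UNSIGNED checks in the pattern require.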
>
> gcc/ChangeLog:
>
> * match.pd (VCE:(a cmp b ? -1 : 0) < 0) ? c : d ---> VCE:((a
> cmp b) ? (VCE:c) : (VCE:d)): New gimple simplification.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx512vl-blendv-3.c: New test.
> * gcc.target/i386/blendv-3.c: New test.
> ---
>  gcc/match.pd  | 19 
>  .../gcc.target/i386/avx512vl-blendv-3.c   |  6 +++
>  gcc/testsuite/gcc.target/i386/blendv-3.c  | 46 +++
>  3 files changed, 71 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/blendv-3.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index dbc811b2b38..4d823882a7c 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -5170,6 +5170,25 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (if (optimize_vectors_before_lowering_p () && types_match (@0, @3))
>(vec_cond (bit_and @0 (bit_not @3)) @2 @1)))
>
> +(for cmp (simple_comparison)
> + (simplify
> +  (vec_cond
> +(lt (view_convert?@5 (vec_cond@6 (cmp@4 @0 @1)
> +integer_all_onesp
> +integer_zerop))
> + integer_zerop) @2 @3)
> +  (if (VECTOR_INTEGER_TYPE_P (TREE_TYPE (@0))
> +   && VECTOR_INTEGER_TYPE_P (TREE_TYPE (@5))
> +   && !TYPE_UNSIGNED (TREE_TYPE (@5))
> +   && VECTOR_TYPE_P (TREE_TYPE (@6))
> +   && VECTOR_TYPE_P (type)
> +   && (TYPE_PRECISION (TREE_TYPE (type))
> + <= TYPE_PRECISION (TREE_TYPE (TREE_TYPE (@6))))
> +   && TYPE_SIZE (type) == TYPE_SIZE (TREE_TYPE (@6)))
> +   (with { tree vtype = TREE_TYPE (@6);}
> + (view_convert:type
> +   (vec_cond @4 (view_convert:vtype @2) (view_convert:vtype @3)))))))
> +
>  /* c1 ? c2 ? a : b : b  -->  (c1 & c2) ? a : b  */
>  (simplify
>   (vec_cond @0 (vec_cond:s @1 @2 @3) @3)
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c 
> b/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
> new file mode 100644
> index 000..2777e72ab5f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512vl -mavx512bw -O2" } */
> +/* { dg-final { scan-assembler-times {vp?blendv(?:b|p[sd])[ \t]*} 6 } } */
> +/* { dg-final { scan-assembler-not {vpcmp} } } */
> +
> +#include "blendv-3.c"
> diff --git a/gcc/testsuite/gcc.target/i386/blendv-3.c 
> b/gcc/testsuite/gcc.target/i386/blendv-3.c
> new file mode 100644
> index 000..fa0fb067a73
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/blendv-3.c
> @@ -0,0 +1,46 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O2" } */
> +/* { dg-final { scan-assembler-times {vp?blendv(?:b|p[sd])[ \t]*} 6 } } */
> +/* { dg-final { scan-assembler-not {vpcmp} }

Re: [x86 PATCH] Improve reg pressure of double-word right-shift then truncate.

2023-11-12 Thread Uros Bizjak
On Sun, Nov 12, 2023 at 10:03 PM Roger Sayle  wrote:
>
>
> This patch improves register pressure during reload, inspired by PR 97756.
> Normally, a double-word right-shift by a constant produces a double-word
> result, the highpart of which is dead when followed by a truncation.
> The dead code calculating the high part gets cleaned up post-reload, so
> the issue isn't normally visible, except for the increased register
> pressure during reload, sometimes leading to odd register assignments.
> Providing a post-reload splitter, which clobbers a single wordmode
> result register instead of a doubleword result register, helps (a bit).
>
> An example demonstrating this effect is:
>
> #define MASK60 ((1ul << 60) - 1)
> unsigned long foo (__uint128_t n)
> {
>   unsigned long a = n & MASK60;
>   unsigned long b = (n >> 60);
>   b = b & MASK60;
>   unsigned long c = (n >> 120);
>   return a+b+c;
> }
>
> which currently with -O2 generates (13 instructions):
> foo:movabsq $1152921504606846975, %rcx
> xchgq   %rdi, %rsi
> movq%rsi, %rax
> shrdq   $60, %rdi, %rax
> movq%rax, %rdx
> movq%rsi, %rax
> movq%rdi, %rsi
> andq%rcx, %rax
> shrq$56, %rsi
> andq%rcx, %rdx
> addq%rsi, %rax
> addq%rdx, %rax
> ret
>
> with this patch, we generate one less mov (12 instructions):
> foo:movabsq $1152921504606846975, %rcx
> xchgq   %rdi, %rsi
> movq%rdi, %rdx
> movq%rsi, %rax
> movq%rdi, %rsi
> shrdq   $60, %rdi, %rdx
> andq%rcx, %rax
> shrq$56, %rsi
> addq%rsi, %rax
> andq%rcx, %rdx
> addq%rdx, %rax
> ret
>
> The significant difference is easier to see via diff:
> <   shrdq   $60, %rdi, %rax
> <   movq%rax, %rdx
> ---
> >   shrdq   $60, %rdi, %rdx
>
>
> Admittedly a single "mov" isn't much of a saving on modern architectures,
> but as demonstrated by the PR, people still track the number of them.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-11-12  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.md (3_doubleword_lowpart): New
> define_insn_and_split to optimize register usage of doubleword
> right shifts followed by truncation.


+;; Split truncations of TImode right shifts into x86_64_shrd_1.
+;; Split truncations of DImode right shifts into x86_shrd_1.

You can just say

;; Split truncations of double word right shifts into x86_shrd_1.

OK with the above change.

Thanks,
Uros.

>
> Thanks in advance,
> Roger
> --
>
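
For reference, a minimal sketch in plain C of the arithmetic the splitter
relies on: the low word of a double-word right shift needs only two
single-word inputs and one single-word result, which is what SHRD computes.
The function names are made up for illustration, and __uint128_t is assumed
to be available (a GCC/Clang extension on 64-bit targets).

```c
#include <stdint.h>

/* Low 64 bits of a 128-bit right shift, computed through the
   double-word type.  Assumes 0 <= count < 128.  */
static uint64_t
shift_then_truncate (uint64_t hi, uint64_t lo, unsigned count)
{
  __uint128_t n = ((__uint128_t) hi << 64) | lo;
  return (uint64_t) (n >> count);
}

/* The same value from two single-word operations, the way a SHRD-style
   instruction forms it.  Only valid for 0 < count < 64.  */
static uint64_t
shrd_lowpart (uint64_t hi, uint64_t lo, unsigned count)
{
  return (lo >> count) | (hi << (64 - count));
}
```

For the shift by 60 in the example above, both functions return the same
word, so no double-word destination is needed once the high part is dead.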


RE: [PATCH v3 2/2]middle-end match.pd: optimize fneg (fabs (x)) to copysign (x, -1) [PR109154]

2023-11-12 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Monday, November 13, 2023 7:09 AM
> To: Andrew Pinski 
> Cc: Tamar Christina ; Prathamesh Kulkarni
> ; gcc-patches@gcc.gnu.org; nd
> ; j...@ventanamicro.com
> Subject: Re: [PATCH v3 2/2]middle-end match.pd: optimize fneg (fabs (x)) to
> copysign (x, -1) [PR109154]
> 
> On Fri, 10 Nov 2023, Andrew Pinski wrote:
> 
> > > On Fri, Nov 10, 2023 at 5:12 AM Richard Biener wrote:
> > >
> > > On Fri, 10 Nov 2023, Tamar Christina wrote:
> > >
> > > >
> > > > Hi Prathamesh,
> > > >
> > > > > Yes Arm requires SIMD for copysign. The testcases fail because they
> > > > > don't turn on Neon.
> > > >
> > > > I'll update them.
> > >
> > > On x86_64 with -m32 I see
> > >
> > > > FAIL: gcc.dg/pr55152-2.c scan-tree-dump-times optimized ".COPYSIGN" 1
> > > > FAIL: gcc.dg/pr55152-2.c scan-tree-dump-times optimized "ABS_EXPR" 1
> > > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= ABS_EXPR" 1
> > > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= -" 1
> > > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= .COPYSIGN" 2
> > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > "Deleting[^n]* = -" 4
> > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > "Deleting[^n]* = .COPYSIGN" 2
> > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > "Deleting[^n]* = ABS_EXPR <" 1
> > > FAIL: gcc.dg/tree-ssa/phi-opt-24.c scan-tree-dump-not phiopt2 "if"
> > >
> > > maybe add a copysign effective target?
> >
> > I get the feeling that the internal function for copysign should not
> > be a direct internal function for scalar modes and call
> > expand_copysign instead when expanding.
> > This will fix some if not all of the issues where COPYSIGN is now
> > trying to show up.
> 
> But then I'd rather have a COPYSIGN_EXPR tree code, leaving internal-fns to
> optab mappings.  We've discussed this and discarded any of this as too much
> work right now.

I have a patch mostly written for next stage1 that will allow IFNs to have a
fallback target hook as an option.  This would then allow us to remove things
like XORSIGN as well.  At the moment it's just a prototype and I don't have
time to finish it before stage1 ends, but it cleans things up a lot in targets
that don't use the copysign RTX code.

Regards,
Tamar
> 
> But yes, the situation is a bit messy (as also discussed).
> 
> Richard.
> 
> > > By the way, this is most likely PR 88786 (and PR 112468 and a few
> > > others), and PR 58797.
> >
> > Thanks,
> > Andrew
> >
> >
> >
> > >
> > > > Regards,
> > > > Tamar
> > > > 
> > > > From: Prathamesh Kulkarni 
> > > > Sent: Friday, November 10, 2023 12:24 PM
> > > > To: Tamar Christina 
> > > > Cc: gcc-patches@gcc.gnu.org ; nd
> > > > ; rguent...@suse.de ;
> > > > j...@ventanamicro.com 
> > > > Subject: Re: [PATCH v3 2/2]middle-end match.pd: optimize fneg
> > > > (fabs (x)) to copysign (x, -1) [PR109154]
> > > >
> > > > > On Mon, 6 Nov 2023 at 15:50, Tamar Christina wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > This patch transforms fneg (fabs (x)) into copysign (x, -1)
> > > > > which is more canonical and allows a target to expand this
> > > > > > sequence efficiently.  Such sequences are common in scientific
> > > > > > code working with gradients.
> > > > >
> > > > > There is an existing canonicalization of copysign (x, -1) to
> > > > > fneg (fabs (x)) which I remove since this is a less efficient
> > > > > form.  The testsuite is also updated in light of this.
> > > > >
> > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > > Hi Tamar,
> > > > It seems the patch caused following regressions on arm:
> > > >
> > > > Running gcc:gcc.dg/dg.exp ...
> > > > FAIL: gcc.dg/pr55152-2.c scan-tree-dump-times optimized
> > > > ".COPYSIGN" 1
> > > > FAIL: gcc.dg/pr55152-2.c scan-tree-dump-times optimized "ABS_EXPR"
> > > > 1
> > > >
> > > > Running gcc:gcc.dg/tree-ssa/tree-ssa.exp ...
> > > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= -"
> > > > 1
> > > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "=
> > > > .COPYSIGN" 2
> > > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "=
> > > > ABS_EXPR" 1
> > > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > > "Deleting[^\\n]* = -" 4
> > > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > > "Deleting[^\\n]* = ABS_EXPR <" 1
> > > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > > "Deleting[^\\n]* = \\.COPYSIGN" 2
> > > > FAIL: gcc.dg/tree-ssa/copy-sign-2.c scan-tree-dump-times optimized
> > > > ".COPYSIGN" 1
> > > > FAIL: gcc.dg/tree-ssa/copy-sign-2.c scan-tree-dump-times optimized
> > > > "ABS" 1
> > > > FAIL: gcc.dg/tree-ssa/mult-abs-2.c scan-tree-dump-times gimple
> > > > ".COPYSIGN" 4
> > > > FAIL: gcc.dg/tree-ssa/mult-abs-2.c scan-tree-dump-times gimple
>

RE: [PATCH] Fix (fcopysign x, NEGATIVE_CONST) -> (fneg (fabs x)) simplification [PR112483]

2023-11-12 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Monday, November 13, 2023 6:55 AM
> To: Xi Ruoyao 
> Cc: gcc-patches@gcc.gnu.org; chenglulu ;
> i...@xen0n.name; xucheng...@loongson.cn; Tamar Christina
> ; tschwi...@gcc.gnu.org; Roger Sayle
> 
> Subject: Re: [PATCH] Fix (fcopysign x, NEGATIVE_CONST) -> (fneg (fabs x))
> simplification [PR112483]
> 
> On Sun, Nov 12, 2023 at 9:27 PM Xi Ruoyao  wrote:
> >
> > (fcopysign x, NEGATIVE_CONST) can be simplified to (fneg (fabs x)),
> > but a logic error in the code caused it mistakenly simplified to (fneg
> > x) instead.

The fix aside, I actually wonder if simplify-rtx.cc should be doing this at all.
The mid-end didn't do it because the target said it had an optab for the
copysign operation.  Otherwise during expand_COPYSIGN it would have been
expanded as FNEG (FABS (..)) already.

In the case of e.g. loongarch64 it looks like the target actually has an
fcopysign instruction, so wouldn't this rewriting by simplify-rtx be a
de-optimization?

Thanks,
Tamar
> 
> OK.
> 
> > gcc/ChangeLog:
> >
> > PR rtl-optimization/112483
> > * simplify-rtx.cc (simplify_binary_operation_1) <case COPYSIGN>:
> > Fix the simplification of (fcopysign x, NEGATIVE_CONST).
> > ---
> >
> > Bootstrapped and regtested on loongarch64-linux-gnu and
> > x86_64-linux-gnu.  Ok for trunk?
> >
> >  gcc/simplify-rtx.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index
> > 69d87579d9c..2d2e5a3c1ca 100644
> > --- a/gcc/simplify-rtx.cc
> > +++ b/gcc/simplify-rtx.cc
> > @@ -4392,7 +4392,7 @@ simplify_ashift:
> >   real_convert (&f1, mode, CONST_DOUBLE_REAL_VALUE (trueop1));
> >   rtx tmp = simplify_gen_unary (ABS, mode, op0, mode);
> >   if (REAL_VALUE_NEGATIVE (f1))
> > -   tmp = simplify_gen_unary (NEG, mode, op0, mode);
> > +   tmp = simplify_gen_unary (NEG, mode, tmp, mode);
> >   return tmp;
> > }
> >if (GET_CODE (op0) == NEG || GET_CODE (op0) == ABS)
> > --
> > 2.42.1
> >


Re: [PATCH v3 2/2]middle-end match.pd: optimize fneg (fabs (x)) to copysign (x, -1) [PR109154]

2023-11-12 Thread Richard Biener
On Fri, 10 Nov 2023, Andrew Pinski wrote:

> On Fri, Nov 10, 2023 at 5:12 AM Richard Biener wrote:
> >
> > On Fri, 10 Nov 2023, Tamar Christina wrote:
> >
> > >
> > > Hi Prathamesh,
> > >
> > > Yes Arm requires SIMD for copysign. The testcases fail because they don't 
> > > turn on Neon.
> > >
> > > I'll update them.
> >
> > On x86_64 with -m32 I see
> >
> > FAIL: gcc.dg/pr55152-2.c scan-tree-dump-times optimized ".COPYSIGN" 1
> > FAIL: gcc.dg/pr55152-2.c scan-tree-dump-times optimized "ABS_EXPR" 1
> > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= ABS_EXPR"
> > 1
> > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= -" 1
> > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= .COPYSIGN"
> > 2
> > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > "Deleting[^n]* = -" 4
> > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > "Deleting[^n]* = .COPYSIGN" 2
> > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > "Deleting[^n]* = ABS_EXPR <" 1
> > FAIL: gcc.dg/tree-ssa/phi-opt-24.c scan-tree-dump-not phiopt2 "if"
> >
> > maybe add a copysign effective target?
> 
> I get the feeling that the internal function for copysign should not
> be a direct internal function for scalar modes and call
> expand_copysign instead when expanding.
> This will fix some if not all of the issues where COPYSIGN is now
> trying to show up.

But then I'd rather have a COPYSIGN_EXPR tree code, leaving internal-fns
to optab mappings.  We've discussed this and discarded any of this as
too much work right now.

But yes, the situation is a bit messy (as also discussed).

Richard.
 
> By the way, this is most likely PR 88786 (and PR 112468 and a few
> others), and PR 58797.
>
> Thanks,
> Andrew
> 
> 
> 
> >
> > > Regards,
> > > Tamar
> > > 
> > > From: Prathamesh Kulkarni 
> > > Sent: Friday, November 10, 2023 12:24 PM
> > > To: Tamar Christina 
> > > Cc: gcc-patches@gcc.gnu.org ; nd ; 
> > > rguent...@suse.de ; j...@ventanamicro.com 
> > > 
> > > Subject: Re: [PATCH v3 2/2]middle-end match.pd: optimize fneg (fabs (x)) 
> > > to copysign (x, -1) [PR109154]
> > >
> > > On Mon, 6 Nov 2023 at 15:50, Tamar Christina  
> > > wrote:
> > > >
> > > > Hi All,
> > > >
> > > > This patch transforms fneg (fabs (x)) into copysign (x, -1) which is 
> > > > more
> > > > canonical and allows a target to expand this sequence efficiently.  Such
> > > > sequences are common in scientific code working with gradients.
> > > >
> > > > There is an existing canonicalization of copysign (x, -1) to fneg (fabs 
> > > > (x))
> > > > which I remove since this is a less efficient form.  The testsuite is 
> > > > also
> > > > updated in light of this.
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > Hi Tamar,
> > > It seems the patch caused following regressions on arm:
> > >
> > > Running gcc:gcc.dg/dg.exp ...
> > > FAIL: gcc.dg/pr55152-2.c scan-tree-dump-times optimized ".COPYSIGN" 1
> > > FAIL: gcc.dg/pr55152-2.c scan-tree-dump-times optimized "ABS_EXPR" 1
> > >
> > > Running gcc:gcc.dg/tree-ssa/tree-ssa.exp ...
> > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= -" 1
> > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= 
> > > .COPYSIGN" 2
> > > FAIL: gcc.dg/tree-ssa/abs-4.c scan-tree-dump-times optimized "= ABS_EXPR" 
> > > 1
> > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > "Deleting[^\\n]* = -" 4
> > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > "Deleting[^\\n]* = ABS_EXPR <" 1
> > > FAIL: gcc.dg/tree-ssa/backprop-6.c scan-tree-dump-times backprop
> > > "Deleting[^\\n]* = \\.COPYSIGN" 2
> > > FAIL: gcc.dg/tree-ssa/copy-sign-2.c scan-tree-dump-times optimized 
> > > ".COPYSIGN" 1
> > > FAIL: gcc.dg/tree-ssa/copy-sign-2.c scan-tree-dump-times optimized "ABS" 1
> > > FAIL: gcc.dg/tree-ssa/mult-abs-2.c scan-tree-dump-times gimple 
> > > ".COPYSIGN" 4
> > > FAIL: gcc.dg/tree-ssa/mult-abs-2.c scan-tree-dump-times gimple "ABS" 4
> > > FAIL: gcc.dg/tree-ssa/phi-opt-24.c scan-tree-dump-not phiopt2 "if"
> > > Link to log files:
> > > https://ci.linaro.org/job/tcwg_gcc_check--master-arm-build/1240/artifact/artifacts/00-sumfiles/
> > >
> > > Even for following test-case:
> > > double g (double a)
> > > {
> > >   double t1 = fabs (a);
> > >   double t2 = -t1;
> > >   return t2;
> > > }
> > >
> > > It seems, the pattern gets applied but doesn't get eventually
> > > simplified to copysign(a, -1).
> > > forwprop dump shows:
> > > Applying pattern match.pd:1131, gimple-match-4.cc:4134
> > > double g (double a)
> > > {
> > >   double t2;
> > >   double t1;
> > >
> > >:
> > >   t1_2 = ABS_EXPR ;
> > >   t2_3 = -t1_2;
> > >   return t2_3;
> > >
> > > }
> > >
> > > while on x86_64:
> > > Applying pattern match.pd:1131, gimple-match-4.cc:4134
> > > gimple_simplified to t2_3 = .COPYSIGN (a_1(D), -1.0e+0);
> > > Removin

Re: [PATCH] gimple-range-cache: Fix ICEs when dumping details [PR111967]

2023-11-12 Thread Richard Biener
On Sat, Nov 11, 2023 at 9:36 AM Jakub Jelinek  wrote:
>
> Hi!
>
> The following testcase ICEs when dumping details.
> When m_ssa_ranges vector is created, it is safe_grow_cleared (num_ssa_names),
> but when some new SSA_NAME is added, we strangely grow it to
> num_ssa_names + 1 instead and later on the 3 argument dump method
> iterates from 1 to m_ssa_ranges.length () - 1 and uses ssa_name (x)
> on each; but because set_bb_range grew it one too much, ssa_name
> (m_ssa_ranges.length () - 1) might be after the end of the ssanames
> vector and ICE.
>
> The fix grows the vector consistently only to num_ssa_names,
> doesn't waste time checking m_ssa_ranges[0] because there is no
> ssa_names (0), it is always NULL, before using ssa_name (x) checks
> if we'll need it at all (we check later if m_ssa_ranges[x] is non-NULL,
> so we might check it earlier as well) and also in the last loop
> iterates until m_ssa_ranges.length () rather than num_ssa_names, I don't
> see a reason for the inconsistency and in theory some SSA_NAME could be
> added without set_bb_range called for it and the vector could be shorter
> than the ssanames vector.
>
> To actually fix the ICE, either the first hunk or the last 2 hunks
> would be enough, but I think it doesn't hurt to change all the spots.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

OK.

> 2023-11-11  Jakub Jelinek  
>
> PR tree-optimization/111967
> * gimple-range-cache.cc (block_range_cache::set_bb_range): Grow
> m_ssa_ranges to num_ssa_names rather than num_ssa_names + 1.
> (block_range_cache::dump): Iterate from 1 rather than 0.  Don't use
> ssa_name (x) unless m_ssa_ranges[x] is non-NULL.  Iterate to
> m_ssa_ranges.length () rather than num_ssa_names.
>
> * gcc.dg/tree-ssa/pr111967.c: New test.
>
> --- gcc/gimple-range-cache.cc.jj2023-10-10 11:56:05.819220320 +0200
> +++ gcc/gimple-range-cache.cc   2023-11-10 17:06:52.482867324 +0100
> @@ -390,7 +390,7 @@ block_range_cache::set_bb_range (tree na
>  {
>unsigned v = SSA_NAME_VERSION (name);
>if (v >= m_ssa_ranges.length ())
> -m_ssa_ranges.safe_grow_cleared (num_ssa_names + 1);
> +m_ssa_ranges.safe_grow_cleared (num_ssa_names);
>
>if (!m_ssa_ranges[v])
>  {
> @@ -465,7 +465,7 @@ void
>  block_range_cache::dump (FILE *f)
>  {
>unsigned x;
> -  for (x = 0; x < m_ssa_ranges.length (); ++x)
> +  for (x = 1; x < m_ssa_ranges.length (); ++x)
>  {
>if (m_ssa_ranges[x])
> {
> @@ -487,11 +487,14 @@ block_range_cache::dump (FILE *f, basic_
>bool summarize_varying = false;
>for (x = 1; x < m_ssa_ranges.length (); ++x)
>  {
> +  if (!m_ssa_ranges[x])
> +   continue;
> +
>if (!gimple_range_ssa_p (ssa_name (x)))
> continue;
>
>Value_Range r (TREE_TYPE (ssa_name (x)));
> -  if (m_ssa_ranges[x] && m_ssa_ranges[x]->get_bb_range (r, bb))
> +  if (m_ssa_ranges[x]->get_bb_range (r, bb))
> {
>   if (!print_varying && r.varying_p ())
> {
> @@ -508,13 +511,16 @@ block_range_cache::dump (FILE *f, basic_
>if (summarize_varying)
>  {
>fprintf (f, "VARYING_P on entry : ");
> -  for (x = 1; x < num_ssa_names; ++x)
> +  for (x = 1; x < m_ssa_ranges.length (); ++x)
> {
> + if (!m_ssa_ranges[x])
> +   continue;
> +
>   if (!gimple_range_ssa_p (ssa_name (x)))
> continue;
>
>   Value_Range r (TREE_TYPE (ssa_name (x)));
> - if (m_ssa_ranges[x] && m_ssa_ranges[x]->get_bb_range (r, bb))
> + if (m_ssa_ranges[x]->get_bb_range (r, bb))
> {
>   if (r.varying_p ())
> {
> --- gcc/testsuite/gcc.dg/tree-ssa/pr111967.c.jj   2023-11-10 16:45:54.006085324 +0100
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr111967.c   2023-11-10 17:03:17.257844360 +0100
> @@ -0,0 +1,15 @@
> +/* PR tree-optimization/111967 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fno-tree-forwprop -fdump-tree-evrp-all" } */
> +
> +void bar (char *);
> +int a;
> +char *b;
> +
> +void
> +foo (void)
> +{
> +  long c = a & 3;
> +  if (c)
> +bar (b + c);
> +}
>
> Jakub
>


Re: [RFC] Intel AVX10.1 Compiler Design and Support

2023-11-12 Thread Hongtao Liu
On Fri, Nov 10, 2023 at 6:15 PM Richard Biener
 wrote:
>
> On Fri, Nov 10, 2023 at 2:42 AM Haochen Jiang  wrote:
> >
> > Hi all,
> >
> > > This RFC patch aims to add AVX10.1 options. After we added -m[no-]evex512
> > > support, it is a lot easier to add them compared to the August version.
> > > Details for AVX10 are shown below:
> >
> > Intel Advanced Vector Extensions 10 (Intel AVX10) Architecture Specification
> > It describes the Intel Advanced Vector Extensions 10 Instruction Set
> > Architecture.
> > https://cdrdv2.intel.com/v1/dl/getContent/784267
> >
> > > The Converged Vector ISA: Intel Advanced Vector Extensions 10 Technical Paper
> > > It provides introductory information regarding the converged vector ISA:
> > > Intel Advanced Vector Extensions 10.
> > https://cdrdv2.intel.com/v1/dl/getContent/784343
> >
> > Our proposal is to take AVX10.1-256 and AVX10.1-512 as two "virtual" ISAs in
> > the compiler. AVX10.1-512 will imply AVX10.1-256. They will not enable
> > anything at first. At the end of the option handling, we will check whether
> > the two bits are set. If AVX10.1-256 is set, we will set the AVX512 related
> > ISA bits. AVX10.1-512 will further set EVEX512 ISA bit.
> >
> > > It means that AVX10 options will be separated from the existing AVX512 and
> > > the newly added -m[no-]evex512 options. AVX10 and AVX512 options will control
> > > (enable/disable/set vector size) the AVX512 features underneath
> > > independently. If there’s potential overlap or conflict between AVX10 and
> > > AVX512 options, some rules are provided to define the behavior, which will
> > > be described below.
> >
> > avx10.1 option will be provided as an alias of avx10.1-256.
> >
> > > In the future, the AVX10 options will imply each other like this:
> > >
> > > AVX10.1-256 <----- AVX10.1-512
> > >      ^                  ^
> > >      |                  |
> > >
> > > AVX10.2-256 <----- AVX10.2-512
> > >      ^                  ^
> > >      |                  |
> > >
> > > AVX10.3-256 <----- AVX10.3-512
> > >      ^                  ^
> > >      |                  |
> >
> > > Each of them will have its own option to enable/disable the corresponding
> > > features. The alias avx10.x will also be provided.
> > >
> > > As mentioned in the August version of the RFC, since we lean towards the
> > > adoption of AVX10 instead of AVX512 from now on, we don’t recommend that
> > > users combine the AVX10 and legacy AVX512 options.
>
> I wonder whether adoption could be made easier by also providing a
> -mavx10[.0] level that removes some of the more obscure sub-ISA requirements
> to cover more existing implementations (I'd not add -mavx10.0-512 here).
> I'd require only skylake-AVX512 features here, basically all non-KNL AVX512
> CPUs should have a "virtual" AVX10 level that allows to use that feature set,
We have -mno-evex512, which can cover those cases, so what you want is like a
simple alias of "-march=skylake-avx512 -mno-evex512"?
> restricted to 256bits so future AVX10-256 implementations can handle it
> as well as all existing (and relevant, which excludes KNL) AVX512
> implementations.
>
> Otherwise AVX10 is really a hard sell (as AVX512 was originally).
It's a rebranding of the existing AVX512 to AVX10; AVX10.0 would just
complicate things further (considering we already have x86-64-v4, which
is different from skylake-avx512).
>
> > > However, we would like to introduce some
> > > simple rules for users when it comes to combining them.
> > >
> > > 1. Enabling AVX10 and AVX512 on the same command line with different vector
> > > sizes will lead to a warning message. The compiler will then enable AVX10
> > > with the longer, i.e., 512-bit, vector size.
> > >
> > > If the vector sizes are the same (e.g. -mavx10.1-256 -mavx512f -mno-evex512,
> > > -mavx10.1-512 -mavx512f), it will be valid with the corresponding vector
> > > size.
> > >
> > > 2. The -mno-avx10.1 option can’t disable any features enabled by AVX512
> > > options or impact the vector size, and vice versa. The compiler will emit
> > > warnings if necessary.
> >
> > For the auto dispatch support including function multi versioning, function
> > attribute usage, the behavior will be identical to compiler options.
> >
> > If you have any questions, feel free to ask in this thread.
> >
> > Thx,
> > Haochen
> >
> >



-- 
BR,
Hongtao


Re: [PATCH] Fix (fcopysign x, NEGATIVE_CONST) -> (fneg (fabs x)) simplification [PR112483]

2023-11-12 Thread Richard Biener
On Sun, Nov 12, 2023 at 9:27 PM Xi Ruoyao  wrote:
>
> (fcopysign x, NEGATIVE_CONST) can be simplified to (fneg (fabs x)), but
> a logic error in the code caused it to be mistakenly simplified to (fneg x)
> instead.

OK.

> gcc/ChangeLog:
>
> PR rtl-optimization/112483
> * simplify-rtx.cc (simplify_binary_operation_1) :
> Fix the simplification of (fcopysign x, NEGATIVE_CONST).
> ---
>
> Bootstrapped and regtested on loongarch64-linux-gnu and
> x86_64-linux-gnu.  Ok for trunk?
>
>  gcc/simplify-rtx.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index 69d87579d9c..2d2e5a3c1ca 100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -4392,7 +4392,7 @@ simplify_ashift:
>   real_convert (&f1, mode, CONST_DOUBLE_REAL_VALUE (trueop1));
>   rtx tmp = simplify_gen_unary (ABS, mode, op0, mode);
>   if (REAL_VALUE_NEGATIVE (f1))
> -   tmp = simplify_gen_unary (NEG, mode, op0, mode);
> +   tmp = simplify_gen_unary (NEG, mode, tmp, mode);
>   return tmp;
> }
>if (GET_CODE (op0) == NEG || GET_CODE (op0) == ABS)
> --
> 2.42.1
>
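
As a source-level illustration of why the two forms differ (this is not the
RTL code from the patch; buggy_result and fixed_result are hypothetical
helper names), copysign with a negative second argument always yields a
non-positive result, whereas plain negation flips the sign of an already
negative input:

```c
#include <assert.h>
#include <math.h>

/* What the old simplification effectively computed for
   copysign (x, NEGATIVE_CONST): plain negation.  */
static double
buggy_result (double x)
{
  return -x;
}

/* What the fixed simplification computes: negated absolute value.  */
static double
fixed_result (double x)
{
  return -fabs (x);
}
```

For x = 2.0 both helpers agree with copysign (2.0, -1.0). For x = -2.0,
copysign (-2.0, -1.0) is -2.0, which matches fixed_result but not
buggy_result (which returns 2.0).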


Re: [PATCH] testsuite: Fix bad-mapper-1.C test failures with posix_spawn

2023-11-12 Thread Richard Biener
On Sun, Nov 12, 2023 at 12:12 AM Brendan Shanks  wrote:
>
> bad-mapper-1.C has been failing since the posix_spawn codepath was added
> to libiberty; adjust the check to accept the changed error message.
>
> Patch has been verified on x86_64 Linux.

OK

> gcc/testsuite:
>
> * g++.dg/modules/bad-mapper-1.C: Also accept posix_spawn.
>
> Signed-off-by: Brendan Shanks 
> ---
>  gcc/testsuite/g++.dg/modules/bad-mapper-1.C | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/g++.dg/modules/bad-mapper-1.C b/gcc/testsuite/g++.dg/modules/bad-mapper-1.C
> index 4b2312885d8..53e3e1d0c88 100644
> --- a/gcc/testsuite/g++.dg/modules/bad-mapper-1.C
> +++ b/gcc/testsuite/g++.dg/modules/bad-mapper-1.C
> @@ -1,6 +1,6 @@
>  //  { dg-additional-options "-fmodules-ts -fmodule-mapper=|this-will-not-work" }
>  import unique1.bob;
> -// { dg-error "-:failed (exec|CreateProcess).*mapper.* .*this-will-not-work" "" { target { ! { *-*-darwin[89]* *-*-darwin10* } } } 0 }
> +// { dg-error "-:failed (exec|CreateProcess|posix_spawn).*mapper.* .*this-will-not-work" "" { target { ! { *-*-darwin[89]* *-*-darwin10* } } } 0 }
>  // { dg-prune-output "fatal error:" }
>  // { dg-prune-output "failed to read" }
>  // { dg-prune-output "compilation terminated" }
> --
> 2.41.0
>
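
A rough way to see the effect of the widened alternation is to try both
patterns with POSIX extended regular expressions. Caveats: DejaGnu matches
with Tcl regexps rather than POSIX ones, the leading "-:" of the dg-error
string is left out here on the assumption that it is directive syntax rather
than part of the expression, and the two message strings are hypothetical
stand-ins, not the compiler's exact diagnostics.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if MESSAGE matches the POSIX extended regex PATTERN.  */
static bool
matches (const char *pattern, const char *message)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;
  bool ok = regexec (&re, message, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}

/* The patterns before and after the patch, without the "-:" prefix.  */
static const char old_pattern[]
  = "failed (exec|CreateProcess).*mapper.* .*this-will-not-work";
static const char new_pattern[]
  = "failed (exec|CreateProcess|posix_spawn).*mapper.* .*this-will-not-work";

/* Hypothetical diagnostic texts, for illustration only.  */
static const char exec_msg[] = "failed execvp mapper 'this-will-not-work'";
static const char spawn_msg[]
  = "failed posix_spawn mapper 'this-will-not-work'";
```

With these stand-in messages, both patterns accept exec_msg, but only
new_pattern accepts spawn_msg.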


Re: [PATCH] Avoid generate vblendps with ymm16+

2023-11-12 Thread Hongtao Liu
On Sat, Nov 11, 2023 at 4:11 AM Jakub Jelinek  wrote:
>
> On Thu, Nov 09, 2023 at 03:27:11PM +0800, Hongtao Liu wrote:
> > On Thu, Nov 9, 2023 at 3:15 PM Hu, Lin1  wrote:
> > >
> > > > This patch aims to avoid generating vblendps with ymm16+. It has been
> > > > bootstrapped and tested on x86_64-pc-linux-gnu{-m32,-m64}. Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/112435
> > > > * config/i386/sse.md: Adding constraints to restrict the generation of
> > > > vblendps.
> > > It should be "Don't output vblendps when evex sse reg or gpr32 is involved."
> > > The rest LGTM.
>
> I've missed this patch, so wrote my own today, and am wondering
>
> 1) if it isn't better to use separate alternative instead of
>x86_evex_reg_mentioned_p, like in the patch below
vblendps doesn't support gpr32, which is checked by x86_evex_reg_mentioned_p.
We need to use xjm for operands[1]. (I think we don't need to set
attribute addr to gpr16 for alternative 0 since alternative 1 is
always available and recog will match alternative 1 when gpr32 is used.)

> 2) why do you need the last two hunks in sse.md, both avx2_permv2ti and
>*avx_vperm2f128_nozero insns only use x in constraints, never v,
>so x86_evex_reg_mentioned_p ought to be always false there
True.
>
> Here is the untested patch, of course you have more testcases (though, I
> think it is better to test dg-do assemble with avx512vl target rather than
> dg-do compile and scan the assembler, after all, the problem was that it
> didn't assemble).
>
> 2023-11-10  Jakub Jelinek  
>
> PR target/112435
> * config/i386/sse.md (avx512vl_shuf_32x4_1,
> avx512dq_shuf_64x2_1): Add
> alternative with just x instead of v constraints and use vblendps
> as optimization only with that alternative.
>
> * gcc.target/i386/avx512vl-pr112435.c: New test.
>
> --- gcc/config/i386/sse.md.jj   2023-11-09 09:04:18.616543403 +0100
> +++ gcc/config/i386/sse.md  2023-11-10 15:56:44.138499931 +0100
> @@ -19235,11 +19235,11 @@ (define_expand "avx512dq_shuf_  })
>
>  (define_insn "avx512dq_shuf_64x2_1"
> -  [(set (match_operand:VI8F_256 0 "register_operand" "=v")
> +  [(set (match_operand:VI8F_256 0 "register_operand" "=x,v")
> (vec_select:VI8F_256
>   (vec_concat:
> -   (match_operand:VI8F_256 1 "register_operand" "v")
> -   (match_operand:VI8F_256 2 "nonimmediate_operand" "vm"))
> +   (match_operand:VI8F_256 1 "register_operand" "x,v")
> +   (match_operand:VI8F_256 2 "nonimmediate_operand" "xm,vm"))
>   (parallel [(match_operand 3 "const_0_to_3_operand")
>  (match_operand 4 "const_0_to_3_operand")
>  (match_operand 5 "const_4_to_7_operand")
> @@ -19254,7 +19254,7 @@ (define_insn "avx512dq_shu
>mask = INTVAL (operands[3]) / 2;
>mask |= (INTVAL (operands[5]) - 4) / 2 << 1;
>operands[3] = GEN_INT (mask);
> -  if (INTVAL (operands[3]) == 2 && !)
> +  if (INTVAL (operands[3]) == 2 && ! && which_alternative == 0)
>  return "vblendps\t{$240, %2, %1, %0|%0, %1, %2, 240}";
>return "vshuf64x2\t{%3, %2, %1, 
> %0|%0, %1, %2, %3}";
>  }
> @@ -19386,11 +19386,11 @@ (define_expand "avx512vl_shuf_  })
>
>  (define_insn "avx512vl_shuf_32x4_1"
> -  [(set (match_operand:VI4F_256 0 "register_operand" "=v")
> +  [(set (match_operand:VI4F_256 0 "register_operand" "=x,v")
> (vec_select:VI4F_256
>   (vec_concat:
> -   (match_operand:VI4F_256 1 "register_operand" "v")
> -   (match_operand:VI4F_256 2 "nonimmediate_operand" "vm"))
> +   (match_operand:VI4F_256 1 "register_operand" "x,v")
> +   (match_operand:VI4F_256 2 "nonimmediate_operand" "xm,vm"))
>   (parallel [(match_operand 3 "const_0_to_7_operand")
>  (match_operand 4 "const_0_to_7_operand")
>  (match_operand 5 "const_0_to_7_operand")
> @@ -19414,7 +19414,7 @@ (define_insn "avx512vl_shuf_mask |= (INTVAL (operands[7]) - 8) / 4 << 1;
>operands[3] = GEN_INT (mask);
>
> -  if (INTVAL (operands[3]) == 2 && !)
> +  if (INTVAL (operands[3]) == 2 && ! && which_alternative == 0)
>  return "vblendps\t{$240, %2, %1, %0|%0, %1, %2, 240}";
>
>return "vshuf32x4\t{%3, %2, %1, 
> %0|%0, %1, %2, %3}";
> --- gcc/testsuite/gcc.target/i386/avx512vl-pr112435.c.jj   2023-11-10 16:04:21.708046771 +0100
> +++ gcc/testsuite/gcc.target/i386/avx512vl-pr112435.c   2023-11-10 16:03:51.053479094 +0100
> @@ -0,0 +1,13 @@
> +/* PR target/112435 */
> +/* { dg-do assemble { target { avx512vl && { ! ia32 } } } } */
> +/* { dg-options "-mavx512vl -O2" } */
> +
> +#include 
> +
> +__m256i
> +foo (__m256i a, __m256i b)
> +{
> +  register __m256i c __asm__("ymm16") = a;
> +  asm ("" : "+v" (c));
> +  return _mm256_shuffle_i32x4 (c, b, 2);
> +}
>
> Jakub
>


-- 
BR,
Hongtao


[PATCH 2/1] c++/modules: Allow exporting a typedef redeclaration

2023-11-12 Thread Nathaniel Shead
I happened to be browsing the standard a bit later and noticed that we
incorrectly reject the example given below.

Bootstrapped on x86_64-pc-linux-gnu; regtesting ongoing but modules.exp
completed with no errors.

-- >8 --

A typedef doesn't create a new entity, and thus should be allowed to be
exported even if it has been previously declared un-exported. See the
example in [module.interface] p6:

  export module M;
  struct S { int n; };
  typedef S S;
  export typedef S S; // OK, does not redeclare an entity

PR c++/102341

gcc/cp/ChangeLog:

* decl.cc (duplicate_decls): Allow exporting a redeclaration of
a typedef.

gcc/testsuite/ChangeLog:

* g++.dg/modules/export-1.C: Adjust test.

Signed-off-by: Nathaniel Shead 
---
 gcc/cp/decl.cc  | 5 -
 gcc/testsuite/g++.dg/modules/export-1.C | 6 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index bde9bd79d58..5e175d3e835 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -2231,7 +2231,10 @@ duplicate_decls (tree newdecl, tree olddecl, bool 
hiding, bool was_hidden)
}
 
   tree not_tmpl = STRIP_TEMPLATE (olddecl);
-  if (DECL_LANG_SPECIFIC (not_tmpl) && DECL_MODULE_ATTACH_P (not_tmpl))
+  if (DECL_LANG_SPECIFIC (not_tmpl)
+ && DECL_MODULE_ATTACH_P (not_tmpl)
+ /* Typedefs are not entities and so can be exported later.  */
+ && TREE_CODE (olddecl) != TYPE_DECL)
{
  if (DECL_MODULE_EXPORT_P (STRIP_TEMPLATE (newdecl))
  && !DECL_MODULE_EXPORT_P (not_tmpl))
diff --git a/gcc/testsuite/g++.dg/modules/export-1.C 
b/gcc/testsuite/g++.dg/modules/export-1.C
index 3f93814d270..598814370ec 100644
--- a/gcc/testsuite/g++.dg/modules/export-1.C
+++ b/gcc/testsuite/g++.dg/modules/export-1.C
@@ -9,8 +9,12 @@ export int x (); // { dg-error "conflicting exporting for 
declaration" }
 int y;
 export extern int y; // { dg-error "conflicting exporting for declaration" }
 
+// A typedef is not an entity so the following is OK; see [module.interface] 
example 4
 typedef int z;
-export typedef int z; // { dg-error "conflicting exporting for declaration" }
+export typedef int z; // { dg-bogus "conflicting exporting for declaration" }
+
+template  using w = T;
+export template  using w = T;  // { dg-error "conflicting 
exporting for declaration" }
 
 template  int f (T);
 export template  int f (T); // { dg-error "conflicting exporting 
for declaration" }
-- 
2.42.0



[PATCH 3/4] c-family, C++: Handle clang attributes [PR109877].

2023-11-12 Thread Iain Sandoe
This adds the ability to defer the validation of numeric attribute
arguments until the sequence is parsed if the attribute being
handled is one known to be 'clang form'.

We do this by considering the arguments to be strings regardless
of content and defer the interpretation of those strings until the
argument processing.

Since the C++ front end lexes tokens separately (and would report
non-compliant numeric tokens before we can intercept them), we need
to implement a small state machine to track the lexing of attributes.

PR c++/109877

gcc/cp/ChangeLog:

* parser.cc (enum clang_attr_state): New.
(cp_lexer_attribute_state): New.
(cp_lexer_new_main): Set initial attribute lexing state.
(cp_lexer_get_preprocessor_token): Handle lexing of clang-
form attributes.
(cp_parser_clang_attribute): Handle clang-form attributes.
(cp_parser_gnu_attribute_list): Switch to clang-form parsing
where needed.
* parser.h : New flag to signal lexing clang-form attributes.

Signed-off-by: Iain Sandoe 
---
 gcc/cp/parser.cc | 230 +--
 gcc/cp/parser.h  |   3 +
 2 files changed, 227 insertions(+), 6 deletions(-)

diff --git a/gcc/cp/parser.cc b/gcc/cp/parser.cc
index 5116bcb78f6..c12f473f2e3 100644
--- a/gcc/cp/parser.cc
+++ b/gcc/cp/parser.cc
@@ -699,6 +699,91 @@ cp_lexer_handle_early_pragma (cp_lexer *lexer)
 static cp_parser *cp_parser_new (cp_lexer *);
 static GTY (()) cp_parser *the_parser;
 
+/* Context-sensitive parse-checking for clang-style attributes.  */
+
+enum clang_attr_state {
+  CA_NONE = 0,
+  CA_ATTR,
+  CA_BR1, CA_BR2,
+  CA_LIST,
+  CA_LIST_ARGS,
+  CA_IS_CA,
+  CA_CA_ARGS,
+  CA_LIST_CONT
+};
+
+/* State machine tracking context of attribute lexing.  */
+
+static enum clang_attr_state
+cp_lexer_attribute_state (cp_token& token, enum clang_attr_state attr_state)
+{
+  /* Implement a context-sensitive parser for clang attributes.
+ We detect __attribute__((clang_style_attribute (ARGS))) and lex the
+ args ARGS with the following differences from GNU attributes:
+   (a) number-like values are lexed as strings [this allows lexing XX.YY.ZZ
+  version numbers].
+   (b) we concatenate strings, since clang attributes allow this too.  */
+  switch (attr_state)
+{
+case CA_NONE:
+  if (token.type == CPP_KEYWORD
+ && token.keyword == RID_ATTRIBUTE)
+   attr_state = CA_ATTR;
+  break;
+case CA_ATTR:
+  if (token.type == CPP_OPEN_PAREN)
+   attr_state = CA_BR1;
+  else
+   attr_state = CA_NONE;
+  break;
+case CA_BR1:
+  if (token.type == CPP_OPEN_PAREN)
+   attr_state = CA_BR2;
+  else
+   attr_state = CA_NONE;
+  break;
+case CA_BR2:
+  if (token.type == CPP_NAME)
+   {
+ tree identifier = (token.type == CPP_KEYWORD)
+   /* For keywords, use the canonical spelling, not the
+  parsed identifier.  */
+   ? ridpointers[(int) token.keyword]
+   : token.u.value;
+ identifier = canonicalize_attr_name (identifier);
+ if (attribute_clang_form_p (identifier))
+   attr_state = CA_IS_CA;
+ else
+   attr_state = CA_LIST;
+   }
+  else
+   attr_state = CA_NONE;
+  break;
+case CA_IS_CA:
+case CA_LIST:
+  if (token.type == CPP_COMMA)
+   attr_state = CA_BR2; /* Back to the list outer.  */
+  else if (token.type == CPP_OPEN_PAREN)
+   attr_state = attr_state == CA_IS_CA ? CA_CA_ARGS
+   : CA_LIST_ARGS;
+  else
+   attr_state = CA_NONE;
+  break;
+case CA_CA_ARGS: /* We will special-case args in this state.  */
+case CA_LIST_ARGS:
+  if (token.type == CPP_CLOSE_PAREN)
+   attr_state = CA_LIST_CONT;
+  break;
+case CA_LIST_CONT:
+  if (token.type == CPP_COMMA)
+   attr_state = CA_BR2; /* Back to the list outer.  */
+  else
+   attr_state = CA_NONE;
+  break;
+}
+  return attr_state;
+}
+
 /* Create a new main C++ lexer, the lexer that gets tokens from the
preprocessor, and also create the main parser.  */
 
@@ -715,6 +800,9 @@ cp_lexer_new_main (void)
   c_common_no_more_pch ();
 
   cp_lexer *lexer = cp_lexer_alloc ();
+  lexer->lex_number_as_string_p = false;
+  enum clang_attr_state attr_state = CA_NONE;
+
   /* Put the first token in the buffer.  */
   cp_token *tok = lexer->buffer->quick_push (token);
 
@@ -738,8 +826,20 @@ cp_lexer_new_main (void)
   if (tok->type == CPP_PRAGMA_EOL)
cp_lexer_handle_early_pragma (lexer);
 
+  attr_state = cp_lexer_attribute_state (*tok, attr_state);
   tok = vec_safe_push (lexer->buffer, cp_token ());
-  cp_lexer_get_preprocessor_token (C_LEX_STRING_NO_JOIN, tok);
+  unsigned int flags = C_LEX_STRING_NO_JOIN;
+  if (attr_state == CA_CA_ARGS)
+   {
+ /* We are handling a clang attribute;
+   

[PATCH 4/4] Darwin: Implement clang availability attribute [PR109877].

2023-11-12 Thread Iain Sandoe
This implements the handling of the clang-form "availability"
attribute, which is the most important case used in the macOS
SDKs.

PR  c++/109877

gcc/ChangeLog:

* config/darwin-protos.h
(darwin_handle_weak_import_attribute): New.
(darwin_handle_availability_attribute): New.
(darwin_attribute_takes_identifier_p): New.
* config/darwin.cc (objc_method_decl): New.
(enum version_components): New.
(parse_version): New.
(version_from_version_array): New.
(os_version_from_avail_value): New.
(NUM_AV_OSES, NUM_AV_CLAUSES): New.
(darwin_handle_availability_attribute): New.
(darwin_attribute_takes_identifier_p): New.
(darwin_override_options): New.
* config/darwin.h (TARGET_ATTRIBUTE_TAKES_IDENTIFIER_P): New.

Signed-off-by: Iain Sandoe 
---
 gcc/config/darwin-protos.h |   9 +-
 gcc/config/darwin.cc   | 343 +
 gcc/config/darwin.h|   7 +-
 3 files changed, 355 insertions(+), 4 deletions(-)

diff --git a/gcc/config/darwin-protos.h b/gcc/config/darwin-protos.h
index 9df358ee7d3..0702db25178 100644
--- a/gcc/config/darwin-protos.h
+++ b/gcc/config/darwin-protos.h
@@ -86,9 +86,12 @@ extern void darwin_asm_lto_end (void);
 extern void darwin_mark_decl_preserved (const char *);
 
 extern tree darwin_handle_kext_attribute (tree *, tree, tree, int, bool *);
-extern tree darwin_handle_weak_import_attribute (tree *node, tree name,
-tree args, int flags,
-bool * no_add_attrs);
+extern tree darwin_handle_weak_import_attribute (tree *, tree, tree, int,
+bool *);
+extern tree darwin_handle_availability_attribute (tree *, tree, tree,
+ int, bool *);
+extern bool darwin_attribute_takes_identifier_p (const_tree);
+
 extern void machopic_output_stub (FILE *, const char *, const char *);
 extern void darwin_globalize_label (FILE *, const char *);
 extern void darwin_assemble_visibility (tree, int);
diff --git a/gcc/config/darwin.cc b/gcc/config/darwin.cc
index 621a94d74a2..8bb4a439996 100644
--- a/gcc/config/darwin.cc
+++ b/gcc/config/darwin.cc
@@ -29,6 +29,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfghooks.h"
 #include "df.h"
 #include "memmodel.h"
+#include "c-family/c-common.h"  /* enum rid.  */
 #include "tm_p.h"
 #include "stringpool.h"
 #include "attribs.h"
@@ -49,6 +50,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "optabs.h"
 #include "flags.h"
 #include "opts.h"
+#include "c-family/c-objc.h"/* for objc_method_decl().  */
 
 /* Fix and Continue.
 
@@ -102,6 +104,7 @@ int darwin_running_cxx;
 
 /* Some code-gen now depends on OS major version numbers (at least).  */
 int generating_for_darwin_version ;
+unsigned long current_os_version = 0;
 
 /* For older linkers we need to emit special sections (marked 'coalesced') for
for weak or single-definition items.  */
@@ -2181,6 +2184,96 @@ darwin_handle_kext_attribute (tree *node, tree name,
   return NULL_TREE;
 }
 
+enum version_components { MAJOR, MINOR, TINY };
+
+/* Parse a version number in x.y.z form and validate it as a macOS
+   version.  Ideally, we'd put this in a common place usable by the
+   Darwin backend.  */
+
+static bool
+parse_version (unsigned version_array[3], const char *version_str)
+{
+  size_t version_len;
+  char *end;
+
+  version_len = strlen (version_str);
+  if (version_len < 1)
+return false;
+
+  /* Version string must consist of digits and periods only.  */
+  if (strspn (version_str, "0123456789.") != version_len)
+return false;
+
+  if (!ISDIGIT (version_str[0]) || !ISDIGIT (version_str[version_len - 1]))
+return false;
+
+  version_array[MAJOR] = strtoul (version_str, &end, 10);
+  version_str = end + ((*end == '.') ? 1 : 0);
+  if (version_array[MAJOR]  > 99)
+return false;
+
+  /* Version string must not contain adjacent periods.  */
+  if (*version_str == '.')
+return false;
+
+  version_array[MINOR] = strtoul (version_str, &end, 10);
+  version_str = end + ((*end == '.') ? 1 : 0);
+  if (version_array[MINOR]  > 99)
+return false;
+
+  version_array[TINY] = strtoul (version_str, &end, 10);
+  if (version_array[TINY]  > 99)
+return false;
+
+  /* Version string must contain no more than three tokens.  */
+  if (*end != '\0')
+return false;
+
+  return true;
+}
+
+/* Turn a version expressed as maj.min.tiny into an unsigned long
+   integer representing the value used in macOS availability macros.  */
+
+static unsigned long
+version_from_version_array (unsigned vers[3])
+{
+  unsigned long res = 0;
+  /* Here, we follow the 'modern' / 'legacy' numbering scheme for versions.  */
+  if (vers[0] > 10 || vers[1] >= 10)
+res = vers[0] * 1 + vers[1] * 100 + vers[2];
+  else
+{
+  res = vers[0] * 1
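
For readers who want to try the x.y.z validation outside of GCC, here is a
self-contained sketch of the parse_version logic shown in the patch above.
It assumes only that GCC's ISDIGIT macro can be swapped for isdigit from
<ctype.h>; everything else follows the quoted code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

enum version_components { MAJOR, MINOR, TINY };

/* Parse "x", "x.y" or "x.y.z" into VERSION_ARRAY; each component must be
   at most 99.  Returns false for anything else.  */
static bool
parse_version (unsigned version_array[3], const char *version_str)
{
  char *end;
  size_t version_len = strlen (version_str);

  if (version_len < 1)
    return false;

  /* Version string must consist of digits and periods only.  */
  if (strspn (version_str, "0123456789.") != version_len)
    return false;

  /* It must start and end with a digit (isdigit replaces ISDIGIT).  */
  if (!isdigit ((unsigned char) version_str[0])
      || !isdigit ((unsigned char) version_str[version_len - 1]))
    return false;

  version_array[MAJOR] = strtoul (version_str, &end, 10);
  version_str = end + ((*end == '.') ? 1 : 0);
  if (version_array[MAJOR] > 99)
    return false;

  /* Version string must not contain adjacent periods.  */
  if (*version_str == '.')
    return false;

  version_array[MINOR] = strtoul (version_str, &end, 10);
  version_str = end + ((*end == '.') ? 1 : 0);
  if (version_array[MINOR] > 99)
    return false;

  version_array[TINY] = strtoul (version_str, &end, 10);
  if (version_array[TINY] > 99)
    return false;

  /* Version string must contain no more than three tokens.  */
  return *end == '\0';
}
```

Under this sketch "10.15.7" parses to {10, 15, 7} and "11" to {11, 0, 0},
while "10..1", "1.2.3.4", "100.0" and ".1" are all rejected.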

[PATCH 2/4] c-family, C: handle clang attributes [PR109877].

2023-11-12 Thread Iain Sandoe
This adds the ability to defer the validation of numeric attribute
arguments until the sequence is parsed if the attribute being
handled is one known to be 'clang form'.

We do this by considering the arguments to be strings regardless
of content and defer the interpretation of those strings until the
argument processing.

PR c++/109877

gcc/c-family/ChangeLog:

* c-lex.cc (c_lex_with_flags): Allow for the case where
we wish to defer interpretation of numeric values until
parse time.
* c-pragma.h (C_LEX_NUMBER_AS_STRING): New.

gcc/c/ChangeLog:

* c-parser.cc (struct c_parser): Provide a flag to notify
that argument parsing should return attribute arguments
as string constants.
(c_lex_one_token): Act to defer numeric value validation.
(c_parser_clang_attribute_arguments): New.
(c_parser_gnu_attribute): Allow for clang-form GNU-style
attributes.

Signed-off-by: Iain Sandoe 
---
 gcc/c-family/c-lex.cc   |  15 ++
 gcc/c-family/c-pragma.h |   3 ++
 gcc/c/c-parser.cc   | 109 ++--
 3 files changed, 122 insertions(+), 5 deletions(-)

diff --git a/gcc/c-family/c-lex.cc b/gcc/c-family/c-lex.cc
index 06c2453c89a..d535f5b460c 100644
--- a/gcc/c-family/c-lex.cc
+++ b/gcc/c-family/c-lex.cc
@@ -533,6 +533,21 @@ c_lex_with_flags (tree *value, location_t *loc, unsigned 
char *cpp_flags,
 
 case CPP_NUMBER:
   {
+   /* If the user wants number-like entities to be returned as a raw
+  string, then don't try to classify them, which emits unwanted
+  diagnostics.  */
+   if (lex_flags & C_LEX_NUMBER_AS_STRING)
+ {
+   /* build_string adds a trailing NUL at [len].  */
+   tree num_string = build_string (tok->val.str.len + 1,
+   (const char *) tok->val.str.text);
+   TREE_TYPE (num_string) = char_array_type_node;
+   *value = num_string;
+   /* We will effectively note this as CPP_N_INVALID, because we
+  made no checks here.  */
+   break;
+ }
+
const char *suffix = NULL;
unsigned int flags = cpp_classify_number (parse_in, tok, &suffix, *loc);
 
diff --git a/gcc/c-family/c-pragma.h b/gcc/c-family/c-pragma.h
index 98177913053..11cde74f9f0 100644
--- a/gcc/c-family/c-pragma.h
+++ b/gcc/c-family/c-pragma.h
@@ -276,6 +276,9 @@ extern void pragma_lex_discard_to_eol ();
 #define C_LEX_STRING_NO_JOIN 2 /* Do not concatenate strings
   nor translate them into execution
   character set.  */
+#define C_LEX_NUMBER_AS_STRING   4 /* Do not classify a number, but
+  instead return it as a raw
+  string.  */
 
 /* This is not actually available to pragma parsers.  It's merely a
convenient location to declare this function for c-lex, after
diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 703f9570dbc..16cc05d 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -217,6 +217,9 @@ struct GTY(()) c_parser {
  should translate them to the execution character set (false
  inside attributes).  */
   BOOL_BITFIELD translate_strings_p : 1;
+  /* True if we want to lex arbitrary number-like sequences as their
+ string representation.  */
+  BOOL_BITFIELD lex_number_as_string : 1;
 
   /* Objective-C specific parser/lexer information.  */
 
@@ -308,10 +311,10 @@ c_lex_one_token (c_parser *parser, c_token *token, bool 
raw = false)
 
   if (raw || vec_safe_length (parser->raw_tokens) == 0)
 {
+  int lex_flags = parser->lex_joined_string ? 0 : C_LEX_STRING_NO_JOIN;
+  lex_flags |= parser->lex_number_as_string ? C_LEX_NUMBER_AS_STRING : 0;
   token->type = c_lex_with_flags (&token->value, &token->location,
- &token->flags,
- (parser->lex_joined_string
-  ? 0 : C_LEX_STRING_NO_JOIN));
+ &token->flags, lex_flags);
   token->id_kind = C_ID_NONE;
   token->keyword = RID_MAX;
   token->pragma_kind = PRAGMA_NONE;
@@ -5210,6 +5213,98 @@ c_parser_gnu_attribute_any_word (c_parser *parser)
   return attr_name;
 }
 
+/* Handle parsing clang-form attribute arguments, where we need to adjust
+   the parsing rules to relate to a specific attribute.  */
+
+static tree
+c_parser_clang_attribute_arguments (c_parser *parser, tree /*attr_id*/)
+{
+  /* We can, if required, alter the parsing on the basis of the attribute.
+ At present, we handle the availability attr, where each entry can be :
+   identifier
+   identifier=N.MM.Z
+   identifier="string"
+   followed by ',' or ) for the last entry*/
+
+  tree attr_args = NULL_TREE;
+  if (c_parser_next_token_is (parser, CPP_NAME)
+  && c_parser_peek_token (pa

[PATCH 1/4] c-family: Add handling for clang-style attributes [PR109877].

2023-11-12 Thread Iain Sandoe
This patch set is not actually particularly new; I have been maintaining
it locally on Darwin branches and it has been tested on several versions
of Darwin both with and without Alex's __has_{feature, extension} patch.

This is one of the three most significant blockers to importing the macOS
SDKs properly, and cannot currently be fixincludes-ed (in fact it cannot
ever really, since the attribute is user-facing and so can be in end-user
code that we cannot fix).

OK for trunk?
thanks
Iain

--- 8< ---


The clang compiler supports essentially arbitrary, per-attribute, syntax and
token forms for attribute arguments.  This extends to the case where token
forms are required to be accepted that are not part of the valid set for
standard C or C++.

A motivating example (in the initial attribute of this form implemented
in this patch set) is version-style (i.e. x.y.z) numeric values.  At present
the c-family cannot handle this, since invalid numeric tokens are rejected
by both C and C++ frontends before we have a chance to decide to accept them
in custom attribute argument parsing.

The solution proposed in this patch series is to allow for a certain set of
attribute names that are known to be 'clang-form' and to defer argument
token validation until the parse of those arguments.

This does not apparently represent any loss of generality - since the
specific attribute names are already claimed by clang and re-using them with
different semantics in GCC would be a highly unfortunate experience for end-
users.

The first patch here adds a mechanism to check attribute identifiers against
a list known to be in clang form.  The 'availability' attribute is added as a
first example.

The acceptance of non-standard tokens is constrained to the interval enclosing
the attribute arguments of cases notified as 'clang-form'.

PR c++/109877

gcc/c-family/ChangeLog:

* c-attribs.cc (attribute_clang_form_p): New.
* c-common.h (attribute_clang_form_p): New.

Signed-off-by: Iain Sandoe 
---
 gcc/c-family/c-attribs.cc | 12 
 gcc/c-family/c-common.h   |  1 +
 2 files changed, 13 insertions(+)

diff --git a/gcc/c-family/c-attribs.cc b/gcc/c-family/c-attribs.cc
index 461732f60f7..8c087317f4f 100644
--- a/gcc/c-family/c-attribs.cc
+++ b/gcc/c-family/c-attribs.cc
@@ -615,6 +615,18 @@ attribute_takes_identifier_p (const_tree attr_id)
 return targetm.attribute_takes_identifier_p (attr_id);
 }
 
+/* Returns TRUE iff the attribute indicated by ATTR_ID needs its
+   arguments converted to string constants.  */
+
+bool
+attribute_clang_form_p (const_tree attr_id)
+{
+  const struct attribute_spec *spec = lookup_attribute_spec (attr_id);
+  if (spec && !strcmp ("availability", spec->name))
+return true;
+  return false;
+}
+
 /* Verify that argument value POS at position ARGNO to attribute NAME
applied to function FN (which is either a function declaration or function
type) refers to a function parameter at position POS and the expected type
diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index b57e83d7c5d..4dbc566d2b5 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -1535,6 +1535,7 @@ extern void check_for_xor_used_as_pow (location_t 
lhs_loc, tree lhs_val,
 /* In c-attribs.cc.  */
 extern bool attribute_takes_identifier_p (const_tree);
 extern tree handle_deprecated_attribute (tree *, tree, tree, int, bool *);
+extern bool attribute_clang_form_p (const_tree);
 extern tree handle_unused_attribute (tree *, tree, tree, int, bool *);
 extern tree handle_fallthrough_attribute (tree *, tree, tree, int, bool *);
 extern int parse_tm_stmt_attr (tree, int);
-- 
2.39.2 (Apple Git-143)
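
For context, this is roughly what a clang-form attribute with a version-style
argument looks like in user code. The sketch is an assumption-laden
illustration: the macro name AVAILABLE_SINCE_10_12_1 and the function new_api
are made up, and the attribute is guarded with __has_attribute so that
compilers without "availability" support never see the x.y.z token outside
the preprocessor.

```c
#include <assert.h>

/* Only use the attribute where the compiler claims to support it.  The
   version 10.12.1 is a valid preprocessing number, but it is not a valid
   C numeric constant, which is why argument validation has to be
   deferred.  */
#if defined (__has_attribute)
# if __has_attribute (availability)
#  define AVAILABLE_SINCE_10_12_1 \
     __attribute__ ((availability (macos, introduced=10.12.1)))
# endif
#endif
#ifndef AVAILABLE_SINCE_10_12_1
# define AVAILABLE_SINCE_10_12_1
#endif

/* Hypothetical API declared as introduced in macOS 10.12.1.  */
AVAILABLE_SINCE_10_12_1 int new_api (void);

int
new_api (void)
{
  return 42;
}
```

The attribute does not change what new_api returns; it only carries
availability metadata for compilers that understand it.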



[no subject]

2023-11-12 Thread Iain Sandoe
This patch set is not actually particularly new; I have been maintaining
it locally on Darwin branches and it has been tested on several versions
of Darwin both with and without Alex's __has_{feature, extension} patch.

This is one of the three most significant blockers to importing the macOS
SDKs properly, and cannot currently be fixincludes-ed (in fact it cannot
ever really, since the attribute is user-facing and so can be in end-user
code that we cannot fix).

OK for trunk?
thanks
Iain
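
For readers unfamiliar with the attribute in question, here is a minimal
sketch of how SDK-style headers and end-user code typically spell it.  The
macro name API_AVAILABLE_SKETCH and the function sdk_style_function are
hypothetical and appear neither in the patch nor in the macOS SDK.  The
guard assumes a compiler that provides __has_attribute; where the
attribute is not supported, the macro expands to nothing.

```c
/* Sketch only: API_AVAILABLE_SKETCH and sdk_style_function are
   hypothetical names used for illustration.  */
#if defined(__has_attribute)
# if __has_attribute(availability)
#  define API_AVAILABLE_SKETCH \
     __attribute__((availability(macos, introduced=10.12)))
# endif
#endif
#ifndef API_AVAILABLE_SKETCH
# define API_AVAILABLE_SKETCH /* attribute unsupported: expand to nothing */
#endif

/* A declaration in the style that SDK headers and user code use.  */
API_AVAILABLE_SKETCH int sdk_style_function (int x);

int
sdk_style_function (int x)
{
  return x + 1;
}
```

Because the attribute can be written directly in user code like this, the
header-only approach of fixincludes cannot cover every occurrence.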




Re: [PATCH] LoongArch: Use simplify_gen_subreg instead of gen_rtx_SUBREG in loongarch_expand_vec_cond_mask_expr [PR112476]

2023-11-12 Thread chenglulu


On 2023/11/12 at 9:00 AM, Xi Ruoyao wrote:

The GCC internals documentation says:

 'subreg's of 'subreg's are not supported.  Using
 'simplify_gen_subreg' is the recommended way to avoid this problem.

Unfortunately loongarch_expand_vec_cond_mask_expr might create a nested
subreg under certain circumstances, causing an ICE.

Use simplify_gen_subreg as the internals documentation suggests.
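
The difference between always wrapping and folding can be pictured with a
toy model.  This is only a sketch under strong assumptions: rtx_sketch,
wrap_subreg and fold_subreg are hypothetical stand-ins, not GCC's rtx or
its real functions, and the model ignores machine modes and the validity
checks the real code performs.

```c
#include <stddef.h>

/* Toy stand-in for an RTL expression; not GCC's real rtx.  */
enum kind_sketch { REG_SKETCH, SUBREG_SKETCH };

struct rtx_sketch
{
  enum kind_sketch kind;
  const struct rtx_sketch *inner; /* operand of a SUBREG, NULL for a REG */
  unsigned byte_offset;           /* offset into INNER, SUBREG only */
};

/* Models unconditional wrapping: nests even if OP is already a subreg.  */
static struct rtx_sketch
wrap_subreg (const struct rtx_sketch *op, unsigned byte_offset)
{
  struct rtx_sketch r = { SUBREG_SKETCH, op, byte_offset };
  return r;
}

/* Models the folding idea: a subreg of a subreg becomes a single subreg
   of the innermost operand, with the byte offsets added together.  */
static struct rtx_sketch
fold_subreg (const struct rtx_sketch *op, unsigned byte_offset)
{
  if (op->kind == SUBREG_SKETCH)
    return wrap_subreg (op->inner, op->byte_offset + byte_offset);
  return wrap_subreg (op, byte_offset);
}

/* Count how many SUBREG layers sit on top of the register.  */
static int
subreg_depth (const struct rtx_sketch *x)
{
  int depth = 0;
  for (; x->kind == SUBREG_SKETCH; x = x->inner)
    depth++;
  return depth;
}

/* Wrapping twice produces the unsupported nested form (depth 2).  */
int
demo_wrapped_depth (void)
{
  struct rtx_sketch reg = { REG_SKETCH, NULL, 0 };
  struct rtx_sketch inner = wrap_subreg (&reg, 0);
  struct rtx_sketch outer = wrap_subreg (&inner, 0);
  return subreg_depth (&outer);
}

/* Folding keeps a single layer (depth 1).  */
int
demo_folded_depth (void)
{
  struct rtx_sketch reg = { REG_SKETCH, NULL, 0 };
  struct rtx_sketch inner = wrap_subreg (&reg, 0);
  struct rtx_sketch outer = fold_subreg (&inner, 0);
  return subreg_depth (&outer);
}
```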


 * A similar problem was fixed once before on LoongArch :-(. Thank you
   for the fix.


gcc/ChangeLog:

PR target/112476
* config/loongarch/loongarch.cc
(loongarch_expand_vec_cond_mask_expr): Call simplify_gen_subreg
instead of gen_rtx_SUBREG.

gcc/testsuite/ChangeLog:

PR target/112476
* gcc.target/loongarch/pr112476-1.c: New test.
* gcc.target/loongarch/pr112476-2.c: New test.
---

Bootstrapped and regtested on loongarch64-linux-gnu.  Ok for trunk?

  gcc/config/loongarch/loongarch.cc | 11 ++---
  .../gcc.target/loongarch/pr112476-1.c | 24 +++
  .../gcc.target/loongarch/pr112476-2.c |  5 
  3 files changed, 37 insertions(+), 3 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/loongarch/pr112476-1.c
  create mode 100644 gcc/testsuite/gcc.target/loongarch/pr112476-2.c

diff --git a/gcc/config/loongarch/loongarch.cc 
b/gcc/config/loongarch/loongarch.cc
index d9b7a1076a2..0c7bafb5fb1 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -11197,7 +11197,9 @@ loongarch_expand_vec_cond_mask_expr (machine_mode mode, 
machine_mode vimode,
  if (mode != vimode)
{
  xop1 = gen_reg_rtx (vimode);
- emit_move_insn (xop1, gen_rtx_SUBREG (vimode, operands[1], 0));
+ emit_move_insn (xop1,
+ simplify_gen_subreg (vimode, operands[1],
+  mode, 0));
}
  emit_move_insn (src1, xop1);
}
@@ -11214,7 +11216,9 @@ loongarch_expand_vec_cond_mask_expr (machine_mode mode, 
machine_mode vimode,
  if (mode != vimode)
{
  xop2 = gen_reg_rtx (vimode);
- emit_move_insn (xop2, gen_rtx_SUBREG (vimode, operands[2], 0));
+ emit_move_insn (xop2,
+ simplify_gen_subreg (vimode, operands[2],
+  mode, 0));
}
  emit_move_insn (src2, xop2);
}
@@ -11233,7 +11237,8 @@ loongarch_expand_vec_cond_mask_expr (machine_mode mode, 
machine_mode vimode,
  gen_rtx_AND (vimode, mask, src1));
/* The result is placed back to a register with the mask.  */
emit_insn (gen_rtx_SET (mask, bsel));
-  emit_move_insn (operands[0], gen_rtx_SUBREG (mode, mask, 0));
+  emit_move_insn (operands[0], simplify_gen_subreg (mode, mask,
+   vimode, 0));
  }
  }
  
diff --git a/gcc/testsuite/gcc.target/loongarch/pr112476-1.c b/gcc/testsuite/gcc.target/loongarch/pr112476-1.c
new file mode 100644
index 000..4cf133e7a26
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/pr112476-1.c
@@ -0,0 +1,24 @@
+/* PR target/112476: ICE with -mlsx */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=loongarch64 -mfpu=64 -mabi=lp64d -mlsx" } */
+
+int foo, bar;
+float baz, res, a;
+
+void
+apply_adjacent_ternary (float *dst, float *src0)
+{
+  do
+{
+  __builtin_memcpy (&res, &src0, sizeof (res));
+  *dst = foo ? baz : res;
+  dst++;
+}
+  while (dst != src0);
+}
+
+void
+xx (void)
+{
+  apply_adjacent_ternary (&a, &a);
+}
diff --git a/gcc/testsuite/gcc.target/loongarch/pr112476-2.c 
b/gcc/testsuite/gcc.target/loongarch/pr112476-2.c
new file mode 100644
index 000..cc0dfbfc912
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/pr112476-2.c
@@ -0,0 +1,5 @@
+/* PR target/112476: ICE with -mlasx */
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=loongarch64 -mfpu=64 -mabi=lp64d -mlasx" } */
+
+#include "pr112476-1.c"


Re: [PATCH 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread Lehua Ding




On 2023/11/13 9:11, juzhe.zh...@rivai.ai wrote:

Ah, nice!  How configurable are the bit ranges?

I think Lehua's patch is configurable for bit ranges,
since it allows targets to track subreg liveness flexibly
according to REGMODE_NATURAL_SIZE.


+/* Return true if REGNO is a pseudo and MODE is a multil regs size.  */
+bool
+need_track_subreg (int regno, machine_mode reg_mode)
+{
+  poly_int64 total_size = GET_MODE_SIZE (reg_mode);
+  poly_int64 natural_size = REGMODE_NATURAL_SIZE (reg_mode);
+  return maybe_gt (total_size, natural_size)
+&& multiple_p (total_size, natural_size)
+&& regno >= FIRST_PSEUDO_REGISTER;
+}

It depends on how the target configures the REGMODE_NATURAL_SIZE target hook.

If it returns the QImode size, his patch enables tracking bit ranges for
7-bit subregs.


Yes, the current subreg_ranges class provides
remove_range/add_range/remove_ranges/add_ranges interfaces to modify
ranges. Each subreg_range contains start and end fields representing the
half-open range [start, end). For the live_subreg problem, the value
returned by REGMODE_NATURAL_SIZE is used as the unit; for bit tracking
like Jeff's approach, a single bit can be used as the unit instead.
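
As a rough illustration of the two ideas above (when a register needs
tracking, and what half-open unit ranges look like), here is a small
sketch.  It is not the patch's code: sizes are plain integers rather than
poly_int64, FIRST_PSEUDO_SKETCH is a made-up value, and a 32-bit mask
replaces the subreg_ranges class, whose internal representation is not
shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for FIRST_PSEUDO_REGISTER.  */
#define FIRST_PSEUDO_SKETCH 64

/* Mirrors the shape of need_track_subreg with fixed-size integers:
   track only pseudos whose mode spans several natural-size units.  */
bool
need_track_sketch (int regno, unsigned total_size, unsigned natural_size)
{
  return natural_size != 0
	 && total_size > natural_size
	 && total_size % natural_size == 0
	 && regno >= FIRST_PSEUDO_SKETCH;
}

/* Bit i set means unit i is live.  One unit is whatever granularity the
   target chose (a natural-size chunk, or a single bit).  */
typedef uint32_t live_units;

/* Mask for the half-open range [start, end); needs start <= end <= 32.  */
static live_units
range_mask (unsigned start, unsigned end)
{
  live_units hi = end >= 32 ? UINT32_MAX : (UINT32_C (1) << end) - 1;
  live_units lo = start >= 32 ? UINT32_MAX : (UINT32_C (1) << start) - 1;
  return hi & ~lo;
}

/* Mark units [start, end) live.  */
live_units
add_range_sketch (live_units live, unsigned start, unsigned end)
{
  return live | range_mask (start, end);
}

/* Mark units [start, end) dead.  */
live_units
remove_range_sketch (live_units live, unsigned start, unsigned end)
{
  return live & ~range_mask (start, end);
}
```

With the natural size as the unit, a four-unit register that is fully
live and then has units 1 and 2 killed ends up with only units 0 and 3
live.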


--
Best,
Lehua (RiVAI)
lehua.d...@rivai.ai



Re: [PATCH v2] LoongArch: Optimize single-used address with -mexplicit-relocs=auto for fld/fst

2023-11-12 Thread chenglulu



On 2023/11/11 at 6:58 PM, Xi Ruoyao wrote:

fld and fst have the same addressing mode as ld.w and st.w, so the same
optimization as r14-4851 should be applied to them too.

gcc/ChangeLog:

* config/loongarch/loongarch.md (LD_AT_LEAST_32_BIT): New mode
iterator.
(ST_ANY): New mode iterator.
(define_peephole2): Use LD_AT_LEAST_32_BIT instead of GPR and
ST_ANY instead of QHWD for applicable patterns.
---

v1: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635278.html

Changes from v1: take advantage of r14-5329 "Allow md iterators to
include other iterators" to simplify LD_AT_LEAST_32_BIT and ST_ANY.

Ok for trunk?


LGTM!

Thanks!



  gcc/config/loongarch/loongarch.md | 38 +++
  1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/gcc/config/loongarch/loongarch.md 
b/gcc/config/loongarch/loongarch.md
index 4dd716e1941..22814a3679c 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -400,6 +400,14 @@ (define_mode_iterator SPLITF
 (DI "!TARGET_64BIT && TARGET_DOUBLE_FLOAT")
 (TF "TARGET_64BIT && TARGET_DOUBLE_FLOAT")])
  
+;; A mode for anything with 32 bits or more, and able to be loaded with
+;; the same addressing mode as ld.w.
+(define_mode_iterator LD_AT_LEAST_32_BIT [GPR ANYF])
+
+;; A mode for anything able to be stored with the same addressing mode as
+;; st.w.
+(define_mode_iterator ST_ANY [QHWD ANYF])
+
  ;; In GPR templates, a string like "mul." will expand to "mul.w" in the
  ;; 32-bit version and "mul.d" in the 64-bit version.
  (define_mode_attr d [(SI "w") (DI "d")])
@@ -3785,13 +3793,14 @@ (define_insn "loongarch_crcc_w__w"
  (define_peephole2
[(set (match_operand:P 0 "register_operand")
(match_operand:P 1 "symbolic_pcrel_operand"))
-   (set (match_operand:GPR 2 "register_operand")
-   (mem:GPR (match_dup 0)))]
+   (set (match_operand:LD_AT_LEAST_32_BIT 2 "register_operand")
+   (mem:LD_AT_LEAST_32_BIT (match_dup 0)))]
"la_opt_explicit_relocs == EXPLICIT_RELOCS_AUTO \
 && (TARGET_CMODEL_NORMAL || TARGET_CMODEL_MEDIUM) \
 && (peep2_reg_dead_p (2, operands[0]) \
 || REGNO (operands[0]) == REGNO (operands[2]))"
-  [(set (match_dup 2) (mem:GPR (lo_sum:P (match_dup 0) (match_dup 1]
+  [(set (match_dup 2)
+   (mem:LD_AT_LEAST_32_BIT (lo_sum:P (match_dup 0) (match_dup 1]
{
  emit_insn (gen_pcalau12i_gr (operands[0], operands[1]));
})
@@ -3799,14 +3808,15 @@ (define_peephole2
  (define_peephole2
[(set (match_operand:P 0 "register_operand")
(match_operand:P 1 "symbolic_pcrel_operand"))
-   (set (match_operand:GPR 2 "register_operand")
-   (mem:GPR (plus (match_dup 0)
-  (match_operand 3 "const_int_operand"]
+   (set (match_operand:LD_AT_LEAST_32_BIT 2 "register_operand")
+   (mem:LD_AT_LEAST_32_BIT (plus (match_dup 0)
+   (match_operand 3 "const_int_operand"]
"la_opt_explicit_relocs == EXPLICIT_RELOCS_AUTO \
 && (TARGET_CMODEL_NORMAL || TARGET_CMODEL_MEDIUM) \
 && (peep2_reg_dead_p (2, operands[0]) \
 || REGNO (operands[0]) == REGNO (operands[2]))"
-  [(set (match_dup 2) (mem:GPR (lo_sum:P (match_dup 0) (match_dup 1]
+  [(set (match_dup 2)
+   (mem:LD_AT_LEAST_32_BIT (lo_sum:P (match_dup 0) (match_dup 1]
{
  operands[1] = plus_constant (Pmode, operands[1], INTVAL (operands[3]));
  emit_insn (gen_pcalau12i_gr (operands[0], operands[1]));
@@ -3850,13 +3860,13 @@ (define_peephole2
  (define_peephole2
[(set (match_operand:P 0 "register_operand")
(match_operand:P 1 "symbolic_pcrel_operand"))
-   (set (mem:QHWD (match_dup 0))
-   (match_operand:QHWD 2 "register_operand"))]
+   (set (mem:ST_ANY (match_dup 0))
+   (match_operand:ST_ANY 2 "register_operand"))]
"la_opt_explicit_relocs == EXPLICIT_RELOCS_AUTO \
 && (TARGET_CMODEL_NORMAL || TARGET_CMODEL_MEDIUM) \
 && (peep2_reg_dead_p (2, operands[0])) \
 && REGNO (operands[0]) != REGNO (operands[2])"
-  [(set (mem:QHWD (lo_sum:P (match_dup 0) (match_dup 1))) (match_dup 2))]
+  [(set (mem:ST_ANY (lo_sum:P (match_dup 0) (match_dup 1))) (match_dup 2))]
{
  emit_insn (gen_pcalau12i_gr (operands[0], operands[1]));
})
@@ -3864,14 +3874,14 @@ (define_peephole2
  (define_peephole2
[(set (match_operand:P 0 "register_operand")
(match_operand:P 1 "symbolic_pcrel_operand"))
-   (set (mem:QHWD (plus (match_dup 0)
-   (match_operand 3 "const_int_operand")))
-   (match_operand:QHWD 2 "register_operand"))]
+   (set (mem:ST_ANY (plus (match_dup 0)
+ (match_operand 3 "const_int_operand")))
+   (match_operand:ST_ANY 2 "register_operand"))]
"la_opt_explicit_relocs == EXPLICIT_RELOCS_AUTO \
 && (TARGET_CMODEL_NORMAL || TARGET_CMODEL_MEDIUM) \
 && (peep2_reg_dead_p (2, operands[0])) \
 && REGNO (operands[0]) != REGNO (operands[2])"
-  [(set (mem:QHWD (

RE: [PATCH v2] DSE: Allow vector type for get_stored_val when read < store

2023-11-12 Thread Li, Pan2
v4 has been posted at the link below; please ignore v3.

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636216.html

Sorry for the inconvenience.

Pan

-Original Message-
From: Li, Pan2 
Sent: Sunday, November 12, 2023 10:31 AM
To: Richard Sandiford ; Jeff Law 

Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; Wang, Yanzhang 
; kito.ch...@gmail.com; richard.guent...@gmail.com
Subject: RE: [PATCH v2] DSE: Allow vector type for get_stored_val when read < 
store

Thanks Richard S and Jeff for comments.

> Did you want to use known_le so that you'd pick up the case when the two 
> modes are the same size?  Or was known_lt the test you really wanted 
> (and if so, why).

I used known_lt in v2 so that the equal-size case still goes through the
original code path.
I just tried known_le and got various ICEs during testing; I suspect they
are related to the latent bug that Richard S mentioned.

> instead.  Alternatively, we could remove the is_constant condition
> and fix PR87815 in a different way, e.g. by protecting the
> smallest_int_mode_for_size with a tighter condition.  That might
> allow a similar DSE optimisation to this patch for nonzero offsets,
> thanks to:

Thus, it looks like we should fix PR87815 in the way Richard S suggested
before we take known_le for vectors here.

I will give it a try soon and keep you posted.

Pan

-Original Message-
From: Richard Sandiford  
Sent: Saturday, November 11, 2023 11:23 PM
To: Jeff Law 
Cc: Li, Pan2 ; gcc-patches@gcc.gnu.org; 
juzhe.zh...@rivai.ai; Wang, Yanzhang ; 
kito.ch...@gmail.com; richard.guent...@gmail.com
Subject: Re: [PATCH v2] DSE: Allow vector type for get_stored_val when read < 
store

Jeff Law  writes:
> On 11/8/23 23:08, pan2...@intel.com wrote:
>> From: Pan Li 
>> 
>> Update in v2:
>> * Move vector type support to get_stored_val.
>> 
>> Original log:
>> 
>> This patch would like to allow the vector mode in the
>> get_stored_val in the DSE. It is valid for the read
>> rtx if and only if the read bitsize is less than the
>> stored bitsize.
>> 
>> Given below example code with
>> --param=riscv-autovec-preference=fixed-vlmax.
>> 
>> vuint8m1_t test () {
>>uint8_t arr[32] = {
>>  1, 2, 7, 1, 3, 4, 5, 3, 1, 0, 1, 2, 4, 4, 9, 9,
>>  1, 2, 7, 1, 3, 4, 5, 3, 1, 0, 1, 2, 4, 4, 9, 9,
>>};
>> 
>>return __riscv_vle8_v_u8m1(arr, 32);
>> }
>> 
>> Before this patch:
>> test:
>>   lui     a5,%hi(.LANCHOR0)
>>   addi    sp,sp,-32
>>   addi    a5,a5,%lo(.LANCHOR0)
>>   li      a3,32
>>   vl2re64.v   v2,0(a5)
>>   vsetvli zero,a3,e8,m1,ta,ma
>>   vs2r.v  v2,0(sp) <== Unnecessary store to stack
>>   vle8.v  v1,0(sp) <== Ditto
>>   vs1r.v  v1,0(a0)
>>   addi    sp,sp,32
>>   jr      ra
>> 
>> After this patch:
>> test:
>>   lui     a5,%hi(.LANCHOR0)
>>   addi    a5,a5,%lo(.LANCHOR0)
>>   li      a4,32
>>   addi    sp,sp,-32
>>   vsetvli zero,a4,e8,m1,ta,ma
>>   vle8.v  v1,0(a5)
>>   vs1r.v  v1,0(a0)
>>   addi    sp,sp,32
>>   jr      ra
>> 
>> Below tests are passed within this patch:
>> 
>> * The x86 bootstrap and regression test.
>> * The aarch64 regression test.
>> * The risc-v regression test.
>> 
>>  PR target/111720
>> 
>> gcc/ChangeLog:
>> 
>>  * dse.cc (get_stored_val): Allow vector mode if the read
>>  bitsize is less than stored bitsize.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>>  * gcc.target/riscv/rvv/base/pr111720-0.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-1.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-10.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-2.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-3.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-4.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-5.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-6.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-7.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-8.c: New test.
>>  * gcc.target/riscv/rvv/base/pr111720-9.c: New test.
> We're always getting the lowpart here AFAICT and it appears that all the 
> right thing should happen if gen_lowpart_common fails (it returns NULL, 
> which bubbles up and is the right return value from get_stored_val if it 
> can't be optimized).

Yeah, we should always be operating on the lowpart, but it looks
like there's a latent bug.  This check:

  if (gap.is_constant () && maybe_ne (gap, 0))
{
  ...
}
  else ...

means that we ignore the gap if it's a nonzero runtime value.
I guess it should be:

  if (maybe_ne (gap, 0))
{
  if (!gap.is_constant ())
return NULL_RTX;
  ...
}

instead.  Alternatively, we could remove the is_constant condition
and fix PR87815 in a different way, e.g. by protecting the
smallest_int_mode_for_size with a tighter condition.  That might
allow a similar DSE optimisation to this patch for nonzero offsets,
thanks to:

  if (multiple_p (shift, GET_MODE_BITSIZE (new_mode))
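
The latent bug described above can be reproduced in miniature.  The
following is a sketch, not dse.cc: gap_sketch models a poly_int64 as
c0 + c1 * N for a runtime N >= 0, and the enum names are hypothetical
labels for "take the shift path", "take the other path" and "give up".

```c
#include <stdbool.h>

/* Value is c0 + c1 * N, where N >= 0 is only known at run time.  */
struct gap_sketch
{
  long c0;
  long c1;
};

struct gap_sketch
gap_of (long c0, long c1)
{
  struct gap_sketch g = { c0, c1 };
  return g;
}

/* True if the value does not depend on N.  */
static bool
is_constant_sketch (struct gap_sketch g)
{
  return g.c1 == 0;
}

/* True if the value can differ from zero for some N.  */
static bool
maybe_ne_zero_sketch (struct gap_sketch g)
{
  return g.c0 != 0 || g.c1 != 0;
}

enum path_sketch { SHIFT_PATH, NO_SHIFT_PATH, GIVE_UP };

/* Old check: a nonzero runtime gap falls through to the no-shift path,
   which silently ignores the gap.  */
enum path_sketch
old_check (struct gap_sketch gap)
{
  if (is_constant_sketch (gap) && maybe_ne_zero_sketch (gap))
    return SHIFT_PATH;
  return NO_SHIFT_PATH;
}

/* Suggested check: a possibly nonzero gap that is not a compile-time
   constant makes the function give up (NULL_RTX in the real code).  */
enum path_sketch
new_check (struct gap_sketch gap)
{
  if (maybe_ne_zero_sketch (gap))
    {
      if (!is_constant_sketch (gap))
	return GIVE_UP;
      return SHIFT_PATH;
    }
  return NO_SHIFT_PATH;
}
```

The two checks agree for a zero gap and for a constant nonzero gap; they
differ only for a gap such as 16 * N, which is the case the old
condition mishandles.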
 

[PATCH v4] DSE: Allow vector type for get_stored_val when read < store

2023-11-12 Thread pan2 . li
From: Pan Li 

Update in v4:
* Merge upstream and removed some independent changes.

Update in v3:
* Take known_le instead of known_lt for vector size.
* Return NULL_RTX when gap is not equal 0 and not constant.
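
The known_lt versus known_le distinction in the v3 notes matters because
vector sizes may depend on a runtime factor.  A minimal sketch, assuming
a two-coefficient size c0 + c1 * N with N >= 0 (size_sketch and the
*_sketch functions are hypothetical names, modelled on the behaviour of
GCC's poly_int comparisons):

```c
#include <stdbool.h>

/* Size is c0 + c1 * N, where N >= 0 is only known at run time.  */
struct size_sketch
{
  long c0;
  long c1;
};

struct size_sketch
size_of_sketch (long c0, long c1)
{
  struct size_sketch s = { c0, c1 };
  return s;
}

/* a <= b for every N >= 0.  */
bool
known_le_sketch (struct size_sketch a, struct size_sketch b)
{
  return a.c0 <= b.c0 && a.c1 <= b.c1;
}

/* a < b for every N >= 0.  */
bool
known_lt_sketch (struct size_sketch a, struct size_sketch b)
{
  return a.c0 < b.c0 && a.c1 <= b.c1;
}
```

With equal read and store sizes, say 16 + 16 * N for both, the known_le
form holds while the known_lt form does not, so only the former lets the
same-size vector case through.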

Update in v2:
* Move vector type support to get_stored_val.

Original log:

This patch allows vector modes in get_stored_val in DSE.
This is valid for the read rtx if and only if the read
bitsize is less than the stored bitsize.

Given the example code below, compiled with
--param=riscv-autovec-preference=fixed-vlmax.

vuint8m1_t test () {
  uint8_t arr[32] = {
1, 2, 7, 1, 3, 4, 5, 3, 1, 0, 1, 2, 4, 4, 9, 9,
1, 2, 7, 1, 3, 4, 5, 3, 1, 0, 1, 2, 4, 4, 9, 9,
  };

  return __riscv_vle8_v_u8m1(arr, 32);
}

Before this patch:
test:
  lui     a5,%hi(.LANCHOR0)
  addi    sp,sp,-32
  addi    a5,a5,%lo(.LANCHOR0)
  li      a3,32
  vl2re64.v   v2,0(a5)
  vsetvli zero,a3,e8,m1,ta,ma
  vs2r.v  v2,0(sp) <== Unnecessary store to stack
  vle8.v  v1,0(sp) <== Ditto
  vs1r.v  v1,0(a0)
  addi    sp,sp,32
  jr      ra

After this patch:
test:
  lui     a5,%hi(.LANCHOR0)
  addi    a5,a5,%lo(.LANCHOR0)
  li      a4,32
  addi    sp,sp,-32
  vsetvli zero,a4,e8,m1,ta,ma
  vle8.v  v1,0(a5)
  vs1r.v  v1,0(a0)
  addi    sp,sp,32
  jr      ra

The following tests pass with this patch:
* The risc-v regression test.
* The x86 bootstrap and regression test.
* The aarch64 regression test.

PR target/111720

gcc/ChangeLog:

* dse.cc (get_stored_val): Allow vector mode if read size is
less than or equal to stored size.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/pr111720-0.c: New test.
* gcc.target/riscv/rvv/base/pr111720-1.c: New test.
* gcc.target/riscv/rvv/base/pr111720-10.c: New test.
* gcc.target/riscv/rvv/base/pr111720-2.c: New test.
* gcc.target/riscv/rvv/base/pr111720-3.c: New test.
* gcc.target/riscv/rvv/base/pr111720-4.c: New test.
* gcc.target/riscv/rvv/base/pr111720-5.c: New test.
* gcc.target/riscv/rvv/base/pr111720-6.c: New test.
* gcc.target/riscv/rvv/base/pr111720-7.c: New test.
* gcc.target/riscv/rvv/base/pr111720-8.c: New test.
* gcc.target/riscv/rvv/base/pr111720-9.c: New test.

Signed-off-by: Pan Li 
---
 gcc/dse.cc|  9 +++-
 .../gcc.target/riscv/rvv/base/pr111720-0.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-1.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-10.c   | 18 
 .../gcc.target/riscv/rvv/base/pr111720-2.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-3.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-4.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-5.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-6.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-7.c| 21 +++
 .../gcc.target/riscv/rvv/base/pr111720-8.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-9.c| 15 +
 12 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-10.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-3.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-4.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-5.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-6.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-7.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-8.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-9.c

diff --git a/gcc/dse.cc b/gcc/dse.cc
index 1a85dae1f8c..40c4c29d07e 100644
--- a/gcc/dse.cc
+++ b/gcc/dse.cc
@@ -1900,8 +1900,11 @@ get_stored_val (store_info *store_info, machine_mode 
read_mode,
   else
 gap = read_offset - store_info->offset;
 
-  if (gap.is_constant () && maybe_ne (gap, 0))
+  if (maybe_ne (gap, 0))
 {
+  if (!gap.is_constant ())
+   return NULL_RTX;
+
   poly_int64 shift = gap * BITS_PER_UNIT;
   poly_int64 access_size = GET_MODE_SIZE (read_mode) + gap;
   read_reg = find_shift_sequence (access_size, store_info, read_mode,
@@ -1940,6 +1943,10 @@ get_stored_val (store_info *store_info, machine_mode 
read_mode,
   || GET_MODE_CLASS (read_mode) != GET_MODE_CLASS (store_mode)))
 read_reg = extract_low_bits (read_mode, store_mode,
 copy_rtx (store_info->const_rhs));
+  else if (VECTOR_MODE_P (read_mode) && VECTOR_MODE_P (store_mode)
+&& known_le (GET_MODE_BITSIZE (read_mode), GET_MODE_BITSIZE (store_mode))
+

RE: [PATCH v1] RISC-V: Fix RVV dynamic frm tests failure

2023-11-12 Thread Li, Pan2
Committed, thanks Juzhe.

Pan

From: juzhe.zh...@rivai.ai 
Sent: Monday, November 13, 2023 11:11 AM
To: Li, Pan2 ; gcc-patches 
Cc: Li, Pan2 ; Wang, Yanzhang ; 
kito.cheng 
Subject: Re: [PATCH v1] RISC-V: Fix RVV dynamic frm tests failure

OK


juzhe.zh...@rivai.ai

From: pan2.li
Date: 2023-11-13 11:10
To: gcc-patches
CC: juzhe.zhong; 
pan2.li; 
yanzhang.wang; 
kito.cheng
Subject: [PATCH v1] RISC-V: Fix RVV dynamic frm tests failure
From: Pan Li <pan2...@intel.com>

The enhancement of mode-switching performs some optimization when
emitting the frm backup insn, so some redundant fsrm insns are removed
for the following test cases.

This patch adjusts the asm checks to account for that optimization.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c: Adjust
the asm checker.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c: Ditto.

Signed-off-by: Pan Li <pan2...@intel.com>
---
.../gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c  | 2 +-
.../gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c  | 2 +-
.../gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c  | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c
index 8c67d4bba81..f33f303c0cb 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c
@@ -33,6 +33,6 @@ test_float_point_dynamic_frm (vfloat32m1_t op1, vfloat32m1_t 
op2,
/* { dg-final { scan-assembler-times 
{vfadd\.v[vf]\s+v[0-9]+,\s*v[0-9]+,\s*[fav]+[0-9]+} 4 } } */
/* { dg-final { scan-assembler-times {frrm\s+[axs][0-9]+} 3 } } */
-/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 4 } } */
+/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 2 } } */
/* { dg-final { scan-assembler-times {fsrmi\s+[01234]} 1 } } */
/* { dg-final { scan-assembler-not {fsrmi\s+[axs][0-9]+,\s*[01234]} } } */
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c
index 7ac9c960e65..cc0fb556da3 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c
@@ -33,6 +33,6 @@ test_float_point_dynamic_frm (vfloat32m1_t op1, vfloat32m1_t 
op2,
/* { dg-final { scan-assembler-times 
{vfadd\.v[vf]\s+v[0-9]+,\s*v[0-9]+,\s*[fav]+[0-9]+} 4 } } */
/* { dg-final { scan-assembler-times {frrm\s+[axs][0-9]+} 3 } } */
-/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 4 } } */
+/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 2 } } */
/* { dg-final { scan-assembler-times {fsrmi\s+[01234]} 1 } } */
/* { dg-final { scan-assembler-not {fsrmi\s+[axs][0-9]+,\s*[01234]} } } */
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c
index c5f96bc45c0..c5c3408be30 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c
@@ -33,6 +33,6 @@ test_float_point_dynamic_frm (vfloat32m1_t op1, vfloat32m1_t 
op2,
/* { dg-final { scan-assembler-times 
{vfadd\.v[vf]\s+v[0-9]+,\s*v[0-9]+,\s*[fav]+[0-9]+} 4 } } */
/* { dg-final { scan-assembler-times {frrm\s+[axs][0-9]+} 3 } } */
-/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 4 } } */
+/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 2 } } */
/* { dg-final { scan-assembler-times {fsrmi\s+[01234]} 2 } } */
/* { dg-final { scan-assembler-not {fsrmi\s+[axs][0-9]+,\s*[01234]} } } */
--
2.34.1




Re: [PATCH v1] RISC-V: Fix RVV dynamic frm tests failure

2023-11-12 Thread juzhe.zh...@rivai.ai
OK



juzhe.zh...@rivai.ai
 


[PATCH v1] RISC-V: Fix RVV dynamic frm tests failure

2023-11-12 Thread pan2 . li
From: Pan Li 

The enhancement of mode-switching performs some optimization when
emitting the frm backup insn, so some redundant fsrm insns are removed
for the following test cases.

This patch adjusts the asm checks to account for that optimization.
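
What the adjusted scan-assembler-times directives count can be sketched
by hand.  The function below is a hypothetical helper, not part of the
testsuite: it counts occurrences of the pattern fsrm\s+[axs][0-9]+ in an
assembly string, so "fsrmi 1" is deliberately not counted because the
text after "fsrm" is not whitespace.

```c
#include <ctype.h>
#include <string.h>

/* Count matches of fsrm\s+[axs][0-9]+ in ASM_TEXT (hand-rolled matcher,
   written for illustration only).  */
int
count_fsrm_reg (const char *asm_text)
{
  int count = 0;
  const char *p = asm_text;

  while ((p = strstr (p, "fsrm")) != NULL)
    {
      const char *q = p + 4;

      /* \s+ : at least one whitespace character after the mnemonic.  */
      if (isspace ((unsigned char) *q))
	{
	  while (isspace ((unsigned char) *q))
	    q++;

	  /* [axs][0-9]+ : a register name such as a4 or s1.  */
	  if ((*q == 'a' || *q == 'x' || *q == 's')
	      && isdigit ((unsigned char) q[1]))
	    count++;
	}

      p += 4;
    }

  return count;
}
```

For an output containing one frrm, one fsrmi and two fsrm instructions
with register operands, the helper reports two matches, which is the
shape of the new expectation in the three tests.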

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c: Adjust
the asm checker.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c: Ditto.

Signed-off-by: Pan Li 
---
 .../gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c  | 2 +-
 .../gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c  | 2 +-
 .../gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c
index 8c67d4bba81..f33f303c0cb 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c
@@ -33,6 +33,6 @@ test_float_point_dynamic_frm (vfloat32m1_t op1, vfloat32m1_t 
op2,
 
 /* { dg-final { scan-assembler-times 
{vfadd\.v[vf]\s+v[0-9]+,\s*v[0-9]+,\s*[fav]+[0-9]+} 4 } } */
 /* { dg-final { scan-assembler-times {frrm\s+[axs][0-9]+} 3 } } */
-/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 4 } } */
+/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 2 } } */
 /* { dg-final { scan-assembler-times {fsrmi\s+[01234]} 1 } } */
 /* { dg-final { scan-assembler-not {fsrmi\s+[axs][0-9]+,\s*[01234]} } } */
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c
index 7ac9c960e65..cc0fb556da3 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c
@@ -33,6 +33,6 @@ test_float_point_dynamic_frm (vfloat32m1_t op1, vfloat32m1_t 
op2,
 
 /* { dg-final { scan-assembler-times 
{vfadd\.v[vf]\s+v[0-9]+,\s*v[0-9]+,\s*[fav]+[0-9]+} 4 } } */
 /* { dg-final { scan-assembler-times {frrm\s+[axs][0-9]+} 3 } } */
-/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 4 } } */
+/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 2 } } */
 /* { dg-final { scan-assembler-times {fsrmi\s+[01234]} 1 } } */
 /* { dg-final { scan-assembler-not {fsrmi\s+[axs][0-9]+,\s*[01234]} } } */
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c 
b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c
index c5f96bc45c0..c5c3408be30 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c
@@ -33,6 +33,6 @@ test_float_point_dynamic_frm (vfloat32m1_t op1, vfloat32m1_t 
op2,
 
 /* { dg-final { scan-assembler-times 
{vfadd\.v[vf]\s+v[0-9]+,\s*v[0-9]+,\s*[fav]+[0-9]+} 4 } } */
 /* { dg-final { scan-assembler-times {frrm\s+[axs][0-9]+} 3 } } */
-/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 4 } } */
+/* { dg-final { scan-assembler-times {fsrm\s+[axs][0-9]+} 2 } } */
 /* { dg-final { scan-assembler-times {fsrmi\s+[01234]} 2 } } */
 /* { dg-final { scan-assembler-not {fsrmi\s+[axs][0-9]+,\s*[01234]} } } */
-- 
2.34.1



Re: [PATCH v2] In the pipeline, USE or CLOBBER should delay execution if it starts a new live range.

2023-11-12 Thread Jeff Law




On 11/12/23 19:16, Jin Ma wrote:


Unfortunately this patch has triggered a bootstrap comparison failure on
loongarch64-linux-gnu: https://gcc.gnu.org/PR112497.

It's also causing simple build failures on other targets.  For example
c6x-elf aborts when compiling gcc.c-torture/execute/pr82210 (and others)
with -O2 with that patch applied.

I've reverted it for now.  I'm not going to have time to investigate
this week.


I'm sorry for the trouble this has caused. It has been a long time
since I verified this patch, so I don't know what happened; I
will check it out :)
It happens to all of us.  It's been reverted, so it's not causing anyone 
problems anymore.   We also know that various ports have sensitivity to 
the patch, so we can do deeper testing on it once you think it's ready 
to go again.


Jeff


Re: [PATCH v2] In the pipeline, USE or CLOBBER should delay execution if it starts a new live range.

2023-11-12 Thread Jin Ma
> > 
> > Unfortunately this patch has triggered a bootstrap comparison failure on
> > loongarch64-linux-gnu: https://gcc.gnu.org/PR112497.
> It's also causing simple build failures on other targets.  For example 
> c6x-elf aborts when compiling gcc.c-torture/execute/pr82210 (and others) 
> with -O2 with that patch applied.
> 
> I've reverted it for now.  I'm not going to have time to investigate 
> this week.

I'm sorry for the trouble this has caused. It has been a long time
since I verified this patch, so I don't know what happened; I
will check it out :)

BR
Jin

> Jeff
> >

RE: [PATCH] Avoid generate vblendps with ymm16+

2023-11-12 Thread Hu, Lin1
On Saturday, November 11, 2023 4:11 AM,  Jakub Jelinek  wrote:
> On Thu, Nov 09, 2023 at 03:27:11PM +0800, Hongtao Liu wrote:
> > On Thu, Nov 9, 2023 at 3:15 PM Hu, Lin1  wrote:
> > >
> > > This patch aims to avoid generating vblendps with ymm16+.  It has been
> > > bootstrapped and tested on x86_64-pc-linux-gnu{-m32,-m64}. Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/112435
> > > * config/i386/sse.md: Adding constraints to restrict the
> > > generation of vblendps.
> > It should be "Don't output vblendps when evex sse reg or gpr32 is involved."
> > Others LGTM.
> 
> I've missed this patch, so wrote my own today, and am wondering
> 
> 1) if it isn't better to use separate alternative instead of
>    x86_evex_reg_mentioned_p, like in the patch below
> 2) why do you need the last two hunks in sse.md, both avx2_permv2ti and
>    *avx_vperm2f128_nozero insns only use x in constraints, never v,
>    so x86_evex_reg_mentioned_p ought to be always false there
>

Yes, I think your method is better. As for the second question, I didn't pay
attention to the constraints when I fixed this problem, so I learned something
useful here. Feel free to upstream this patch.

BRs,
Lin
 
>
> Here is the untested patch, of course you have more testcases (though, I
> think it is better to test dg-do assemble with avx512vl target rather than
> dg-do compile and scan the assembler, after all, the problem was that it
> didn't assemble).
> 
> 2023-11-10  Jakub Jelinek  
> 
>   PR target/112435
>   * config/i386/sse.md (avx512vl_shuf_32x4_1,
>   avx512dq_shuf_64x2_1): Add
>   alternative with just x instead of v constraints and use vblendps
>   as optimization only with that alternative.
> 
>   * gcc.target/i386/avx512vl-pr112435.c: New test.
> 
> --- gcc/config/i386/sse.md.jj 2023-11-09 09:04:18.616543403 +0100
> +++ gcc/config/i386/sse.md2023-11-10 15:56:44.138499931 +0100
> @@ -19235,11 +19235,11 @@ (define_expand "avx512dq_shuf_  })
> 
>  (define_insn
> "avx512dq_shuf_64x2_1"
> -  [(set (match_operand:VI8F_256 0 "register_operand" "=v")
> +  [(set (match_operand:VI8F_256 0 "register_operand" "=x,v")
>   (vec_select:VI8F_256
> (vec_concat:
> - (match_operand:VI8F_256 1 "register_operand" "v")
> - (match_operand:VI8F_256 2 "nonimmediate_operand" "vm"))
> + (match_operand:VI8F_256 1 "register_operand" "x,v")
> + (match_operand:VI8F_256 2 "nonimmediate_operand" "xm,vm"))
> (parallel [(match_operand 3 "const_0_to_3_operand")
>(match_operand 4 "const_0_to_3_operand")
>(match_operand 5 "const_4_to_7_operand") @@ -19254,7
> +19254,7 @@ (define_insn "avx512dq_shu
>mask = INTVAL (operands[3]) / 2;
>mask |= (INTVAL (operands[5]) - 4) / 2 << 1;
>operands[3] = GEN_INT (mask);
> -  if (INTVAL (operands[3]) == 2 && !)
> +  if (INTVAL (operands[3]) == 2 && ! && which_alternative
> + == 0)
>  return "vblendps\t{$240, %2, %1, %0|%0, %1, %2, 240}";
>return
> "vshuf64x2\t{%3, %2, %1, %0|%0 d7>, %1, %2, %3}";  } @@ -19386,11 +19386,11 @@ (define_expand
> "avx512vl_shuf_  })
> 
>  (define_insn "avx512vl_shuf_32x4_1"
> -  [(set (match_operand:VI4F_256 0 "register_operand" "=v")
> +  [(set (match_operand:VI4F_256 0 "register_operand" "=x,v")
>   (vec_select:VI4F_256
> (vec_concat:
> - (match_operand:VI4F_256 1 "register_operand" "v")
> - (match_operand:VI4F_256 2 "nonimmediate_operand" "vm"))
> + (match_operand:VI4F_256 1 "register_operand" "x,v")
> + (match_operand:VI4F_256 2 "nonimmediate_operand" "xm,vm"))
> (parallel [(match_operand 3 "const_0_to_7_operand")
>(match_operand 4 "const_0_to_7_operand")
>(match_operand 5 "const_0_to_7_operand") @@ -19414,7
> +19414,7 @@ (define_insn "avx512vl_shuf_mask |= (INTVAL (operands[7]) - 8) / 4 << 1;
>operands[3] = GEN_INT (mask);
> 
> -  if (INTVAL (operands[3]) == 2 && !)
> +  if (INTVAL (operands[3]) == 2 && ! && which_alternative
> + == 0)
>  return "vblendps\t{$240, %2, %1, %0|%0, %1, %2, 240}";
> 
>return
> "vshuf32x4\t{%3, %2, %1, %0|%0 nd11>, %1, %2, %3}";
> --- gcc/testsuite/gcc.target/i386/avx512vl-pr112435.c.jj  2023-11-10
> 16:04:21.708046771 +0100
> +++ gcc/testsuite/gcc.target/i386/avx512vl-pr112435.c 2023-11-10
> 16:03:51.053479094 +0100
> @@ -0,0 +1,13 @@
> +/* PR target/112435 */
> +/* { dg-do assemble { target { avx512vl && { ! ia32 } } } } */
> +/* { dg-options "-mavx512vl -O2" } */
> +
> +#include <immintrin.h>
> +
> +__m256i
> +foo (__m256i a, __m256i b)
> +{
> +  register __m256i c __asm__("ymm16") = a;
> +  asm ("" : "+v" (c));
> +  return _mm256_shuffle_i32x4 (c, b, 2);
> +}
> 
>   Jakub



Re: [PATCH v1] RISC-V: Support FP l/ll round and rint HF mode autovec

2023-11-12 Thread juzhe.zh...@rivai.ai
LGTM.



juzhe.zh...@rivai.ai
 
From: pan2.li
Date: 2023-11-12 21:47
To: gcc-patches
CC: juzhe.zhong; pan2.li; yanzhang.wang; kito.cheng
Subject: [PATCH v1] RISC-V: Support FP l/ll round and rint HF mode autovec
From: Pan Li 
 
This patch supports auto-vectorization of the FP APIs below, with
different type sizes on RV64 and RV32:
 
+------------+-----------+----------+
| API        | RV64      | RV32     |
+------------+-----------+----------+
| lrintf16   | HF => DI  | HF => SI |
| llrintf16  | HF => DI  | HF => DI |
| lroundf16  | HF => DI  | HF => SI |
| llroundf16 | HF => DI  | HF => DI |
+------------+-----------+----------+
 
Given below code:
void
test_lrintf16 (long *out, _Float16 *in, int count)
{
  for (unsigned i = 0; i < count; i++)
    out[i] = __builtin_lrintf16 (in[i]);
}
 
Before this patch:
.L3:
  lhu     a5,0(s0)
  addi    s0,s0,2
  addi    s1,s1,8
  fmv.s.x fa0,a5
  call    lrintf16
  sd      a0,-8(s1)
  bne     s0,s2,.L3
 
After this patch:
.L3:
  vsetvli a5,a2,e16,mf4,ta,ma
  vle16.v v1,0(a1)
  vfwcvt.f.f.v    v2,v1
  vsetvli zero,zero,e32,mf2,ta,ma
  vfwcvt.x.f.v    v1,v2
  vse64.v v1,0(a0)
  slli    a4,a5,1
  add     a1,a1,a4
  slli    a4,a5,3
  add     a0,a0,a4
  sub     a2,a2,a5
  bne     a2,zero,.L3
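
For readers who want the two conversion steps spelled out, here is a rough
scalar model of what the vectorized loop does per element: first widen HF to
SF (the bridge mode, as vfwcvt.f.f.v does), then convert SF to a 64-bit
integer (as vfwcvt.x.f.v does).  This is only a sketch and not code from the
patch: all function names are made up, the input is passed as raw IEEE
binary16 bits so that no _Float16 support is needed, the default
round-to-nearest-even rounding mode is assumed, and only finite inputs are
handled.

```c
#include <stdint.h>
#include <string.h>

/* Step 1 (HF => SF): widen IEEE binary16 bits to float.  Every binary16
   value is exactly representable as a float.  Hypothetical helper.  */
static float
model_widen_hf_to_sf (uint16_t h)
{
  uint32_t sign = (uint32_t) (h >> 15) << 31;
  uint32_t exp = (h >> 10) & 0x1f;
  uint32_t mant = h & 0x3ff;
  uint32_t bits;
  float f;

  if (exp == 0)
    {
      /* Zero or subnormal: the value is mant * 2^-24.  */
      f = (float) mant * 0x1p-24f;
      return sign ? -f : f;
    }
  if (exp == 31)
    bits = sign | 0x7f800000u | (mant << 13);   /* Inf or NaN.  */
  else
    bits = sign | ((exp + 112) << 23) | (mant << 13);
  memcpy (&f, &bits, sizeof f);
  return f;
}

/* Step 2 (SF => DI): round to nearest, ties to even.  Assumes X is finite
   and within binary16 range, so every operation below is exact.  */
static int64_t
model_sf_to_di_rint (float x)
{
  int64_t t = (int64_t) x;      /* Truncates toward zero.  */
  float frac = x - (float) t;
  if (frac > 0.5f || (frac == 0.5f && (t & 1)))
    t += 1;
  else if (frac < -0.5f || (frac == -0.5f && (t & 1)))
    t -= 1;
  return t;
}

/* The whole HF => SF => DI chain for one element.  */
static int64_t
model_lrintf16 (uint16_t h)
{
  return model_sf_to_di_rint (model_widen_hf_to_sf (h));
}
```

For instance, the binary16 encoding 0x4100 (2.5) goes through SF 2.5f and
ends up as 2, while 0x4300 (3.5) ends up as 4.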
 
gcc/ChangeLog:
 
* config/riscv/autovec.md: Add bridge mode to lrint and lround
pattern.
* config/riscv/riscv-protos.h (expand_vec_lrint): Add new arg
bridge machine mode.
(expand_vec_lround): Ditto.
* config/riscv/riscv-v.cc (emit_vec_widden_cvt_f_f): New helper
func impl to emit vfwcvt.f.f.
(emit_vec_rounding_to_integer): Handle the HF to DI rounding
with the bridge mode.
(expand_vec_lrint): Reorder the args.
(expand_vec_lround): Ditto.
(expand_vec_lceil): Ditto.
(expand_vec_lfloor): Ditto.
* config/riscv/vector-iterators.md: Add vector HFmode and bridge
mode for converting to DI.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/unop/math-llrintf16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-llroundf16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-lrintf16-rv32-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-lrintf16-rv64-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv32-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv64-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-llrintf16-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-llroundf16-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-lrintf16-rv32-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-lrintf16-rv64-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-lroundf16-rv32-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-lroundf16-rv64-0.c: New test.
 
Signed-off-by: Pan Li 
---
gcc/config/riscv/autovec.md   | 17 ++--
gcc/config/riscv/riscv-protos.h   |  4 +-
gcc/config/riscv/riscv-v.cc   | 51 
gcc/config/riscv/vector-iterators.md  | 82 ++-
.../riscv/rvv/autovec/unop/math-llrintf16-0.c | 14 
.../rvv/autovec/unop/math-llroundf16-0.c  | 21 +
.../rvv/autovec/unop/math-lrintf16-rv32-0.c   | 13 +++
.../rvv/autovec/unop/math-lrintf16-rv64-0.c   | 15 
.../rvv/autovec/unop/math-lroundf16-rv32-0.c  | 18 
.../rvv/autovec/unop/math-lroundf16-rv64-0.c  | 20 +
.../riscv/rvv/autovec/vls/math-llrintf16-0.c  | 28 +++
.../riscv/rvv/autovec/vls/math-llroundf16-0.c | 28 +++
.../rvv/autovec/vls/math-lrintf16-rv32-0.c| 27 ++
.../rvv/autovec/vls/math-lrintf16-rv64-0.c| 28 +++
.../rvv/autovec/vls/math-lroundf16-rv32-0.c   | 27 ++
.../rvv/autovec/vls/math-lroundf16-rv64-0.c   | 28 +++
16 files changed, 397 insertions(+), 24 deletions(-)
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-llrintf16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-llroundf16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-lrintf16-rv32-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-lrintf16-rv64-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv32-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv64-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-llrintf16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-llroundf16-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-lrintf16-rv32-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-lrintf16-rv64-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-lroundf16-rv32-0.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-lroundf16-rv64-0.c
 
diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 868b47c8af7..80e41af6334 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -2455,14 +2455,13 @@ (define_expand "roundeve

Re: Re: [PATCH 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread juzhe.zh...@rivai.ai
>> Ah, nice!  How configurable are the bit ranges?
I think the bit ranges are configurable with Lehua's patch, since it lets the
target flexibly track subreg liveness according to REGMODE_NATURAL_SIZE:

+/* Return true if REGNO is a pseudo and REG_MODE spans multiple registers.  */
+bool
+need_track_subreg (int regno, machine_mode reg_mode)
+{
+  poly_int64 total_size = GET_MODE_SIZE (reg_mode);
+  poly_int64 natural_size = REGMODE_NATURAL_SIZE (reg_mode);
+  return maybe_gt (total_size, natural_size)
+         && multiple_p (total_size, natural_size)
+         && regno >= FIRST_PSEUDO_REGISTER;
+}
It depends on how the target configures the REGMODE_NATURAL_SIZE target hook.

If it returns the QImode size, his patch enables tracking subreg bit ranges at
QImode granularity.
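
To make the condition concrete, here is a simplified standalone model of the
check, with plain byte counts instead of poly_int64.  It is only a sketch:
the function name, the MODEL_FIRST_PSEUDO_REGISTER value and the sizes used
below are made up for illustration and do not come from any target.

```c
#include <stdbool.h>

/* Hypothetical stand-in for FIRST_PSEUDO_REGISTER; the real value is
   target-specific.  */
#define MODEL_FIRST_PSEUDO_REGISTER 64

/* Simplified model of need_track_subreg: sizes are constant byte counts.
   A register is tracked when it is a pseudo whose mode is strictly larger
   than, and a whole multiple of, the natural register size.  */
static bool
model_need_track_subreg (int regno, long total_size, long natural_size)
{
  return total_size > natural_size
         && total_size % natural_size == 0
         && regno >= MODEL_FIRST_PSEUDO_REGISTER;
}
```

With a natural size of 16 bytes, a 32-byte pseudo would be tracked, while a
16-byte pseudo, a 24-byte pseudo, or any hard register would not.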


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-11-12 19:53
To: 钟居哲
CC: Jeff Law; 丁乐华; gcc-patches; vmakarov
Subject: Re: [PATCH 0/7] ira/lra: Support subreg coalesce
钟居哲  writes:
> Hi, Richard.
>
>>> Maybe dead lanes are better tracked at the gimple level though, not sure.
>>> (But AArch64 might need to lower lane operations more than it does now if
>>> we want gimple to handle it.)
>
> We were trying to address this issue at the GIMPLE level at the beginning.
> Tracking subreg lanes of tuple types may be enough for aarch64, since
> aarch64 only has tuple types.
> However, for RVV, that's not enough to address all issues.
> Consider the following situation:
> https://godbolt.org/z/fhTvEjvr8
>
> Compared with LLVM, GCC emits many redundant "vmv1r.v" move instructions,
> because GCC is not able to track subreg liveness, whereas LLVM can.
>
> The reasons why tracking sub-lanes in GIMPLE cannot address these redundant
> move issues for RVV:
>
> 1. RVV has tuple types like "vint8m1x2_t", which is totally the same as
> aarch64 "svint8x1_t".
> It is used by segment load/store, which is similar to the "ld2r"
> instruction in ARM SVE (vec_load_lanes/vec_store_lanes).
> Supporting sub-lane tracking in GIMPLE can fix this situation for both RVV
> and ARM SVE.
> 
> 2. However, we do not only have "vint8m1x2_t"; we also have "vint8m2_t"
> (LMUL = 2), which also occupies 2 registers but is not a tuple type.
> Instead, it is a simple vector type, used by all simple operations.
> For example, "vadd" with vint8m1_t does a PLUS operation on a single
> vector register, whereas the same instruction "vadd" with vint8m2_t does
> a PLUS operation on 2 vector registers.  We can't define such types as
> tuple types for the following reasons:
> 1). We also have tuple types for LMUL > 1; for example, "vint8m2x2_t" is
>  a tuple type.  If we defined "vint8m2_t" as a tuple type, what about
>  "vint8m2x2_t"?  A tuple of tuples, or an array of arrays?  It makes the
>  type very strange.
> 2). The RVV intrinsic doc defines vint8m2x2_t as a tuple type, but
>  vint8m2_t is not a tuple type.  We are not able to change the documents.
> 3). Clang has supported RVV intrinsics for 3 years; vint8m2_t has not been
>  a tuple type during that time and is widely used, so changing the type
>  definition would break the ecosystem.  For compatibility, we are not
>  able to define LMUL > 1 types as tuple types.
>
> For these reasons, we should be able to access the high part and the low
> part of vint8m2_t; we provide vget to generate subreg accesses of the
> vector mode.
>
> So, at the discussion stage, we decided to address subpart access of vector
> modes in a more generic way, which is to support subreg liveness tracking
> at the RTL level.  That way it can address not only the issues that happen
> on ARM SVE, but also the issues for LMUL > 1.
>
> 3. After we decided to support subreg liveness tracking in RTL, we studied
> LLVM.  LLVM has a standalone pass, called the register coalescer, right
> before their linear scan RA (greedy).
> So, the first draft of our solution supported register coalescing before
> RA, simulating the LLVM solution.  It is open source:
> riscv-gcc/gcc/ira-coalesce.cc at riscv-gcc-rvv-next ·
> riscv-collab/riscv-gcc (github.com)
> However, we didn't think that solution was elegant, so we consulted Vlad.
> Vlad suggested we enhance IRA/LRA with subreg liveness tracking, which
> turned out to be a more reasonable and elegant approach.
>
> So, after several experiments and investigations, Lehua dedicated himself
> to producing this series of patches.
> We think Lehua's approach should be a generic and optimal solution to
> these subreg problems.
 
Ah, sorry, I caused a misunderstanding.  In the message quoted above,
I'd moved on from talking about tracking liveness of vectors in a tuple.
I was instead talking about tracking the liveness of individual lanes
in a single vector.
 
I was responding to Jeff's description of the bit-level liveness tracking
pass.  That pass solves a generic issue: redundant sign and zero extensions.
But it sounded like it could also be reused 

Re: [PATCH V2] VECT: Support mask_len_strided_load/mask_len_strided_store in loop vectorize

2023-11-12 Thread juzhe.zh...@rivai.ai
Hi. Ping for this patch, which adds the last optab pattern needed for RVV support.

The mask_len_strided_load/mask_len_strided_store document has been approved:

https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635103.html 

Bootstrapped and regtested on x86 with no regressions.
Tested on aarch64 with no regressions.
Tested on RISC-V with no regressions.


juzhe.zh...@rivai.ai
 
From: Juzhe-Zhong
Date: 2023-11-06 14:55
To: gcc-patches
CC: richard.sandiford; rguenther; Juzhe-Zhong
Subject: [PATCH V2] VECT: Support mask_len_strided_load/mask_len_strided_store in loop vectorize
This patch adds strided load/store support to the loop vectorizer, depending on
STMT_VINFO_STRIDED_P.
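
As a reading aid, here is a scalar sketch of what a masked, length-limited
strided load is understood to compute.  It is an assumption-laden model, not
GCC code: the function name and parameter list are made up, the stride is
taken to be in bytes, and the bias operand of the real internal function is
left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Scalar model of a masked, length-limited strided load of int elements.
   Element I is read from BASE + I * STRIDE (STRIDE in bytes) when its mask
   bit is set and I < LEN; all other elements of DEST are left untouched.
   NUNITS is the number of elements in the modelled vector.  */
static void
model_mask_len_strided_load (int *dest, const void *base, ptrdiff_t stride,
                             const bool *mask, size_t len, size_t nunits)
{
  for (size_t i = 0; i < nunits && i < len; i++)
    if (mask[i])
      memcpy (&dest[i], (const char *) base + (ptrdiff_t) i * stride,
              sizeof (int));
}
```

A strided store would be the mirror image, writing dest-side elements out to
base + i * stride under the same mask and length conditions.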
 
Bootstrap and regression on X86 passed.
 
Ok for trunk ?
 
gcc/ChangeLog:
 
* internal-fn.cc (strided_load_direct): New function.
(strided_store_direct): Ditto.
(expand_strided_store_optab_fn): Ditto.
(expand_scatter_store_optab_fn): Add strided store.
(expand_strided_load_optab_fn): New function.
(expand_gather_load_optab_fn): Add strided load.
(direct_strided_load_optab_supported_p): New function.
(direct_strided_store_optab_supported_p): Ditto.
(internal_load_fn_p): Add strided load.
(internal_strided_fn_p): New function.
(internal_fn_len_index): Add strided load/store.
(internal_fn_mask_index): Ditto.
(internal_fn_stored_value_index): Add strided store.
(internal_strided_fn_supported_p): New function.
* internal-fn.def (MASK_LEN_STRIDED_LOAD): New IFN.
(MASK_LEN_STRIDED_STORE): Ditto.
* internal-fn.h (internal_strided_fn_p): New function.
(internal_strided_fn_supported_p): Ditto.
* optabs-query.cc (supports_vec_gather_load_p): Add strided load.
(supports_vec_scatter_store_p): Add strided store.
* optabs-query.h (supports_vec_gather_load_p): Add strided load.
(supports_vec_scatter_store_p): Add strided store.
* tree-vect-data-refs.cc (vect_prune_runtime_alias_test_list): Add strided 
load/store.
(vect_gather_scatter_fn_p): Ditto.
(vect_check_gather_scatter): Ditto.
* tree-vect-stmts.cc (check_load_store_for_partial_vectors): Ditto.
(vect_truncate_gather_scatter_offset): Ditto.
(vect_use_strided_gather_scatters_p): Ditto.
(vect_get_strided_load_store_ops): Ditto.
(vectorizable_store): Ditto.
(vectorizable_load): Ditto.
* tree-vectorizer.h (vect_gather_scatter_fn_p): Ditto.
 
---
gcc/internal-fn.cc | 101 -
gcc/internal-fn.def|   4 ++
gcc/internal-fn.h  |   2 +
gcc/optabs-query.cc|  25 ++---
gcc/optabs-query.h |   4 +-
gcc/tree-vect-data-refs.cc |  45 +
gcc/tree-vect-stmts.cc |  65 ++--
gcc/tree-vectorizer.h  |   2 +-
8 files changed, 199 insertions(+), 49 deletions(-)
 
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index c7d3564faef..a31a65755c7 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -164,6 +164,7 @@ init_internal_fns ()
#define load_lanes_direct { -1, -1, false }
#define mask_load_lanes_direct { -1, -1, false }
#define gather_load_direct { 3, 1, false }
+#define strided_load_direct { -1, -1, false }
#define len_load_direct { -1, -1, false }
#define mask_len_load_direct { -1, 4, false }
#define mask_store_direct { 3, 2, false }
@@ -172,6 +173,7 @@ init_internal_fns ()
#define vec_cond_mask_direct { 1, 0, false }
#define vec_cond_direct { 2, 0, false }
#define scatter_store_direct { 3, 1, false }
+#define strided_store_direct { 1, 1, false }
#define len_store_direct { 3, 3, false }
#define mask_len_store_direct { 4, 5, false }
#define vec_set_direct { 3, 3, false }
@@ -3561,62 +3563,87 @@ expand_LAUNDER (internal_fn, gcall *call)
   expand_assignment (lhs, gimple_call_arg (call, 0), false);
}
+#define expand_strided_store_optab_fn expand_scatter_store_optab_fn
+
/* Expand {MASK_,}SCATTER_STORE{S,U} call CALL using optab OPTAB.  */
static void
expand_scatter_store_optab_fn (internal_fn, gcall *stmt, direct_optab optab)
{
+  insn_code icode;
   internal_fn ifn = gimple_call_internal_fn (stmt);
   int rhs_index = internal_fn_stored_value_index (ifn);
   tree base = gimple_call_arg (stmt, 0);
   tree offset = gimple_call_arg (stmt, 1);
-  tree scale = gimple_call_arg (stmt, 2);
   tree rhs = gimple_call_arg (stmt, rhs_index);
   rtx base_rtx = expand_normal (base);
   rtx offset_rtx = expand_normal (offset);
-  HOST_WIDE_INT scale_int = tree_to_shwi (scale);
   rtx rhs_rtx = expand_normal (rhs);
   class expand_operand ops[8];
   int i = 0;
   create_address_operand (&ops[i++], base_rtx);
-  create_input_operand (&ops[i++], offset_rtx, TYPE_MODE (TREE_TYPE (offset)));
-  create_integer_operand (&ops[i++], TYPE_UNSIGNED (TREE_TYPE (offset)));
-  create_integer_operand (&ops[i++], scale_int);
+  if (internal_strided_fn_p (ifn))
+{
+  create_address_operand (&ops[i++], offset_rtx);
+  icode = direct_optab_handler (optab, TYPE_MODE (TREE_TYPE (rhs)));
+}
+  else
+{
+  tree scale = gimple_call_arg (stmt, 2);
+  HOST_WIDE_INT scale_int = tree_to_shwi (scale);
+  create_input_operand (&ops[i++], offs

Re: [PATCH v2 1/7] aarch64: Use br instead of ret for eh_return

2023-11-12 Thread Hans-Peter Nilsson
> From: Szabolcs Nagy 
> Date: Fri, 3 Nov 2023 15:36:08 +

I don't see others commenting on this patch, and you're not
mentioning this aspect, so I wonder:

>   * config/aarch64/aarch64.h (EH_RETURN_TAKEN_RTX): Define.
>   (EH_RETURN_STACKADJ_RTX): Change to R5.
>   (EH_RETURN_HANDLER_RTX): Change to R6.

Isn't this an ABI change?

(I've forgotten relevant bits of the exception machinery; if
throw and catch are always in the same object and everything
in between register-number-agnostic then the only flaw would
be not mentioning that in the commit message.)

brgds, H-P


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-11-12 Thread Sam James


Sam James  writes:

> Alexander Monakov  writes:
> [...]
>>
>> I'm very curious what you mean by "this has come up with LLVM [] too":
>> ttbomk,
>> LLVM doesn't do such lifetime-based optimization yet, which is why compiling
>> LLVM with LLVM doesn't break it. Can you share some examples? Or do you mean
>> instances when libLLVM-miscompiled-with-GCC was linked elsewhere, and that
>> program crashed mysteriously as a result?
>>
>> Indeed this work is inspired by the LLVM incident in PR 106943.
>
> [...]
> I had some vague memories in the back of my head so I went digging
> because I enjoy this:
> [...]

I ended up stumbling on two more:

* charm (https://github.com/UIUC-PPL/charm/issues/1045)
* firebird (https://github.com/FirebirdSQL/firebird/issues/5384, starring richi)

Now I'm really done :)

> [...]
>>
>> Alexander
>
> thanks,
> sam



[PATCH] PR112380: Defend against CLOBBERs in RTX expressions in combine.cc

2023-11-12 Thread Roger Sayle

This patch addresses PR rtl-optimization/112380, an ICE-on-valid regression
where a (clobber (const_int 0)) encounters a sanity checking gcc_assert
(at line 7554) in simplify-rtx.cc.  These CLOBBERs are used internally
by GCC's combine pass much like error_mark_node is used by various
language front-ends.

The solutions are either to handle/accept these CLOBBERs through-out
(or in more places in) the middle-end's RTL optimizers, including functions
in simplify-rtx.cc that are used by passes other than combine, and/or
attempt to prevent these CLOBBERs escaping from try_combine into the
RTX/RTL stream.  The benefit of the second approach is that it actually
allows for better optimization: when try_combine fails to simplify an
expression instead of substituting a CLOBBER to avoid the instruction
pattern being recognized, noticing the CLOBBER often allows combine
to attempt alternate simplifications/transformations looking for those
that can be recognized.
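
The second approach boils down to checking for the failure marker at the call
site before building anything on top of it.  A toy model of that pattern is
sketched below; every name in it is hypothetical and only mimics the shape of
RTL, it is not code from combine.cc.

```c
/* Toy expression node; the names only mimic the shape of GCC's RTL.  */
enum toy_code { TOY_REG, TOY_AND, TOY_CLOBBER };

struct toy_rtx
{
  enum toy_code code;
  int width;                    /* Width of the value in bits.  */
};

/* Shared failure marker, playing the role of (clobber (const_int 0)).  */
static struct toy_rtx toy_failure = { TOY_CLOBBER, 0 };

/* Return X if it is at least WIDTH bits wide, else the failure marker.  */
static struct toy_rtx *
toy_lowpart (struct toy_rtx *x, int width)
{
  return x->width >= width ? x : &toy_failure;
}

/* Build an AND of the low part in *OUT only if taking the low part
   succeeded.  Return 1 on success, or 0 so the caller can give up on
   this transformation and try an alternative.  */
static int
toy_mask_lowpart (struct toy_rtx *x, int width, struct toy_rtx *out)
{
  struct toy_rtx *low = toy_lowpart (x, width);
  if (low->code == TOY_CLOBBER)
    return 0;
  out->code = TOY_AND;
  out->width = width;
  return 1;
}
```

The point of returning early is that the marker never gets wrapped inside a
larger expression that a later, unrelated simplifier might trip over.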

This patch is provided as two alternatives.  The first is the minimal
fix to address the CLOBBER encountered in the bugzilla PR.  Assuming
this approach is the correct fix to a latent bug/liability through-out
combine.cc, the second alternative fixes many of the places that may
potentially trigger problems in future, and allows combine to attempt
more valid combinations/transformations.  These were identified
proactively by changing the "fail:" case in gen_lowpart_for_combine
to return NULL_RTX, and working through the fall-out sufficient for
x86_64 to bootstrap and regression test without new failures.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-11-12  Roger Sayle  

gcc/ChangeLog
PR rtl-optimization/112380
* combine.cc (expand_field_assignment): Check if gen_lowpart
returned a CLOBBER, and avoid calling gen_simplify_binary with
it if so.

gcc/testsuite/ChangeLog
PR rtl-optimization/112380
* gcc.dg/pr112380.c: New test case.

gcc/ChangeLog
PR rtl-optimization/112380
* combine.cc (find_split_point): Check if gen_lowpart returned
a CLOBBER.
(subst): Check if combine_simplify_rtx returned a CLOBBER.
(simplify_set): Check if force_to_mode returned a CLOBBER.
Check if gen_lowpart returned a CLOBBER.
(expand_field_assignment): Likewise.
(make_extraction): Check if force_to_mode returned a CLOBBER.
(force_int_to_mode): Likewise.
(simplify_and_const_int_1): Check if VAROP is a CLOBBER, after
call to force_to_mode (and before).
(simplify_comparison): Check if force_to_mode returned a CLOBBER.
Check if gen_lowpart returned a CLOBBER.

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 6344cd3..f2c64a9 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -7466,6 +7466,11 @@ expand_field_assignment (const_rtx x)
   if (!targetm.scalar_mode_supported_p (compute_mode))
break;
 
+  /* gen_lowpart_for_combine returns CLOBBER on failure.  */
+  rtx lowpart = gen_lowpart (compute_mode, SET_SRC (x));
+  if (GET_CODE (lowpart) == CLOBBER)
+   break;
+
   /* Now compute the equivalent expression.  Make a copy of INNER
 for the SET_DEST in case it is a MEM into which we will substitute;
 we don't want shared RTL in that case.  */
@@ -7480,9 +7485,7 @@ expand_field_assignment (const_rtx x)
 inner);
   masked = simplify_gen_binary (ASHIFT, compute_mode,
simplify_gen_binary (
- AND, compute_mode,
- gen_lowpart (compute_mode, SET_SRC (x)),
- mask),
+ AND, compute_mode, lowpart, mask),
pos);
 
   x = gen_rtx_SET (copy_rtx (inner),
diff --git a/gcc/combine.cc b/gcc/combine.cc
index 6344cd3..969eb9d 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -5157,36 +5157,37 @@ find_split_point (rtx *loc, rtx_insn *insn, bool 
set_src)
 always at least get 8-bit constants in an AND insn, which is
 true for every current RISC.  */
 
- if (unsignedp && len <= 8)
+ rtx lowpart = gen_lowpart (mode, inner);
+ if (lowpart && GET_CODE (lowpart) != CLOBBER)
{
- unsigned HOST_WIDE_INT mask
-   = (HOST_WIDE_INT_1U << len) - 1;
- rtx pos_rtx = gen_int_shift_amount (mode, pos);
- SUBST (SET_SRC (x),
-gen_rtx_AND (mode,
- gen_rtx_LSHIFTRT
- (mode, gen_lowpart (mode, inner), pos_rtx),
- gen_int_mode (mask, mode)));
-
- split = find_split_point (&SET_SRC (x), insn, true);
-

Re: [PATCH] libgccjit: Fix GGC segfault when using -flto

2023-11-12 Thread David Malcolm
On Fri, 2023-11-10 at 18:14 -0500, David Malcolm wrote:
> On Fri, 2023-11-10 at 11:02 -0500, Antoni Boucher wrote:
> > Hi.
> > This patch fixes the segfault when using -flto with libgccjit (bug
> > 111396).
> > 
> > You mentioned in bugzilla that this didn't fix the reproducer for
> > you,
> 
> Rereading https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111396 it
> looks
> like all I tested back in August was your reproducer; I didn't yet
> test
> your patch.
> 
> > but it does for me.
> > At first, the test case would not pass, but running "make install"
> > made
> > it pass.
> > Not sure if this is normal.
> > 
> > Could you please check if this fixes the issue on your side as
> > well?
> > Since this patch changes files outside of gcc/jit, what tests
> > should
> > I
> > run to make sure it didn't break anything?
> 
> I'm trying your patch in my tester now.

Bootstrapped with x86_64-pc-linux-gnu/build.  No changes to non-jit
tests, but had this effect on jit.sum:

Changes to jit.sum
------------------

  FAIL: 9->11 (+2)
  PASS: 14827->11434 (-3393)

apparently due to:
 FAIL: test-combination.c.exe iteration 1 of 5: verify_code_accessing_bitfield: result is NULL
 FAIL: test-combination.c.exe killed: 997638 exp16 0 0 CHILDKILLED SIGABRT SIGABRT

> 
> BTW, we shouldn't add test-ggc-bugfix to since it adds options to the
> context: this would affect all the other tests.




[x86 PATCH] Improve reg pressure of double-word right-shift then truncate.

2023-11-12 Thread Roger Sayle

This patch improves register pressure during reload, inspired by PR 97756.
Normally, a double-word right-shift by a constant produces a double-word
result, the highpart of which is dead when followed by a truncation.
The dead code calculating the high part gets cleaned up post-reload, so
the issue isn't normally visible, except for the increased register
pressure during reload, sometimes leading to odd register assignments.
Providing a post-reload splitter, which clobbers a single wordmode
result register instead of a doubleword result register, helps (a bit).
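
As a small illustration of why only one word-sized result is needed, the low
word of a double-word right shift can be computed directly from the two
halves, which is what a shrd-style instruction does.  The sketch below covers
the logical shift with a 128-bit value and counts between 1 and 63; the
function names are made up, and __uint128_t is a GCC/Clang extension on
64-bit targets.

```c
#include <stdint.h>

/* Low 64 bits of a 128-bit logical right shift by COUNT (0 < COUNT < 64),
   computed from the two 64-bit halves.  The high half of the shifted
   value is never materialized.  */
static uint64_t
model_shift_right_lowpart (uint64_t lo, uint64_t hi, unsigned count)
{
  return (lo >> count) | (hi << (64 - count));
}

/* Reference: full double-word shift followed by truncation.  */
static uint64_t
reference_shift_then_truncate (__uint128_t n, unsigned count)
{
  return (uint64_t) (n >> count);
}
```

Both functions should agree for every count in that range, so the high part
produced by the full shift is dead as soon as the result is truncated.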

An example demonstrating this effect is:

#define MASK60 ((1ul << 60) - 1)
unsigned long foo (__uint128_t n)
{
  unsigned long a = n & MASK60;
  unsigned long b = (n >> 60);
  b = b & MASK60;
  unsigned long c = (n >> 120);
  return a+b+c;
}

which currently with -O2 generates (13 instructions):
foo:    movabsq $1152921504606846975, %rcx
        xchgq   %rdi, %rsi
        movq    %rsi, %rax
        shrdq   $60, %rdi, %rax
        movq    %rax, %rdx
        movq    %rsi, %rax
        movq    %rdi, %rsi
        andq    %rcx, %rax
        shrq    $56, %rsi
        andq    %rcx, %rdx
        addq    %rsi, %rax
        addq    %rdx, %rax
        ret

with this patch, we generate one less mov (12 instructions):
foo:    movabsq $1152921504606846975, %rcx
        xchgq   %rdi, %rsi
        movq    %rdi, %rdx
        movq    %rsi, %rax
        movq    %rdi, %rsi
        shrdq   $60, %rdi, %rdx
        andq    %rcx, %rax
        shrq    $56, %rsi
        addq    %rsi, %rax
        andq    %rcx, %rdx
        addq    %rdx, %rax
        ret

The significant difference is easier to see via diff:
<       shrdq   $60, %rdi, %rax
<       movq    %rax, %rdx
---
>       shrdq   $60, %rdi, %rdx


Admittedly a single "mov" isn't much of a saving on modern architectures,
but as demonstrated by the PR, people still track the number of them.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-11-12  Roger Sayle  

gcc/ChangeLog
* config/i386/i386.md (3_doubleword_lowpart): New
define_insn_and_split to optimize register usage of doubleword
right shifts followed by truncation.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 663db73..8a6928f 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -14833,6 +14833,31 @@
   [(const_int 0)]
   "ix86_split_ (operands, operands[3], mode); DONE;")
 
+;; Split truncations of TImode right shifts into x86_64_shrd_1.
+;; Split truncations of DImode right shifts into x86_shrd_1.
+(define_insn_and_split "3_doubleword_lowpart"
+  [(set (match_operand:DWIH 0 "register_operand" "=&r")
+   (subreg:DWIH
+ (any_shiftrt: (match_operand: 1 "register_operand" "r")
+(match_operand:QI 2 "const_int_operand")) 0))
+   (clobber (reg:CC FLAGS_REG))]
+  "UINTVAL (operands[2]) <  * BITS_PER_UNIT"
+  "#"
+  "&& reload_completed"
+  [(parallel
+  [(set (match_dup 0)
+   (ior:DWIH (lshiftrt:DWIH (match_dup 0) (match_dup 2))
+ (subreg:DWIH
+   (ashift: (zero_extend: (match_dup 3))
+ (match_dup 4)) 0)))
+   (clobber (reg:CC FLAGS_REG))])]
+{
+  split_double_mode (mode, &operands[1], 1, &operands[1], &operands[3]);
+  operands[4] = GEN_INT (( * BITS_PER_UNIT) - INTVAL (operands[2]));
+  if (!rtx_equal_p (operands[0], operands[3]))
+emit_move_insn (operands[0], operands[3]);
+})
+
 (define_insn "x86_64_shrd"
   [(set (match_operand:DI 0 "nonimmediate_operand" "+r*m")
 (ior:DI (lshiftrt:DI (match_dup 0)


[PATCH] Fix (fcopysign x, NEGATIVE_CONST) -> (fneg (fabs x)) simplification [PR112483]

2023-11-12 Thread Xi Ruoyao
(fcopysign x, NEGATIVE_CONST) can be simplified to (fneg (fabs x)), but
a logic error in the code caused it to be mistakenly simplified to (fneg x)
instead.
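
To see why the two forms differ, here is a small bit-level model of the three
operations on IEEE binary64 values.  It is only an illustrative sketch: the
helper names are made up, and it assumes double is IEEE 754 binary64.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_SIGN_BIT (UINT64_C (1) << 63)

static uint64_t
bits_of (double x)
{
  uint64_t b;
  memcpy (&b, &x, sizeof b);
  return b;
}

static double
double_of (uint64_t b)
{
  double x;
  memcpy (&x, &b, sizeof x);
  return x;
}

/* fabs clears the sign bit.  */
static double
model_abs (double x)
{
  return double_of (bits_of (x) & ~MODEL_SIGN_BIT);
}

/* fneg flips the sign bit.  */
static double
model_neg (double x)
{
  return double_of (bits_of (x) ^ MODEL_SIGN_BIT);
}

/* fcopysign takes the magnitude of X and the sign of Y.  */
static double
model_copysign (double x, double y)
{
  return double_of ((bits_of (x) & ~MODEL_SIGN_BIT)
                    | (bits_of (y) & MODEL_SIGN_BIT));
}
```

For x = -3.0 and a negative constant, copysign gives -3.0 and so does
neg (abs (x)), whereas neg (x) alone gives 3.0, which is the wrong result
the old code produced for negative x.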

gcc/ChangeLog:

PR rtl-optimization/112483
* simplify-rtx.cc (simplify_binary_operation_1) :
Fix the simplification of (fcopysign x, NEGATIVE_CONST).
---

Bootstrapped and regtested on loongarch64-linux-gnu and
x86_64-linux-gnu.  Ok for trunk?

 gcc/simplify-rtx.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 69d87579d9c..2d2e5a3c1ca 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -4392,7 +4392,7 @@ simplify_ashift:
  real_convert (&f1, mode, CONST_DOUBLE_REAL_VALUE (trueop1));
  rtx tmp = simplify_gen_unary (ABS, mode, op0, mode);
  if (REAL_VALUE_NEGATIVE (f1))
-   tmp = simplify_gen_unary (NEG, mode, op0, mode);
+   tmp = simplify_gen_unary (NEG, mode, tmp, mode);
  return tmp;
}
   if (GET_CODE (op0) == NEG || GET_CODE (op0) == ABS)
-- 
2.42.1



Re: [PATCH v2] In the pipeline, USE or CLOBBER should delay execution if it starts a new live range.

2023-11-12 Thread Xi Ruoyao
On Sun, 2023-11-12 at 11:02 -0700, Jeff Law wrote:
> 
> 
> On 11/12/23 10:41, Xi Ruoyao wrote:
> > On Sat, 2023-11-11 at 13:12 -0700, Jeff Law wrote:
> > > 
> > > 
> > > On 8/14/23 05:22, Jin Ma wrote:
> > > > CLOBBER and USE does not represent real instructions, but in the
> > > > process of pipeline optimization, they will wait for
> > > > transmission
> > > > in ready list like other insns, without considering resource
> > > > conflicts and cycles. This results in a multi-issue CPU
> > > > architecture
> > > > that can be issued at any time if other regular insns have
> > > > resource
> > > > conflicts or cannot be launched for other reasons. As a result,
> > > > its position is advanced in the generated insns sequence, which
> > > > will affect register allocation and often lead to more redundant
> > > > mov instructions.
> > > > 
> > > > A simple example:
> > > > https://github.com/majin2020/gcc-test/blob/master/test.c
> > > > This is a function in the dhrystone benchmark.
> > > > 
> > > > https://github.com/majin2020/gcc-test/blob/0b08c1a13de9663d7d9aba7539b960ec0607ca24/test.c.299r.sched1
> > > > This is a log of the pass 'sched1' When -mtune=rocket but
> > > > issue_rate
> > > > == 2.
> > > > 
> > > > The pipeline is:
> > > > ;; | insn | prio |
> > > > ;; |  17  |  3   | r142=a0 alu
> > > > ;; |  14  |  0   | clobber r136 nothing
> > > > ;; |  13  |  0   | clobber a0 nothing
> > > > ;; |  18  |  2   | r143=a1 alu
> > > > ...
> > > > ;; |  12  |  0   | a0=r136 alu
> > > > ;; |  15  |  0   | use a0 nothing
> > > > 
> > > > In this log, insn 13 and 14 are much ahead of schedule, which
> > > > risks
> > > > generating
> > > > redundant mov instructions, which seems unreasonable.
> > > > 
> > > > Therefore, I submit patch again on the basis of the last review
> > > > opinions to try to solve this problem.
> > > > 
> > > > https://github.com/majin2020/gcc-test/commit/efcb43e3369e771bde702955048bfe3f501263dd#diff-805031b1be5092a2322852a248d0b0f92eef7cad5784a8209f4dfc6221407457L189
> > > > This is the diff log of shed1 after patch is added.
> > > > 
> > > > The new pipeline is:
> > > > ;; | insn | prio |
> > > > ;; |  17  |  3   | r142=a0 alu
> > > > ...
> > > > ;; |  10  |  0   | [r144]=r141 alu
> > > > ;; |  13  |  0   | clobber a0 nothing
> > > > ;; |  14  |  0   | clobber r136 nothing
> > > > ;; |  12  |  0   | a0=r136 alu
> > > > ;; |  15  |  0   | use a0 nothing
> > > > 
> > > > gcc/ChangeLog:
> > > > * haifa-sched.cc (use_or_clobber_starts_range_p): New.
> > > > (prune_ready_list): USE or CLOBBER should delay
> > > > execution
> > > > if it starts a new live range.
> > > OK for the trunk.  It doesn't look like you have write access and
> > > I
> > > don't see anything about what testing was done.  Standard practice
> > > is
> > > to
> > > do a bootstrap and regression test on a primary platform such as
> > > x86,
> > > aarch64, ppc64.
> > > 
> > > I went ahead and did a bootstrap and regression test on x86_64,
> > > then
> > > pushed this to the trunk.
> > 
> > Unfortunately this patch has triggered a bootstrap comparison
> > failure on
> > loongarch64-linux-gnu: https://gcc.gnu.org/PR112497.
> It's also causing simple build failures on other targets.  For example
> c6x-elf aborts when compiling gcc.c-torture/execute/pr82210 (and
> others) 
> with -O2 with that patch applied.
> 
> I've reverted it for now.  I'm not going to have time to investigate 
> this week.

So I'm marking the PR fixed.  Please CC me when iterating this patch for
another round so I can test it on loongarch64-linux-gnu.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [PATCH v2] In the pipeline, USE or CLOBBER should delay execution if it starts a new live range.

2023-11-12 Thread Jeff Law




On 11/12/23 10:41, Xi Ruoyao wrote:
> On Sat, 2023-11-11 at 13:12 -0700, Jeff Law wrote:
> >
> > On 8/14/23 05:22, Jin Ma wrote:
> > > CLOBBER and USE does not represent real instructions, but in the
> > > process of pipeline optimization, they will wait for transmission
> > > in ready list like other insns, without considering resource
> > > conflicts and cycles. This results in a multi-issue CPU architecture
> > > that can be issued at any time if other regular insns have resource
> > > conflicts or cannot be launched for other reasons. As a result,
> > > its position is advanced in the generated insns sequence, which
> > > will affect register allocation and often lead to more redundant
> > > mov instructions.
> > >
> > > A simple example:
> > > https://github.com/majin2020/gcc-test/blob/master/test.c
> > > This is a function in the dhrystone benchmark.
> > >
> > > https://github.com/majin2020/gcc-test/blob/0b08c1a13de9663d7d9aba7539b960ec0607ca24/test.c.299r.sched1
> > > This is a log of the pass 'sched1' When -mtune=rocket but issue_rate
> > > == 2.
> > >
> > > The pipeline is:
> > > ;; | insn | prio |
> > > ;; |  17  |  3   | r142=a0 alu
> > > ;; |  14  |  0   | clobber r136 nothing
> > > ;; |  13  |  0   | clobber a0 nothing
> > > ;; |  18  |  2   | r143=a1 alu
> > > ...
> > > ;; |  12  |  0   | a0=r136 alu
> > > ;; |  15  |  0   | use a0 nothing
> > >
> > > In this log, insn 13 and 14 are much ahead of schedule, which risks
> > > generating redundant mov instructions, which seems unreasonable.
> > >
> > > Therefore, I submit patch again on the basis of the last review
> > > opinions to try to solve this problem.
> > >
> > > https://github.com/majin2020/gcc-test/commit/efcb43e3369e771bde702955048bfe3f501263dd#diff-805031b1be5092a2322852a248d0b0f92eef7cad5784a8209f4dfc6221407457L189
> > > This is the diff log of shed1 after patch is added.
> > >
> > > The new pipeline is:
> > > ;; | insn | prio |
> > > ;; |  17  |  3   | r142=a0 alu
> > > ...
> > > ;; |  10  |  0   | [r144]=r141 alu
> > > ;; |  13  |  0   | clobber a0 nothing
> > > ;; |  14  |  0   | clobber r136 nothing
> > > ;; |  12  |  0   | a0=r136 alu
> > > ;; |  15  |  0   | use a0 nothing
> > >
> > > gcc/ChangeLog:
> > > * haifa-sched.cc (use_or_clobber_starts_range_p): New.
> > > (prune_ready_list): USE or CLOBBER should delay execution
> > > if it starts a new live range.
> > OK for the trunk.  It doesn't look like you have write access and I
> > don't see anything about what testing was done.  Standard practice is
> > to do a bootstrap and regression test on a primary platform such as
> > x86, aarch64, ppc64.
> >
> > I went ahead and did a bootstrap and regression test on x86_64, then
> > pushed this to the trunk.
>
> Unfortunately this patch has triggered a bootstrap comparison failure on
> loongarch64-linux-gnu: https://gcc.gnu.org/PR112497.

It's also causing simple build failures on other targets.  For example
c6x-elf aborts when compiling gcc.c-torture/execute/pr82210 (and others)
with -O2 with that patch applied.

I've reverted it for now.  I'm not going to have time to investigate
this week.


Jeff




Re: [PATCH v2] In the pipeline, USE or CLOBBER should delay execution if it starts a new live range.

2023-11-12 Thread Xi Ruoyao
On Sat, 2023-11-11 at 13:12 -0700, Jeff Law wrote:
> 
> 
> On 8/14/23 05:22, Jin Ma wrote:
> > CLOBBER and USE does not represent real instructions, but in the
> > process of pipeline optimization, they will wait for transmission
> > in ready list like other insns, without considering resource
> > conflicts and cycles. This results in a multi-issue CPU architecture
> > that can be issued at any time if other regular insns have resource
> > conflicts or cannot be launched for other reasons. As a result,
> > its position is advanced in the generated insns sequence, which
> > will affect register allocation and often lead to more redundant
> > mov instructions.
> > 
> > A simple example:
> > https://github.com/majin2020/gcc-test/blob/master/test.c
> > This is a function in the dhrystone benchmark.
> > 
> > https://github.com/majin2020/gcc-test/blob/0b08c1a13de9663d7d9aba7539b960ec0607ca24/test.c.299r.sched1
> > This is a log of the pass 'sched1' When -mtune=rocket but issue_rate
> > == 2.
> > 
> > The pipeline is:
> > ;; | insn | prio |
> > ;; |  17  |  3   | r142=a0 alu
> > ;; |  14  |  0   | clobber r136 nothing
> > ;; |  13  |  0   | clobber a0 nothing
> > ;; |  18  |  2   | r143=a1 alu
> > ...
> > ;; |  12  |  0   | a0=r136 alu
> > ;; |  15  |  0   | use a0 nothing
> > 
> > In this log, insn 13 and 14 are much ahead of schedule, which risks
> > generating
> > redundant mov instructions, which seems unreasonable.
> > 
> > Therefore, I submit patch again on the basis of the last review
> > opinions to try to solve this problem.
> > 
> > https://github.com/majin2020/gcc-test/commit/efcb43e3369e771bde702955048bfe3f501263dd#diff-805031b1be5092a2322852a248d0b0f92eef7cad5784a8209f4dfc6221407457L189
> > This is the diff log of shed1 after patch is added.
> > 
> > The new pipeline is:
> > ;; | insn | prio |
> > ;; |  17  |  3   | r142=a0 alu
> > ...
> > ;; |  10  |  0   | [r144]=r141 alu
> > ;; |  13  |  0   | clobber a0 nothing
> > ;; |  14  |  0   | clobber r136 nothing
> > ;; |  12  |  0   | a0=r136 alu
> > ;; |  15  |  0   | use a0 nothing
> > 
> > gcc/ChangeLog:
> > * haifa-sched.cc (use_or_clobber_starts_range_p): New.
> > (prune_ready_list): USE or CLOBBER should delay execution
> > if it starts a new live range.
> OK for the trunk.  It doesn't look like you have write access and I 
> don't see anything about what testing was done.  Standard practice is
> to 
> do a bootstrap and regression test on a primary platform such as x86, 
> aarch64, ppc64.
> 
> I went ahead and did a bootstrap and regression test on x86_64, then 
> pushed this to the trunk.

Unfortunately this patch has triggered a bootstrap comparison failure on
loongarch64-linux-gnu: https://gcc.gnu.org/PR112497.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


[committed] i386: Remove *stack_protect_set_4s__di alternative that will never match

2023-11-12 Thread Uros Bizjak
The relevant peephole2 will never generate alternative (=m,=&a,0,m) because
operand 1 is not dead before the peephole2 pattern.

gcc/ChangeLog:

* config/i386/i386.md (*stack_protect_set_4s__di):
Remove alternative 0.

Bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 01fc6ecc351..ffd9f2d0381 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -24481,19 +24481,16 @@ (define_insn "*stack_protect_set_4z__di"
(set_attr "length" "24")])
 
 (define_insn "*stack_protect_set_4s__di"
-  [(set (match_operand:PTR 0 "memory_operand" "=m,m")
-   (unspec:PTR [(match_operand:PTR 3 "memory_operand" "m,m")]
+  [(set (match_operand:PTR 0 "memory_operand" "=m")
+   (unspec:PTR [(match_operand:PTR 3 "memory_operand" "m")]
UNSPEC_SP_SET))
-   (set (match_operand:DI 1 "register_operand" "=&a,&r")
-   (sign_extend:DI (match_operand:SI 2 "nonimmediate_operand" "0,rm")))]
+   (set (match_operand:DI 1 "register_operand" "=&r")
+   (sign_extend:DI (match_operand:SI 2 "nonimmediate_operand" "rm")))]
   "TARGET_64BIT && reload_completed"
 {
   output_asm_insn ("mov{}\t{%3, %1|%1, %3}", operands);
   output_asm_insn ("mov{}\t{%1, %0|%0, %1}", operands);
-  if (which_alternative)
-return "movs{lq|x}\t{%2, %1|%1, %2}";
-  else
-return "{cltq|cdqe}";
+  return "movs{lq|x}\t{%2, %1|%1, %2}";
 }
   [(set_attr "type" "multi")
(set_attr "length" "24")])


[PATCH 5/5] Add an aligned_register_operand predicate

2023-11-12 Thread Richard Sandiford
This patch adds a target-independent aligned_register_operand
predicate, for use with register constraints that use filters
to impose an alignment.  The definition deliberately jettisons
some of the historical baggage in general_operand.
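
The two alignment tests the predicate relies on can be sketched in plain C.
This is only an illustration under simplifying assumptions (sizes and
register counts are plain integers rather than poly_ints and rtxes), and
both function names are hypothetical, not part of the patch.

```c
#include <stdbool.h>

/* Subreg case: the byte offset must be a multiple of the operand size.
   For an 8-byte operand inside a 16-byte value, only offsets 0 and 8
   (the first or second half) qualify.  */
bool subreg_offset_aligned_p (unsigned offset_bytes,
                              unsigned operand_size_bytes)
{
  return offset_bytes % operand_size_bytes == 0;
}

/* Hard register case: a value that occupies NREGS consecutive registers
   must start at a register number that is a multiple of NREGS.  */
bool hard_reg_start_aligned_p (unsigned regno, unsigned nregs)
{
  return regno % nregs == 0;
}
```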

gcc/
* common.md (aligned_register_operand): New predicate.
---
 gcc/common.md | 28 
 1 file changed, 28 insertions(+)

diff --git a/gcc/common.md b/gcc/common.md
index 51ecd79786b..91a72bd7731 100644
--- a/gcc/common.md
+++ b/gcc/common.md
@@ -17,6 +17,34 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; .  */
 
+;; This predicate is intended to be paired with register constraints that use
+;; register filters to impose an alignment.  Operands that are aligned via
+;; TARGET_HARD_REGNO_MODE_OK should use normal register_operands instead.
+(define_predicate "aligned_register_operand"
+  (match_code "reg,subreg")
+{
+  /* Require the offset in a non-paradoxical subreg to be naturally aligned.
+ For example, if we have a subreg of something that is double the size of
+ this operand, the offset must select the first or second half of it.  */
+  if (SUBREG_P (op)
+  && multiple_p (SUBREG_BYTE (op), GET_MODE_SIZE (GET_MODE (op))))
+op = SUBREG_REG (op);
+  if (!REG_P (op))
+return false;
+
+  if (HARD_REGISTER_P (op))
+{
+  if (!in_hard_reg_set_p (operand_reg_set, GET_MODE (op), REGNO (op)))
+   return false;
+
+  /* Reject hard registers that would need reloading, so that the reload
+is visible to IRA and to pre-RA optimizers.  */
+  if (REGNO (op) % REG_NREGS (op) != 0)
+   return false;
+}
+  return true;
+})
+
 (define_register_constraint "r" "GENERAL_REGS"
   "Matches any general register.")
 
-- 
2.25.1



[PATCH 4/5] ira: Handle register filters

2023-11-12 Thread Richard Sandiford
This patch makes IRA apply register filters when picking hard registers.
All the new code should be optimised away on targets that don't use
register filters.  On targets that do use them, the new register_filters
bitfield is expected to be only a handful of bits.

Information about register filters is recorded in process_bb_node_lives.
The information isn't really related to liveness, but it's a convenient
point because (a) we've already built the allocno structures and
(b) we've already extracted the insn and preprocessed the constraints.

gcc/
* ira-int.h (ira_allocno): Add a register_filters field.
(ALLOCNO_REGISTER_FILTERS): New macro.
(ALLOCNO_SET_REGISTER_FILTERS): Likewise.
* ira-build.cc (ira_create_allocno): Initialize register_filters.
(create_cap_allocno): Propagate register_filters.
(propagate_allocno_info): Likewise.
(propagate_some_info_from_allocno): Likewise.
* ira-lives.cc (process_register_constraint_filters): New function.
(process_bb_node_lives): Use it to record register filter
information.
* ira-color.cc (assign_hard_reg): Check register filters.
(improve_allocation, fast_allocation): Likewise.
---
 gcc/ira-build.cc |  8 +++
 gcc/ira-color.cc | 10 
 gcc/ira-int.h| 14 +++
 gcc/ira-lives.cc | 61 
 4 files changed, 93 insertions(+)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index 93e46033170..c715a834f12 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -498,6 +498,7 @@ ira_create_allocno (int regno, bool cap_p,
   ALLOCNO_NREFS (a) = 0;
   ALLOCNO_FREQ (a) = 0;
   ALLOCNO_MIGHT_CONFLICT_WITH_PARENT_P (a) = false;
+  ALLOCNO_SET_REGISTER_FILTERS (a, 0);
   ALLOCNO_HARD_REGNO (a) = -1;
   ALLOCNO_CALL_FREQ (a) = 0;
   ALLOCNO_CALLS_CROSSED_NUM (a) = 0;
@@ -902,6 +903,7 @@ create_cap_allocno (ira_allocno_t a)
   ALLOCNO_NREFS (cap) = ALLOCNO_NREFS (a);
   ALLOCNO_FREQ (cap) = ALLOCNO_FREQ (a);
   ALLOCNO_CALL_FREQ (cap) = ALLOCNO_CALL_FREQ (a);
+  ALLOCNO_SET_REGISTER_FILTERS (cap, ALLOCNO_REGISTER_FILTERS (a));
 
   merge_hard_reg_conflicts (a, cap, false);
 
@@ -2064,6 +2066,9 @@ propagate_allocno_info (void)
ALLOCNO_BAD_SPILL_P (parent_a) = false;
  ALLOCNO_NREFS (parent_a) += ALLOCNO_NREFS (a);
  ALLOCNO_FREQ (parent_a) += ALLOCNO_FREQ (a);
+ ALLOCNO_SET_REGISTER_FILTERS (parent_a,
+   ALLOCNO_REGISTER_FILTERS (parent_a)
+   | ALLOCNO_REGISTER_FILTERS (a));
 
  /* If A's allocation can differ from PARENT_A's, we can if necessary
 spill PARENT_A on entry to A's loop and restore it afterwards.
@@ -2465,6 +2470,9 @@ propagate_some_info_from_allocno (ira_allocno_t a, ira_allocno_t from_a)
   ALLOCNO_CROSSED_CALLS_ABIS (a) |= ALLOCNO_CROSSED_CALLS_ABIS (from_a);
   ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (a)
 |= ALLOCNO_CROSSED_CALLS_CLOBBERED_REGS (from_a);
+  ALLOCNO_SET_REGISTER_FILTERS (a,
+   ALLOCNO_REGISTER_FILTERS (from_a)
+   | ALLOCNO_REGISTER_FILTERS (a));
 
   ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (a)
 += ALLOCNO_EXCESS_PRESSURE_POINTS_NUM (from_a);
diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
index f2e8ea34152..214a4f16d3c 100644
--- a/gcc/ira-color.cc
+++ b/gcc/ira-color.cc
@@ -2163,6 +2163,9 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
   if (! check_hard_reg_p (a, hard_regno,
  conflicting_regs, profitable_hard_regs))
continue;
+  if (NUM_REGISTER_FILTERS
+ && !test_register_filters (ALLOCNO_REGISTER_FILTERS (a), hard_regno))
+   continue;
   cost = costs[i];
   full_cost = full_costs[i];
   if (!HONOR_REG_ALLOC_ORDER)
@@ -3205,6 +3208,9 @@ improve_allocation (void)
  if (! check_hard_reg_p (a, hregno,
  conflicting_regs, profitable_hard_regs))
continue;
+ if (NUM_REGISTER_FILTERS
+ && !test_register_filters (ALLOCNO_REGISTER_FILTERS (a), hregno))
+   continue;
  ira_assert (ira_class_hard_reg_index[aclass][hregno] == j);
  k = allocno_costs == NULL ? 0 : j;
  costs[hregno] = (allocno_costs == NULL
@@ -5275,6 +5281,10 @@ fast_allocation (void)
  || (TEST_HARD_REG_BIT
  (ira_prohibited_class_mode_regs[aclass][mode], hard_regno)))
continue;
+ if (NUM_REGISTER_FILTERS
+ && !test_register_filters (ALLOCNO_REGISTER_FILTERS (a),
+hard_regno))
+   continue;
  if (costs == NULL)
{
  best_hard_regno = hard_regno;
diff --git a/gcc/ira-int.h b/gcc/ira-int.h
index 0685e1f4e8d..1c3548df4ea 100644
--- a/gcc/ira-int.h
+++ b/gcc/ira-int.h
@@ -328,6 +328,13 @@ struct ira_allocno
 
  This is only ever true for 

[PATCH 3/5] lra: Handle register filters

2023-11-12 Thread Richard Sandiford
This patch makes LRA apply register filters.  This plus the recog
change is enough for correct code generation, but a follow-on IRA
patch improves the allocation.

All the new code should be optimised away on targets that don't
use register filters.  That's because get_register_filter just
wraps "return nullptr" on those targets.

gcc/
* lra-constraints.cc (process_alt_operands): Check register filters.
---
 gcc/lra-constraints.cc | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 0607c8be7cb..9b6a2af5b75 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -2149,6 +2149,7 @@ process_alt_operands (int only_alternative)
   int reload_nregs, reload_sum;
   bool costly_p;
   enum reg_class cl;
+  const HARD_REG_SET *cl_filter;
 
   /* Calculate some data common for all alternatives to speed up the
  function. */
@@ -2514,6 +2515,7 @@ process_alt_operands (int only_alternative)
  || spilled_pseudo_p (op))
win = true;
  cl = GENERAL_REGS;
+ cl_filter = nullptr;
  goto reg;
 
default:
@@ -2523,7 +2525,10 @@ process_alt_operands (int only_alternative)
case CT_REGISTER:
  cl = reg_class_for_constraint (cn);
  if (cl != NO_REGS)
-   goto reg;
+   {
+ cl_filter = get_register_filter (cn);
+ goto reg;
+   }
  break;
 
case CT_CONST_INT:
@@ -2567,6 +2572,7 @@ process_alt_operands (int only_alternative)
win = true;
  cl = base_reg_class (VOIDmode, ADDR_SPACE_GENERIC,
   ADDRESS, SCRATCH);
+ cl_filter = nullptr;
  badop = false;
  goto reg;
 
@@ -2600,6 +2606,8 @@ process_alt_operands (int only_alternative)
this_alternative_exclude_start_hard_regs
  |= ira_exclude_class_mode_regs[cl][mode];
  this_alternative_set |= reg_class_contents[cl];
+ if (cl_filter)
+   this_alternative_exclude_start_hard_regs |= ~*cl_filter;
  if (costly_p)
{
  this_costly_alternative
@@ -2613,6 +2621,9 @@ process_alt_operands (int only_alternative)
  if (hard_regno[nop] >= 0
  && in_hard_reg_set_p (this_alternative_set,
mode, hard_regno[nop])
+ && (!cl_filter
+ || TEST_HARD_REG_BIT (*cl_filter,
+   hard_regno[nop]))
  && ((REG_ATTRS (op) && (decl = REG_EXPR (op)) != NULL
   && VAR_P (decl) && DECL_HARD_REGISTER (decl))
  || !(TEST_HARD_REG_BIT
-- 
2.25.1



[PATCH 2/5] recog: Handle register filters

2023-11-12 Thread Richard Sandiford
The main (but simplest) part of this patch makes constrain_operands
take register filters into account.

The rest of the patch adds register filter information to
operand_alternative.  Generally, if two register constraints
have different register filters, it's better if they're in separate
alternatives.  However, the syntax doesn't enforce that, and we can't
assert it due to inline asms.  So it's a choice between (a) adding
code to enforce consistent filters or (b) dealing with mixes of filters
in a conservatively correct way (in the sense of not allowing invalid
operands).  The latter seems much easier.

The patch therefore adds a mask of the filters that apply
to at least one constraint in a given operand alternative.
A register is OK if it passes all of the filters in the mask.
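
A rough C sketch of that conservative rule follows.  It is not the code
from the patch: the two filter sets, the 32-register file and both function
names are assumptions chosen for illustration, with a 32-bit word standing
in for HARD_REG_SET.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per hard register; a stand-in for HARD_REG_SET.  */
typedef uint32_t reg_set;

/* Hypothetical filters: id 0 allows even registers, id 1 allows
   registers 0-7.  */
static const reg_set filter_sets[2] = { 0x55555555u, 0x000000ffu };

/* Accumulate the filter ids seen while scanning the constraints of one
   operand alternative.  A negative id means "no filter".  */
unsigned add_constraint_filter (unsigned mask, int filter_id)
{
  if (filter_id >= 0)
    mask |= 1U << filter_id;
  return mask;
}

/* REGNO is accepted only if it passes every filter in MASK.  This can
   reject a register that one of the mixed constraints would allow, but
   it never accepts a register that some filtered constraint forbids.  */
bool regno_ok_for_alternative (unsigned mask, unsigned regno)
{
  for (unsigned id = 0; id < 2; id++)
    if (((mask >> id) & 1) && !((filter_sets[id] >> regno) & 1))
      return false;
  return true;
}
```

With both filters in the mask, register 10 is rejected even though the
"even" constraint alone would accept it; that is the price of not tracking
which constraint a given filter came from.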

gcc/
* recog.h (operand_alternative): Add a register_filters field.
(alternative_register_filters): New function.
* recog.cc (preprocess_constraints): Calculate the filters field.
(constrain_operands): Check register filters.
---
 gcc/recog.cc | 14 --
 gcc/recog.h  | 24 ++--
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/gcc/recog.cc b/gcc/recog.cc
index 3bd2d73c259..eaab79c25d7 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -2857,6 +2857,7 @@ preprocess_constraints (int n_operands, int n_alternatives,
   for (j = 0; j < n_alternatives; j++, op_alt += n_operands)
{
  op_alt[i].cl = NO_REGS;
+ op_alt[i].register_filters = 0;
  op_alt[i].constraint = p;
  op_alt[i].matches = -1;
  op_alt[i].matched = -1;
@@ -2919,7 +2920,12 @@ preprocess_constraints (int n_operands, int n_alternatives,
case CT_REGISTER:
  cl = reg_class_for_constraint (cn);
  if (cl != NO_REGS)
-   op_alt[i].cl = reg_class_subunion[op_alt[i].cl][cl];
+   {
+ op_alt[i].cl = reg_class_subunion[op_alt[i].cl][cl];
+ auto filter_id = get_register_filter_id (cn);
+ if (filter_id >= 0)
+   op_alt[i].register_filters |= 1U << filter_id;
+   }
  break;
 
case CT_CONST_INT:
@@ -3219,13 +3225,17 @@ constrain_operands (int strict, alternative_mask alternatives)
  enum reg_class cl = reg_class_for_constraint (cn);
  if (cl != NO_REGS)
{
+ auto *filter = get_register_filter (cn);
  if (strict < 0
  || (strict == 0
  && REG_P (op)
  && REGNO (op) >= FIRST_PSEUDO_REGISTER)
  || (strict == 0 && GET_CODE (op) == SCRATCH)
  || (REG_P (op)
- && reg_fits_class_p (op, cl, offset, mode)))
+ && reg_fits_class_p (op, cl, offset, mode)
+ && (!filter
+ || TEST_HARD_REG_BIT (*filter,
+   REGNO (op) + offset))))
win = true;
}
 
diff --git a/gcc/recog.h b/gcc/recog.h
index c6ef619c5dd..5c801e7bb81 100644
--- a/gcc/recog.h
+++ b/gcc/recog.h
@@ -42,6 +42,7 @@ enum op_type {
   OP_INOUT
 };
 
+#ifndef GENERATOR_FILE
 struct operand_alternative
 {
   /* Pointer to the beginning of the constraint string for this alternative,
@@ -62,6 +63,11 @@ struct operand_alternative
  matches this one.  */
   int matched : 8;
 
+  /* Bit ID is set if the constraint string includes a register constraint with
+ register filter ID.  Use test_register_filters (REGISTER_FILTERS, REGNO)
+ to test whether REGNO is a valid start register for the operand.  */
+  unsigned int register_filters : MAX (NUM_REGISTER_FILTERS, 1);
+
   /* Nonzero if '&' was found in the constraint string.  */
   unsigned int earlyclobber : 1;
   /* Nonzero if TARGET_MEM_CONSTRAINT was found in the constraint
@@ -72,8 +78,6 @@ struct operand_alternative
   /* Nonzero if 'X' was found in the constraint string, or if the constraint
  string for this alternative was empty.  */
   unsigned int anything_ok : 1;
-
-  unsigned int unused : 12;
 };
 
 /* Return the class for operand I of alternative ALT, taking matching
@@ -85,6 +89,18 @@ alternative_class (const operand_alternative *alt, int i)
   return alt[i].matches >= 0 ? alt[alt[i].matches].cl : alt[i].cl;
 }
 
+/* Return the mask of register filters that should be applied to operand I
+   of alternative ALT, taking matching constraints into account.  */
+
+inline unsigned int
+alternative_register_filters (const operand_alternative *alt, int i)
+{
+  return (alt[i].matches >= 0
+ ? alt[alt[i].matches].register_filters
+ : alt[i].register_filters);
+}
+#

[PATCH 1/5] Add register filter operand to define_register_constraint

2023-11-12 Thread Richard Sandiford
The main way of enforcing registers to be aligned is through
HARD_REGNO_MODE_OK.  But this is a global property that applies
to all operands.  A given (regno, mode) pair is either globally
valid or globally invalid.

This patch instead adds a way of specifying that individual operands
must be aligned.  More generally, it allows constraints to specify
a C++ condition that the operand's REGNO must satisfy.  The condition
must be invariant for a given set of target options, so that it can
be precomputed and cached as a HARD_REG_SET.

This information will be used in very compile-time-sensitive
parts of the compiler.  A lot of the complication is in allowing
the information to be stored and tested without much memory cost,
and without impacting targets that don't use the feature.

Specifically:

- Constraints are encouraged to test the absolute REGNO rather than
  an offset from the start of the containing class.  For example,
  all constraints for even registers should use the same condition,
  such as "regno % 2 == 0".  This requires the classes to start at
  even register boundaries, but that's already an implicit
  requirement due to things like the ira-costs.cc code that begins:

  /* Some targets allow pseudos to be allocated to unaligned sequences
 of hard registers.  However, selecting an unaligned sequence can
 unnecessarily restrict later allocations.  So increase the cost of
 unaligned hard regs to encourage the use of aligned hard regs.  */

- Each unique condition is given a "filter identifier".

- The total number of filters is given by NUM_REGISTER_FILTERS,
  defined automatically in insn-config.h.  Structures can therefore use
  a bitfield of NUM_REGISTER_FILTERS to represent a mask of filters.

- There is a new target global, target_constraints, that caches the
  HARD_REG_SET for each filter.

- There is a function for looking up the HARD_REG_SET filter for a given
  constraint and one for looking up the filter id.  Both simply return
  a constant on targets that don't use the feature.

- There are functions for testing a register against a specific filter,
  or against a mask of filters.
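
The caching scheme in the list above can be sketched in a few lines of C.
This is a simplified model rather than the generated code: the register
count, the two example conditions and all names below are assumptions, and
a 32-bit word stands in for HARD_REG_SET.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_HARD_REGS 32  /* hypothetical size of the register file */
#define NUM_FILTERS 2     /* stand-in for NUM_REGISTER_FILTERS */

/* One bit per hard register; a stand-in for HARD_REG_SET.  */
typedef uint32_t reg_set;

/* Filter conditions, written in terms of the absolute register number.  */
static bool even_regno_p (unsigned regno) { return regno % 2 == 0; }
static bool quad_regno_p (unsigned regno) { return regno % 4 == 0; }

/* Each unique condition gets a filter id: its index in this table.  */
static bool (*const filter_conds[NUM_FILTERS]) (unsigned)
  = { even_regno_p, quad_regno_p };

/* Cached register set for each filter id.  */
static reg_set filter_cache[NUM_FILTERS];

/* Evaluate every condition once per register and cache the result, so
   that later tests are a single bit lookup.  */
void init_filter_cache (void)
{
  for (unsigned id = 0; id < NUM_FILTERS; id++)
    {
      filter_cache[id] = 0;
      for (unsigned regno = 0; regno < NUM_HARD_REGS; regno++)
        if (filter_conds[id] (regno))
          filter_cache[id] |= (reg_set) 1 << regno;
    }
}

/* Test REGNO against the cached set for filter ID.  */
bool test_filter (unsigned id, unsigned regno)
{
  return (filter_cache[id] >> regno) & 1;
}
```

Because the conditions must be invariant for a given set of target options,
running init_filter_cache once per option set is enough in this model.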

This patch just adds the information.  Later ones make use of it.

gcc/
* rtl.def (DEFINE_REGISTER_CONSTRAINT): Add an optional filter
operand.
* doc/md.texi (define_register_constraint): Document it.
* doc/tm.texi.in: Reference it in discussion about aligned registers.
* doc/tm.texi: Regenerate.
* gensupport.h (register_filters, get_register_filter_id): Declare.
* gensupport.cc (register_filter_map, register_filters): New variables.
(get_register_filter_id): New function.
(process_define_register_constraint): Likewise.
(process_rtx): Pass define_register_constraints to
process_define_register_constraint.
* genconfig.cc (main): Emit a definition of NUM_REGISTER_FILTERS.
* genpreds.cc (constraint_data): Add a filter field.
(add_constraint): Update accordingly.
(process_define_register_constraint): Pass the filter operand.
(write_init_reg_class_start_regs): New function.
(write_get_register_filter): Likewise.
(write_get_register_filter_id): Likewise.
(write_tm_preds_h): Write a definition of target_constraints,
plus helpers to test its contents.  Write the get_register_filter*
functions.
(write_insn_preds_c): Write init_reg_class_start_regs.
* reginfo.cc (init_reg_class_start_regs): Declare.
(init_reg_sets): Call it.
* target-globals.h (this_target_constraints): Declare.
(target_globals): Add a constraints field.
(restore_target_globals): Update accordingly.
* target-globals.cc: Include tm_p.h.
(default_target_globals): Initialize the constraints field.
(save_target_globals): Handle the constraints field.
(target_globals::~target_globals): Likewise.
---
 gcc/doc/md.texi   |  41 +++-
 gcc/doc/tm.texi   |   3 +-
 gcc/doc/tm.texi.in|   3 +-
 gcc/genconfig.cc  |   2 +
 gcc/genpreds.cc   | 146 +-
 gcc/gensupport.cc |  48 +-
 gcc/gensupport.h  |   3 +
 gcc/reginfo.cc|   5 ++
 gcc/rtl.def   |   6 +-
 gcc/target-globals.cc |   6 +-
 gcc/target-globals.h  |   3 +
 11 files changed, 254 insertions(+), 12 deletions(-)

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 5d86152e5dd..dc8c068646d 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -4513,8 +4513,8 @@ Register constraints correspond directly to register classes.
 @xref{Register Classes}.  There is thus not much flexibility in their
 definitions.
 
-@deffn {MD Expression} define_register_constraint name regclass docstring
-All three arguments are string constants.
+@deffn {MD Expression} define_register_constraint name regclass docstring [filter]
+All arguments are string constants

[PATCH 0/5] Add support for operand-specific alignment requirements

2023-11-12 Thread Richard Sandiford
SME has various instructions that require aligned register tuples.
However, the associated tuple modes are already widely used and do
not need to be aligned in other contexts.  It therefore isn't
appropriate to force alignment in TARGET_HARD_REGNO_MODE_OK.

There are also strided loads and stores that require:

- (regno & 0x8) == 0 for 2-register tuples
- (regno & 0xc) == 0 for 4-register tuples

Although the requirements for strided loads and stores could be
enforced by C++ conditions on the insn, it's convenient to handle
them in the same way as alignment.
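
As a quick illustration of what the two strided conditions accept, here is
a small C sketch.  The function names are hypothetical, and regno is taken
to be the index of the first vector register in the tuple, which is an
assumption about numbering rather than something stated above.

```c
#include <stdbool.h>

/* 2-register strided tuples: bit 3 of the start register must be clear,
   which leaves start registers 0-7 and 16-23.  */
bool strided_pair_start_ok (unsigned regno)
{
  return (regno & 0x8) == 0;
}

/* 4-register strided tuples: bits 2 and 3 of the start register must be
   clear, which leaves start registers 0-3 and 16-19.  */
bool strided_quad_start_ok (unsigned regno)
{
  return (regno & 0xc) == 0;
}
```

Neither set is a plain "multiple of N" alignment, which is why a general
per-constraint condition on the start register is more convenient than an
alignment-only mechanism.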

This series of patches therefore adds a way for register constraints
to specify which start registers are valid and which aren't.  Most of
the details are in the covering note to the first patch.

This is clearly changing a performance-sensitive part of the compiler.
I've tried to ensure that the overhead is only small for targets that
use the new feature.  Almost all of the new code gets optimised away
on targets that don't use the feature.

Richard Sandiford (5):
  Add register filter operand to define_register_constraint
  recog: Handle register filters
  lra: Handle register filters
  ira: Handle register filters
  Add an aligned_register_operand predicate

 gcc/common.md  |  28 
 gcc/doc/md.texi|  41 +++-
 gcc/doc/tm.texi|   3 +-
 gcc/doc/tm.texi.in |   3 +-
 gcc/genconfig.cc   |   2 +
 gcc/genpreds.cc| 146 -
 gcc/gensupport.cc  |  48 +-
 gcc/gensupport.h   |   3 +
 gcc/ira-build.cc   |   8 +++
 gcc/ira-color.cc   |  10 +++
 gcc/ira-int.h  |  14 
 gcc/ira-lives.cc   |  61 +
 gcc/lra-constraints.cc |  13 +++-
 gcc/recog.cc   |  14 +++-
 gcc/recog.h|  24 ++-
 gcc/reginfo.cc |   5 ++
 gcc/rtl.def|   6 +-
 gcc/target-globals.cc  |   6 +-
 gcc/target-globals.h   |   3 +
 19 files changed, 421 insertions(+), 17 deletions(-)

-- 
2.25.1



[PATCH v1] RISC-V: Support FP l/ll round and rint HF mode autovec

2023-11-12 Thread pan2 . li
From: Pan Li 

This patch adds auto-vectorization support for the following FP APIs,
whose source and result types differ in size:

+------------+-----------+----------+
| API        | RV64      | RV32     |
+------------+-----------+----------+
| lrintf16   | HF => DI  | HF => SI |
| llrintf16  | HF => DI  | HF => DI |
| lroundf16  | HF => DI  | HF => SI |
| llroundf16 | HF => DI  | HF => DI |
+------------+-----------+----------+

Given the code below:
void
test_lrintf16 (long *out, _Float16 *in, int count)
{
  for (unsigned i = 0; i < count; i++)
    out[i] = __builtin_lrintf16 (in[i]);
}

Before this patch:
.L3:
  lhu     a5,0(s0)
  addi    s0,s0,2
  addi    s1,s1,8
  fmv.s.x fa0,a5
  call    lrintf16
  sd      a0,-8(s1)
  bne     s0,s2,.L3

After this patch:
.L3:
  vsetvli      a5,a2,e16,mf4,ta,ma
  vle16.v      v1,0(a1)
  vfwcvt.f.f.v v2,v1
  vsetvli      zero,zero,e32,mf2,ta,ma
  vfwcvt.x.f.v v1,v2
  vse64.v      v1,0(a0)
  slli         a4,a5,1
  add          a1,a1,a4
  slli         a4,a5,3
  add          a0,a0,a4
  sub          a2,a2,a5
  bne          a2,zero,.L3

gcc/ChangeLog:

* config/riscv/autovec.md: Add bridge mode to lrint and lround
pattern.
* config/riscv/riscv-protos.h (expand_vec_lrint): Add new arg
bridge machine mode.
(expand_vec_lround): Ditto.
* config/riscv/riscv-v.cc (emit_vec_widden_cvt_f_f): New helper
func impl to emit vfwcvt.f.f.
(emit_vec_rounding_to_integer): Handle the HF to DI rounding
with the bridge mode.
(expand_vec_lrint): Reorder the args.
(expand_vec_lround): Ditto.
(expand_vec_lceil): Ditto.
(expand_vec_lfloor): Ditto.
* config/riscv/vector-iterators.md: Add vector HFmode and bridge
mode for converting to DI.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/unop/math-llrintf16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-llroundf16-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-lrintf16-rv32-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-lrintf16-rv64-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv32-0.c: New test.
* gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv64-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-llrintf16-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-llroundf16-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-lrintf16-rv32-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-lrintf16-rv64-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-lroundf16-rv32-0.c: New test.
* gcc.target/riscv/rvv/autovec/vls/math-lroundf16-rv64-0.c: New test.

Signed-off-by: Pan Li 
---
 gcc/config/riscv/autovec.md   | 17 ++--
 gcc/config/riscv/riscv-protos.h   |  4 +-
 gcc/config/riscv/riscv-v.cc   | 51 
 gcc/config/riscv/vector-iterators.md  | 82 ++-
 .../riscv/rvv/autovec/unop/math-llrintf16-0.c | 14 
 .../rvv/autovec/unop/math-llroundf16-0.c  | 21 +
 .../rvv/autovec/unop/math-lrintf16-rv32-0.c   | 13 +++
 .../rvv/autovec/unop/math-lrintf16-rv64-0.c   | 15 
 .../rvv/autovec/unop/math-lroundf16-rv32-0.c  | 18 
 .../rvv/autovec/unop/math-lroundf16-rv64-0.c  | 20 +
 .../riscv/rvv/autovec/vls/math-llrintf16-0.c  | 28 +++
 .../riscv/rvv/autovec/vls/math-llroundf16-0.c | 28 +++
 .../rvv/autovec/vls/math-lrintf16-rv32-0.c| 27 ++
 .../rvv/autovec/vls/math-lrintf16-rv64-0.c| 28 +++
 .../rvv/autovec/vls/math-lroundf16-rv32-0.c   | 27 ++
 .../rvv/autovec/vls/math-lroundf16-rv64-0.c   | 28 +++
 16 files changed, 397 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-llrintf16-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-llroundf16-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-lrintf16-rv32-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-lrintf16-rv64-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv32-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv64-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-llrintf16-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-llroundf16-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-lrintf16-rv32-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-lrintf16-rv64-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-lroundf16-rv32-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/math-lroundf16-rv64-0.c

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index 868b47c8af7..80e41af6334 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -2455,14 +2455,13 @@ (d

[PATCH v3] DSE: Allow vector type for get_stored_val when read < store

2023-11-12 Thread pan2 . li
From: Pan Li 

Update in v3:
* Take known_le instead of known_lt for vector size.
* Return NULL_RTX when the gap may be non-zero and is not constant.

Update in v2:
* Move vector type support to get_stored_val.

Original log:

This patch allows vector modes in get_stored_val in DSE. Reusing
the stored value is valid for the read rtx if and only if the
read bitsize is no greater than the stored bitsize.

Given the example code below, compiled with
--param=riscv-autovec-preference=fixed-vlmax:

vuint8m1_t test () {
  uint8_t arr[32] = {
    1, 2, 7, 1, 3, 4, 5, 3, 1, 0, 1, 2, 4, 4, 9, 9,
    1, 2, 7, 1, 3, 4, 5, 3, 1, 0, 1, 2, 4, 4, 9, 9,
  };

  return __riscv_vle8_v_u8m1(arr, 32);
}

Before this patch:
test:
  lui       a5,%hi(.LANCHOR0)
  addi      sp,sp,-32
  addi      a5,a5,%lo(.LANCHOR0)
  li        a3,32
  vl2re64.v v2,0(a5)
  vsetvli   zero,a3,e8,m1,ta,ma
  vs2r.v    v2,0(sp) <== Unnecessary store to stack
  vle8.v    v1,0(sp) <== Ditto
  vs1r.v    v1,0(a0)
  addi      sp,sp,32
  jr        ra

After this patch:
test:
  lui       a5,%hi(.LANCHOR0)
  addi      a5,a5,%lo(.LANCHOR0)
  li        a4,32
  addi      sp,sp,-32
  vsetvli   zero,a4,e8,m1,ta,ma
  vle8.v    v1,0(a5)
  vs1r.v    v1,0(a0)
  addi      sp,sp,32
  jr        ra
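
To make the v3 rules easier to follow, here is a deliberately simplified C++ model of the decision described in this log. It is only a sketch under assumptions: the real code works on poly_int64 values and machine modes, whereas this model uses plain integers, and the names Gap, Forwarding and classify_read are hypothetical. It also assumes the vector size check applies only when the gap is zero, which the truncated diff below does not show.

```cpp
#include <cassert>

// Hypothetical stand-in for a poly_int64 gap: either a known constant number
// of bytes, or a value that is not a compile-time constant.
struct Gap {
  bool is_constant;
  long bytes;  // only meaningful when is_constant is true
};

// Possible outcomes in this toy model (names are illustrative only).
enum class Forwarding {
  Direct,       // the stored value can be reused for the read as-is
  ShiftNeeded,  // constant non-zero gap: a shift sequence must be tried
  GiveUp        // corresponds to returning NULL_RTX
};

Forwarding classify_read(Gap gap, long read_bits, long store_bits) {
  // v3: a gap that may be non-zero and is not constant cannot be handled.
  if (!gap.is_constant)
    return Forwarding::GiveUp;
  // A constant non-zero gap goes down the existing shift path.
  if (gap.bytes != 0)
    return Forwarding::ShiftNeeded;
  // v3: "less than or equal" (known_le) rather than "strictly less than".
  return read_bits <= store_bits ? Forwarding::Direct : Forwarding::GiveUp;
}
```

In the example above the read is narrower than the store and starts at the same offset, so this model would classify it as Direct, which matches the removal of the stack round trip in the generated code.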

The following tests pass with this patch:
* The risc-v regression test.

The following tests are still running with this patch:
* The x86 bootstrap and regression test.
* The aarch64 regression test.

PR target/111720

gcc/ChangeLog:

* dse.cc (get_stored_val): Allow vector mode if read size is
less than or equal to stored size.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/float-point-dynamic-frm-54.c: Adjust
the asm checker.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-57.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-58.c: Ditto.
* gcc.target/riscv/rvv/base/pr111720-0.c: New test.
* gcc.target/riscv/rvv/base/pr111720-1.c: New test.
* gcc.target/riscv/rvv/base/pr111720-10.c: New test.
* gcc.target/riscv/rvv/base/pr111720-2.c: New test.
* gcc.target/riscv/rvv/base/pr111720-3.c: New test.
* gcc.target/riscv/rvv/base/pr111720-4.c: New test.
* gcc.target/riscv/rvv/base/pr111720-5.c: New test.
* gcc.target/riscv/rvv/base/pr111720-6.c: New test.
* gcc.target/riscv/rvv/base/pr111720-7.c: New test.
* gcc.target/riscv/rvv/base/pr111720-8.c: New test.
* gcc.target/riscv/rvv/base/pr111720-9.c: New test.

Signed-off-by: Pan Li 
---
 gcc/dse.cc|  9 +++-
 .../rvv/base/float-point-dynamic-frm-54.c |  2 +-
 .../rvv/base/float-point-dynamic-frm-57.c |  2 +-
 .../rvv/base/float-point-dynamic-frm-58.c |  2 +-
 .../gcc.target/riscv/rvv/base/pr111720-0.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-1.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-10.c   | 18 
 .../gcc.target/riscv/rvv/base/pr111720-2.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-3.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-4.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-5.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-6.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-7.c| 21 +++
 .../gcc.target/riscv/rvv/base/pr111720-8.c| 18 
 .../gcc.target/riscv/rvv/base/pr111720-9.c| 15 +
 15 files changed, 209 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-0.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-1.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-10.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-3.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-4.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-5.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-6.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-7.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-8.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr111720-9.c

diff --git a/gcc/dse.cc b/gcc/dse.cc
index 1a85dae1f8c..40c4c29d07e 100644
--- a/gcc/dse.cc
+++ b/gcc/dse.cc
@@ -1900,8 +1900,11 @@ get_stored_val (store_info *store_info, machine_mode 
read_mode,
   else
 gap = read_offset - store_info->offset;
 
-  if (gap.is_constant () && maybe_ne (gap, 0))
+  if (maybe_ne (gap, 0))
 {
+  if (!gap.is_constant ())
+   return NULL_RTX;
+
   poly_int64 shift = gap * BITS_PER_UNIT;
   poly_int64 access_size = GET_MODE_SIZE (read_mode) + gap;
   read_reg = find_shift_sequence (access_size, store_info, read_mode,
@@ -1940,6 +1943,10 @@ get_stored_val (store_info *stor

Re: [PATCH 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread Lehua Ding

Hi Vladimir,

As you start your review, please look at the v3 version, which fixes 
some ICE issues. Thanks.


https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636178.html

On 2023/11/12 20:01, Lehua Ding wrote:

Hi Vladimir,

On 2023/11/10 4:24, Vladimir Makarov wrote:


On 11/7/23 22:47, Lehua Ding wrote:


Lehua Ding (7):
   ira: Refactor the handling of register conflicts to make it more
 general
   ira: Add live_subreg problem and apply to ira pass
   ira: Support subreg live range track
   ira: Support subreg copy
   ira: Add all nregs >= 2 pseudos to tracke subreg list
   lra: Apply live_subreg df_problem to lra pass
   lra: Support subreg live range track and conflict detect

Thank you very much for addressing subreg RA.  It is a big piece of work.  I 
wanted to address this a long time ago but had no time to do it 
myself.


I tried to evaluate your patches on x86-64 (i7-9700k) release mode 
GCC. I used -O3 for SPEC2017 compilation.


Here are the results:

              baseline    baseline(+patches)
specint2017:  8.51     vs 8.58     (+0.8%)
specfp2017:   21.1     vs 21.1     (+0%)
compile time: 2426.41s vs 2580.58s (+6.4%)

Spec2017 average code size change: -0.07%

Improving specint by 0.8% is impressive to me.

Unfortunately, it is achieved by decreasing compilation speed by 6.4% 
(although on a smaller benchmark I saw only a 3% slowdown). I don't know 
how, but we should mitigate this speed degradation.  Maybe we can find 
a hot spot in the new code (but I think it is not the linear search 
pointed out by Richard Biener, as the object vectors most probably contain 
1-2 elements) and this code spot can be improved, or we could use this 
only for -O3/fast, or the code can be function or target dependent.


I also find GCC consumes more memory with the patches. Maybe it can 
be improved too (although I am not sure about this).


Thanks for the specint performance data. I'll do my best to get the 
compile time and memory issues fixed. I'm very curious to know if the 
way used to solve the subreg coalesce problem makes sense to you?


I'll start reviewing the patches next week.  I don't expect 
that I'll find anything serious enough to reject the patches, but again we 
should work on mitigating the compilation speed problem.  We can 
file a new PR for this and resolve the problem during the release cycle.




--
Best,
Lehua (RiVAI)
lehua.d...@rivai.ai


[PATCH V3 7/7] lra: Support subreg live range track and conflict detect

2023-11-12 Thread Lehua Ding
This patch supports tracking the liveness of a subreg in the lra pass, with the
goal of making it agree with ira's register allocation scheme. There is some
duplication; in the future this part of the logic could perhaps be unified.

gcc/ChangeLog:

* ira-build.cc (setup_pseudos_has_subreg_object):
Collect new data for lra to use.
(ira_build): Ditto.
* lra-assigns.cc (set_offset_conflicts): New function.
(setup_live_pseudos_and_spill_after_risky_transforms): Adjust.
(lra_assign): Ditto.
* lra-constraints.cc (process_alt_operands): Ditto.
* lra-int.h (GCC_LRA_INT_H): Ditto.
(struct lra_live_range): Ditto.
(struct lra_insn_reg): Ditto.
(get_range_hard_regs): New.
(get_nregs): New.
(has_subreg_object_p): New.
* lra-lives.cc (INCLUDE_VECTOR): Adjust.
(lra_live_range_pool): Ditto.
(create_live_range): Ditto.
(lra_merge_live_ranges): Ditto.
(update_pseudo_point): Ditto.
(mark_regno_live): Ditto.
(mark_regno_dead): Ditto.
(process_bb_lives): Ditto.
(remove_some_program_points_and_update_live_ranges): Ditto.
(lra_print_live_range_list): Ditto.
(class subreg_live_item): New.
(create_subregs_live_ranges): New.
(lra_create_live_ranges_1): Ditto.
* lra.cc (get_range_blocks): Ditto.
(get_range_hard_regs): Ditto.
(new_insn_reg): Ditto.
(collect_non_operand_hard_regs): Ditto.
(initialize_lra_reg_info_element): Ditto.
(reg_same_range_p): New.
(add_regs_to_insn_regno_info): Adjust.

---
 gcc/ira-build.cc   |  31 
 gcc/lra-assigns.cc | 111 --
 gcc/lra-constraints.cc |  18 ++-
 gcc/lra-int.h  |  31 
 gcc/lra-lives.cc   | 340 ++---
 gcc/lra.cc | 139 +++--
 6 files changed, 585 insertions(+), 85 deletions(-)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index f88aaef..bb29627d375 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -95,6 +95,9 @@ int ira_copies_num;
basic block.  */
 static int last_basic_block_before_change;
 
+/* Record these pseudos which has subreg object. Used by LRA pass.  */
+bitmap_head pseudos_has_subreg_object;
+
 /* Initialize some members in loop tree node NODE.  Use LOOP_NUM for
the member loop_num.  */
 static void
@@ -3711,6 +3714,33 @@ update_conflict_hard_reg_costs (void)
 }
 }
 
+/* Setup speudos_has_subreg_object.  */
+static void
+setup_pseudos_has_subreg_object ()
+{
+  bitmap_initialize (&pseudos_has_subreg_object, &reg_obstack);
+  ira_allocno_t a;
+  ira_allocno_iterator ai;
+  FOR_EACH_ALLOCNO (a, ai)
+if (has_subreg_object_p (a))
+  {
+   bitmap_set_bit (&pseudos_has_subreg_object, ALLOCNO_REGNO (a));
+   if (ira_dump_file != NULL)
+ {
+   fprintf (ira_dump_file,
+"  a%d(r%d, nregs: %d) has subreg objects:\n",
+ALLOCNO_NUM (a), ALLOCNO_REGNO (a), ALLOCNO_NREGS (a));
+   ira_allocno_object_iterator oi;
+   ira_object_t obj;
+   FOR_EACH_ALLOCNO_OBJECT (a, obj, oi)
+ fprintf (ira_dump_file, "object %d: start: %d, nregs: %d\n",
+  OBJECT_INDEX (obj), OBJECT_START (obj),
+  OBJECT_NREGS (obj));
+   fprintf (ira_dump_file, "\n");
+ }
+  }
+}
+
 /* Create a internal representation (IR) for IRA (allocnos, copies,
loop tree nodes).  The function returns TRUE if we generate loop
structure (besides nodes representing all function and the basic
@@ -3731,6 +3761,7 @@ ira_build (void)
   create_allocnos ();
   ira_costs ();
   create_allocno_objects ();
+  setup_pseudos_has_subreg_object ();
   ira_create_allocno_live_ranges ();
   remove_unnecessary_regions (false);
   ira_compress_allocno_live_ranges ();
diff --git a/gcc/lra-assigns.cc b/gcc/lra-assigns.cc
index d2ebcfd5056..6588a740162 100644
--- a/gcc/lra-assigns.cc
+++ b/gcc/lra-assigns.cc
@@ -1131,6 +1131,52 @@ assign_hard_regno (int hard_regno, int regno)
 /* Array used for sorting different pseudos.  */
 static int *sorted_pseudos;
 
+/* The detail conflict offsets If two live ranges conflict. Use to record
+   partail conflict.  */
+static bitmap_head live_range_conflicts;
+
+/* Set the conflict offset of the two registers REGNO1 and REGNO2. Use the
+   regno with bigger nregs as the base.  */
+static void
+set_offset_conflicts (int regno1, int regno2)
+{
+  gcc_assert (reg_renumber[regno1] >= 0 && reg_renumber[regno2] >= 0);
+  int nregs1 = get_nregs (regno1);
+  int nregs2 = get_nregs (regno2);
+  if (nregs1 < nregs2)
+{
+  std::swap (nregs1, nregs2);
+  std::swap (regno1, regno2);
+}
+
+  lra_live_range_t r1 = lra_reg_info[regno1].live_ranges;
+  lra_live_range_t r2 = lra_reg_info[regno2].live_ranges;
+  int total = nregs1;
+
+  bitmap_clear (&live_range_confli

[PATCH V3 5/7] ira: Add all nregs >= 2 pseudos to tracke subreg list

2023-11-12 Thread Lehua Ding
This patch relaxes subreg tracking so that it covers all pseudos that occupy
two or more hard registers.

gcc/ChangeLog:

* ira-build.cc (get_reg_unit_size): New.
(has_same_nregs): New.
(ira_set_allocno_class): Adjust.

---
 gcc/ira-build.cc | 41 -
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index 13f0f7336ed..f88aaef 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -607,6 +607,37 @@ ira_create_allocno (int regno, bool cap_p,
   return a;
 }
 
+/* Return single register size of allocno A.  */
+static poly_int64
+get_reg_unit_size (ira_allocno_t a)
+{
+  enum reg_class aclass = ALLOCNO_CLASS (a);
+  gcc_assert (aclass != NO_REGS);
+  machine_mode mode = ALLOCNO_MODE (a);
+  int nregs = ALLOCNO_NREGS (a);
+  poly_int64 block_size = REGMODE_NATURAL_SIZE (mode);
+  int nblocks = get_nblocks (mode);
+  gcc_assert (nblocks % nregs == 0);
+  return block_size * (nblocks / nregs);
+}
+
+/* Return true if TARGET_CLASS_MAX_NREGS and TARGET_HARD_REGNO_NREGS results is
+   same. It should be noted that some targets may not implement these two very
+   uniformly, and need to be debugged step by step. For example, in V3x1DI mode
+   in AArch64, TARGET_CLASS_MAX_NREGS returns 2 but TARGET_HARD_REGNO_NREGS
+   returns 3. They are in conflict and need to be repaired in the Hook of
+   AArch64.  */
+static bool
+has_same_nregs (ira_allocno_t a)
+{
+  for (int i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+if (REGNO_REG_CLASS (i) != NO_REGS
+   && reg_class_subset_p (REGNO_REG_CLASS (i), ALLOCNO_CLASS (a))
+   && ALLOCNO_NREGS (a) != hard_regno_nregs (i, ALLOCNO_MODE (a)))
+  return false;
+  return true;
+}
+
 /* Set up register class for A and update its conflict hard
registers.  */
 void
@@ -624,12 +655,12 @@ ira_set_allocno_class (ira_allocno_t a, enum reg_class 
aclass)
 
   if (aclass == NO_REGS)
 return;
-  /* SET the unit_size of one register.  */
-  machine_mode mode = ALLOCNO_MODE (a);
-  int nregs = ira_reg_class_max_nregs[aclass][mode];
-  if (nregs == 2 && maybe_eq (GET_MODE_SIZE (mode), nregs * UNITS_PER_WORD))
+  gcc_assert (!ALLOCNO_TRACK_SUBREG_P (a));
+  /* Set unit size and track_subreg_p flag for pseudo which need occupied multi
+ hard regs.  */
+  if (ALLOCNO_NREGS (a) > 1 && has_same_nregs (a))
 {
-  ALLOCNO_UNIT_SIZE (a) = UNITS_PER_WORD;
+  ALLOCNO_UNIT_SIZE (a) = get_reg_unit_size (a);
   ALLOCNO_TRACK_SUBREG_P (a) = true;
   return;
 }
-- 
2.36.3



[PATCH V3 3/7] ira: Support subreg live range track

2023-11-12 Thread Lehua Ding
This patch supports tracking subreg liveness. It first extends
ira_object_t objects[2] to std::vector<ira_object_t> objects,
which can hold an arbitrary number of objects and is used to collect
every access via subreg in the program, along with the partial_in and
partial_out sets of each basic block's live in/out.

It then changes the way conflicts between registers are detected.
For example, if object a conflicts with object b, then the offset and
size of each object relative to the allocno it belongs to must be
taken into account when computing the conflicting registers between
the two allocnos.
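
The offset-and-size reasoning in the paragraph above can be sketched with a small C++ model. This is an illustration under assumptions, not the code in this patch: hard registers are plain integers rather than a HARD_REG_SET, and Part, ranges_overlap and forbidden_starts are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of an object: a run of NREGS registers beginning START
// registers into the allocno it belongs to.
struct Part {
  int start;
  int nregs;
};

// True if hard register ranges [a_first, a_first + a_count) and
// [b_first, b_first + b_count) share at least one register.
bool ranges_overlap(int a_first, int a_count, int b_first, int b_count) {
  return a_first < b_first + b_count && b_first < a_first + a_count;
}

// Object A_PART of allocno A conflicts with object B_PART of allocno B, and
// B has already been placed at hard register B_HARD_START.  Return every
// starting hard register in [0, NUM_HARD_REGS) that A must not use, i.e.
// every start that would make the two conflicting parts overlap.
std::vector<int> forbidden_starts(Part a_part, Part b_part, int b_hard_start,
                                  int num_hard_regs) {
  std::vector<int> result;
  for (int a_start = 0; a_start < num_hard_regs; ++a_start)
    if (ranges_overlap(a_start + a_part.start, a_part.nregs,
                       b_hard_start + b_part.start, b_part.nregs))
      result.push_back(a_start);
  return result;
}
```

For a two-register allocno B placed at hard register 100, a conflict with only its second object (start 1, nregs 1) forbids just register 101 for a one-register allocno A, whereas a whole-allocno conflict (start 0, nregs 2) forbids both 100 and 101. That difference is what leaves room for the overlap this series is after.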

gcc/ChangeLog:

* hard-reg-set.h (struct HARD_REG_SET): New shift operator.
* ira-build.cc (ira_create_object): Adjust.
(find_object): New.
(find_object_anyway): New.
(ira_create_allocno): Adjust.
(get_range): New.
(ira_copy_allocno_objects): New.
(merge_hard_reg_conflicts): Adjust copy.
(create_cap_allocno): Adjust.
(find_subreg_p): New.
(add_subregs): New.
(create_insn_allocnos): Collect subreg.
(create_bb_allocnos): Ditto.
(move_allocno_live_ranges): Adjust.
(copy_allocno_live_ranges): Adjust.
(setup_min_max_allocno_live_range_point): Adjust.
* ira-color.cc (INCLUDE_MAP): include map.
(setup_left_conflict_sizes_p): Adjust conflict size.
(setup_profitable_hard_regs): Adjust.
(get_conflict_and_start_profitable_regs): Adjust.
(check_hard_reg_p): Adjust conflict check.
(assign_hard_reg): Adjust.
(push_allocno_to_stack): Adjust conflict size.
(improve_allocation): Adjust.
* ira-conflicts.cc (record_object_conflict): Simplify.
(build_object_conflicts): Adjust.
(build_conflicts): Adjust.
(print_allocno_conflicts): Adjust.
* ira-emit.cc (modify_move_list): Adjust.
* ira-int.h (struct ira_object): Adjust struct.
(struct ira_allocno): Adjust struct.
(ALLOCNO_NUM_OBJECTS): New accessor.
(ALLOCNO_UNIT_SIZE): Ditto.
(ALLOCNO_TRACK_SUBREG_P): Ditto.
(ALLOCNO_NREGS): Ditto.
(OBJECT_SUBWORD): Ditto.
(OBJECT_INDEX): Ditto.
(OBJECT_START): Ditto.
(OBJECT_NREGS): Ditto.
(find_object): Exported.
(find_object_anyway): Ditto.
(ira_copy_allocno_objects): Ditto.
(has_subreg_object_p): Ditto.
(get_full_object): Ditto.
* ira-lives.cc (INCLUDE_VECTOR): Include vector.
(add_onflict_hard_regs): New.
(add_onflict_hard_reg): New.
(make_hard_regno_dead): Adjust.
(make_object_live): Adjust.
(update_allocno_pressure_excess_length): Adjust.
(make_object_dead): Adjust.
(mark_pseudo_regno_live): Adjust.
(add_subreg_point): New.
(mark_pseudo_object_live): Adjust.
(mark_pseudo_regno_subword_live): Adjust.
(mark_pseudo_regno_subreg_live): Adjust.
(mark_pseudo_regno_subregs_live): Adjust.
(mark_pseudo_reg_live): Adjust.
(mark_pseudo_regno_dead): Adjust.
(mark_pseudo_object_dead): Adjust.
(mark_pseudo_regno_subword_dead): Adjust.
(mark_pseudo_regno_subreg_dead): Adjust.
(mark_pseudo_reg_dead): Adjust.
(process_single_reg_class_operands): Adjust.
(process_out_of_region_eh_regs): Adjust.
(add_conflict_from_region_landing_pads): Adjust.
(process_bb_node_lives): Adjust.
(class subreg_live_item): New class.
(create_subregs_live_ranges): New function.
(ira_create_allocno_live_ranges): Adjust.
* ira.cc (check_allocation): Adjust.

---
 gcc/hard-reg-set.h   |  33 +++
 gcc/ira-build.cc | 235 +---
 gcc/ira-color.cc | 302 +-
 gcc/ira-conflicts.cc |  48 ++---
 gcc/ira-emit.cc  |   2 +-
 gcc/ira-int.h|  57 -
 gcc/ira-lives.cc | 500 ---
 gcc/ira.cc   |  52 ++---
 8 files changed, 907 insertions(+), 322 deletions(-)

diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
index b0bb9bce074..760eadba186 100644
--- a/gcc/hard-reg-set.h
+++ b/gcc/hard-reg-set.h
@@ -113,6 +113,39 @@ struct HARD_REG_SET
 return !operator== (other);
   }
 
+  HARD_REG_SET
+  operator>> (unsigned int shift_amount) const
+  {
+if (shift_amount == 0)
+  return *this;
+
+HARD_REG_SET res;
+unsigned int total_bits = sizeof (HARD_REG_ELT_TYPE) * 8;
+if (shift_amount >= total_bits)
+  {
+   unsigned int n_elt = shift_amount % total_bits;
+   shift_amount -= n_elt * total_bits;
+   for (unsigned int i = 0; i < ARRAY_SIZE (elts) - n_elt - 1; i += 1)
+ res.elts[i] = elts[i + n_elt];
+   /* clear upper n_elt elements.  */
+   for (unsigned int i = 0; i < n_elt; i += 1)
+ res.elts[ARRAY_SIZE (elts) - 1 - i] = 0;
+  }
+
+if (shift_amount > 0)
+  {
+   /* The left bits of an element be
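
The hunk above is cut off in this archive. As a rough, standalone illustration of what a right shift over a bit set stored in several words involves, here is a minimal C++ sketch. It does not reproduce the patch's HARD_REG_SET operator; the function name shift_right, the use of std::array and the 64-bit element type are all assumptions made for the example. Element 0 is taken to hold bits 0..63.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Shift a bit set stored as N 64-bit words right by SHIFT bits, so that bit
// number B in the input becomes bit number B - SHIFT in the output.  Bits
// shifted below position 0 are dropped and vacated high bits become zero.
template <std::size_t N>
std::array<std::uint64_t, N> shift_right(
    const std::array<std::uint64_t, N>& in, unsigned shift) {
  constexpr unsigned kBits = 64;
  std::array<std::uint64_t, N> out{};            // zero-initialized
  const std::size_t word_shift = shift / kBits;  // whole words to skip
  const unsigned bit_shift = shift % kBits;      // remaining bits
  for (std::size_t i = 0; i + word_shift < N; ++i) {
    // Low part comes from the source word itself.
    std::uint64_t low = in[i + word_shift] >> bit_shift;
    // High part is carried in from the next source word, if there is one.
    // The bit_shift != 0 check avoids an undefined 64-bit shift.
    std::uint64_t high = 0;
    if (bit_shift != 0 && i + word_shift + 1 < N)
      high = in[i + word_shift + 1] << (kBits - bit_shift);
    out[i] = low | high;
  }
  return out;
}
```

The design choice here is to split the shift amount into a whole-word part (quotient) and a within-word part (remainder), and to handle both in one loop so that no separate clearing pass over the upper words is needed.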

[PATCH V3 6/7] lra: Switch to live_subreg data flow

2023-11-12 Thread Lehua Ding
This patch switches lra from the live_reg data to the live_subreg data.
The situation is more complicated than in ira because lra also
modifies this data, so the live_subreg data has to be
recalculated.

gcc/ChangeLog:

* lra-coalesce.cc (update_live_info):
Adjust to new live subreg data.
(lra_coalesce): Ditto.
* lra-constraints.cc (update_ebb_live_info): Ditto.
(get_live_on_other_edges): Ditto.
(inherit_in_ebb): Ditto.
(lra_inheritance): Ditto.
(fix_bb_live_info): Ditto.
(remove_inheritance_pseudos): Ditto.
* lra-int.h (GCC_LRA_INT_H): Ditto.
* lra-lives.cc (class bb_data_pseudos): Ditto.
(make_hard_regno_live): Ditto.
(make_hard_regno_dead): Ditto.
(mark_regno_live): Ditto.
(mark_regno_dead): Ditto.
(live_trans_fun): Ditto.
(live_con_fun_0): Ditto.
(live_con_fun_n): Ditto.
(initiate_live_solver): Ditto.
(finish_live_solver): Ditto.
(process_bb_lives): Ditto.
(lra_create_live_ranges_1): Ditto.
* lra-remat.cc (dump_candidates_and_remat_bb_data): Ditto.
(calculate_livein_cands): Ditto.
(do_remat): Ditto.
* lra-spills.cc (spill_pseudos): Ditto.

---
 gcc/lra-coalesce.cc|  20 ++-
 gcc/lra-constraints.cc |  93 +---
 gcc/lra-int.h  |   2 +
 gcc/lra-lives.cc   | 328 -
 gcc/lra-remat.cc   |  13 +-
 gcc/lra-spills.cc  |  22 ++-
 6 files changed, 374 insertions(+), 104 deletions(-)

diff --git a/gcc/lra-coalesce.cc b/gcc/lra-coalesce.cc
index 04a5bbd714b..abfc54f1cc2 100644
--- a/gcc/lra-coalesce.cc
+++ b/gcc/lra-coalesce.cc
@@ -188,19 +188,25 @@ static bitmap_head used_pseudos_bitmap;
 /* Set up USED_PSEUDOS_BITMAP, and update LR_BITMAP (a BB live info
bitmap).  */
 static void
-update_live_info (bitmap lr_bitmap)
+update_live_info (bitmap all, bitmap full, bitmap partial)
 {
   unsigned int j;
   bitmap_iterator bi;
 
   bitmap_clear (&used_pseudos_bitmap);
-  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
+  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, all,
FIRST_PSEUDO_REGISTER, j, bi)
 bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
   if (! bitmap_empty_p (&used_pseudos_bitmap))
 {
-  bitmap_and_compl_into (lr_bitmap, &coalesced_pseudos_bitmap);
-  bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
+  bitmap_and_compl_into (all, &coalesced_pseudos_bitmap);
+  bitmap_ior_into (all, &used_pseudos_bitmap);
+
+  bitmap_and_compl_into (full, &coalesced_pseudos_bitmap);
+  bitmap_ior_and_compl_into (full, &used_pseudos_bitmap, partial);
+
+  bitmap_and_compl_into (partial, &coalesced_pseudos_bitmap);
+  bitmap_ior_and_compl_into (partial, &used_pseudos_bitmap, full);
 }
 }
 
@@ -303,8 +309,10 @@ lra_coalesce (void)
   bitmap_initialize (&used_pseudos_bitmap, &reg_obstack);
   FOR_EACH_BB_FN (bb, cfun)
 {
-  update_live_info (df_get_live_in (bb));
-  update_live_info (df_get_live_out (bb));
+  update_live_info (DF_LIVE_SUBREG_IN (bb), DF_LIVE_SUBREG_FULL_IN (bb),
+   DF_LIVE_SUBREG_PARTIAL_IN (bb));
+  update_live_info (DF_LIVE_SUBREG_OUT (bb), DF_LIVE_SUBREG_FULL_OUT (bb),
+   DF_LIVE_SUBREG_PARTIAL_OUT (bb));
   FOR_BB_INSNS_SAFE (bb, insn, next)
if (INSN_P (insn)
&& bitmap_bit_p (&involved_insns_bitmap, INSN_UID (insn)))
diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 0607c8be7cb..c3ad846b97b 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -6571,34 +6571,75 @@ update_ebb_live_info (rtx_insn *head, rtx_insn *tail)
{
  if (prev_bb != NULL)
{
- /* Update df_get_live_in (prev_bb):  */
+ /* Update subreg live (prev_bb):  */
+ bitmap subreg_all_in = DF_LIVE_SUBREG_IN (prev_bb);
+ bitmap subreg_full_in = DF_LIVE_SUBREG_FULL_IN (prev_bb);
+ bitmap subreg_partial_in = DF_LIVE_SUBREG_PARTIAL_IN (prev_bb);
+ subregs_live *range_in = DF_LIVE_SUBREG_RANGE_IN (prev_bb);
  EXECUTE_IF_SET_IN_BITMAP (&check_only_regs, 0, j, bi)
if (bitmap_bit_p (&live_regs, j))
- bitmap_set_bit (df_get_live_in (prev_bb), j);
-   else
- bitmap_clear_bit (df_get_live_in (prev_bb), j);
+ {
+   bitmap_set_bit (subreg_all_in, j);
+   bitmap_set_bit (subreg_full_in, j);
+   if (bitmap_bit_p (subreg_partial_in, j))
+ {
+   bitmap_clear_bit (subreg_partial_in, j);
+   range_in->remove_live (j);
+ }
+ }
+   else if (bitmap_bit_p (subreg_all_in, j))
+ {
+   bi

Re: [PATCH V2 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread Lehua Ding
A new bug was found in these patches, so I have resent them as v3. Sorry 
about this.


V3: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636178.html

On 2023/11/12 17:58, Lehua Ding wrote:

Hi,

These patches add support for subreg coalescing in the
register allocation passes (ira and lra).

Let's consider a RISC-V program (https://godbolt.org/z/ec51d91aT):

```
#include <riscv_vector.h>

void
foo (int32_t *in, int32_t *out, size_t m)
{
   vint32m2_t result = __riscv_vle32_v_i32m2 (in, 32);
   vint32m1_t v0 = __riscv_vget_v_i32m2_i32m1 (result, 0);
   vint32m1_t v1 = __riscv_vget_v_i32m2_i32m1 (result, 1);
   for (size_t i = 0; i < m; i++)
 {
   v0 = __riscv_vadd_vv_i32m1(v0, v0, 4);
   v1 = __riscv_vmul_vv_i32m1(v1, v1, 4);
 }
   *(vint32m1_t*)(out+4*0) = v0;
   *(vint32m1_t*)(out+4*1) = v1;
}
```

Before these patches:

```
foo:
        li      a5,32
        vsetvli zero,a5,e32,m2,ta,ma
        vle32.v v4,0(a0)
        vmv1r.v v2,v4
        vmv1r.v v1,v5
        beq     a2,zero,.L2
        li      a5,0
        vsetivli        zero,4,e32,m1,ta,ma
.L3:
        addi    a5,a5,1
        vadd.vv v2,v2,v2
        vmul.vv v1,v1,v1
        bne     a2,a5,.L3
.L2:
        vs1r.v  v2,0(a1)
        addi    a1,a1,16
        vs1r.v  v1,0(a1)
        ret
```

After these patches:

```
foo:
        li      a5,32
        vsetvli zero,a5,e32,m2,ta,ma
        vle32.v v2,0(a0)
        beq     a2,zero,.L2
        li      a5,0
        vsetivli        zero,4,e32,m1,ta,ma
.L3:
        addi    a5,a5,1
        vadd.vv v2,v2,v2
        vmul.vv v3,v3,v3
        bne     a2,a5,.L3
.L2:
        vs1r.v  v2,0(a1)
        addi    a1,a1,16
        vs1r.v  v3,0(a1)
        ret
```

As you can see, the two redundant vmv1r.v instructions are removed.
They appear because the current ira pass is conservative when
calculating the live range of pseudo registers that occupy multiple
hard registers. Consider the two RTL instructions below, where r134
occupies two physical registers and r135 and r136 each occupy one.
At insn 12, ira considers the entire r134 pseudo register to be
live, so r135 conflicts with r134, as shown in the ira dump. Then,
when physical registers are allocated, r135 and r136 are allocated
first because they are inside the loop body and have higher
priority. This makes it difficult to assign r136 to overlap with
r134, i.e., to assign r136 to hr100, which would eliminate the need
for the vmv1r.v instruction. Thus two vmv1r.v instructions appear.

If we refine the live information of r134 down to each subreg, we
can remove this conflict. We can then create copies for sets that
reference a subreg, which raises the priority of allocating r134 and
lets registers with larger alignment requirements be allocated
physical registers first. In RVV, pseudo registers occupying two
physical registers must be aligned to a multiple of two.

```
(insn 11 10 12 2 (set (reg/v:RVVM1SI 135 [ v0 ])
 (subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) 0)) 
"/app/example.c":7:19 998 {*movrvvm1si_whole}
  (nil))
(insn 12 11 13 2 (set (reg/v:RVVM1SI 136 [ v1 ])
 (subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) [16, 16])) 
"/app/example.c":8:19 998 {*movrvvm1si_whole}
  (expr_list:REG_DEAD (reg/v:RVVM2SI 134 [ result ])
 (nil)))
```
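
To see why per-subreg liveness removes the conflict, the two insns above can be modelled with half-open live intervals over program points. The following C++ sketch is only a toy model: LiveRange and overlaps are hypothetical names, and the program points used in the note after the code (10 for the definition of r134, 20 for the end of the loop) are made up for illustration; only insns 11 and 12 come from the RTL.

```cpp
#include <cassert>

// Hypothetical live interval over program points: the value is live in
// [start, end).  A register that is last read at point P and another that is
// first written at point P therefore do not overlap.
struct LiveRange {
  int start;
  int end;
};

// Two values conflict when their live intervals share a program point.
bool overlaps(LiveRange a, LiveRange b) {
  return a.start < b.end && b.start < a.end;
}
```

Tracking r134 as a whole, it is live over [10, 12) because its last use is insn 12, while r135 is live from its definition at insn 11, say over [11, 20); the two intervals overlap, which is the conflict in the dump below. Tracking each half separately, the low half of r134 is live only over [10, 11), so it no longer overlaps r135, and the high half over [10, 12) does not overlap r136 over [12, 20). Each of r135 and r136 can then share a hard register with the half of r134 it was copied from.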

ira dump:

;; a1(r136,l0) conflicts: a3(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a3(r135,l0) conflicts: a1(r136,l0) a6(r134,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a6(r134,l0) conflicts: a3(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:
;;
;; ...
   Popping a1(r135,l0)  -- assign reg 97
   Popping a3(r136,l0)  -- assign reg 98
   Popping a4(r137,l0)  -- assign reg 15
   Popping a5(r140,l0)  -- assign reg 12
   Popping a10(r145,l0)  -- assign reg 12
   Popping a2(r139,l0)  -- assign reg 11
   Popping a9(r144,l0)  -- assign reg 11
   Popping a0(r142,l0)  -- assign reg 11
   Popping a6(r134,l0)  -- assign reg 100
   Popping a7(r143,l0)  -- assign reg 10
   Popping a8(r141,l0)  -- assign reg 15

The AArch64 SVE has the same problem. Consider the following
code (https://godbolt.org/z/MYrK7Ghaj):

```
#include <arm_sve.h>

int bar (svbool_t pg, int64_t* base, int n, int64_t *in1, int64_t *in2, 
int64_t*out)
{
   svint64x4_t result = svld4_s64 (pg, base);
   svint64_t v0 = svget4_s64(result, 0);
   svint64_t v1 = svget4_s64(result, 1);
   svint64_t v2 = svget4_s64(result, 2);
   svint64_t v3 = svget4_s64(result, 3);

   for (int i = 0; i < n; i += 1)
 {
 svint64_t v18 = svld1_s64(pg, in1);
 svint64_t v19 = svld1_s64(pg, in2);
 v0 = svmad_s64_z(pg, v0, v18, v19);
 v1 = svmad_s64_z(pg, v1, v18, v19);
 v2 = svmad_s64_z(pg, v2, v18, v19);
 v3 = svmad_s64_z(pg, v3, v18, v1

[PATCH V3 4/7] ira: Support subreg copy

2023-11-12 Thread Lehua Ding
This patch changes copy creation so that a copy now connects two objects 
rather than two allocnos.

gcc/ChangeLog:

* ira-build.cc (find_allocno_copy): Removed.
(find_object): New.
(ira_create_copy): Adjust.
(add_allocno_copy_to_list): Adjust.
(swap_allocno_copy_ends_if_necessary): Adjust.
(ira_add_allocno_copy): Adjust.
(print_copy): Adjust.
(print_allocno_copies): Adjust.
(ira_flattening): Adjust.
* ira-color.cc (INCLUDE_VECTOR): Include vector.
(struct allocno_color_data): Adjust.
(struct allocno_hard_regs_subnode): Adjust.
(form_allocno_hard_regs_nodes_forest): Adjust.
(update_left_conflict_sizes_p): Adjust.
(struct update_cost_queue_elem): Adjust.
(queue_update_cost): Adjust.
(get_next_update_cost): Adjust.
(update_costs_from_allocno): Adjust.
(update_conflict_hard_regno_costs): Adjust.
(assign_hard_reg): Adjust.
(objects_conflict_by_live_ranges_p): New.
(allocno_thread_conflict_p): Adjust.
(object_thread_conflict_p): Ditto.
(merge_threads): Ditto.
(form_threads_from_copies): Ditto.
(form_threads_from_bucket): Ditto.
(form_threads_from_colorable_allocno): Ditto.
(init_allocno_threads): Ditto.
(add_allocno_to_bucket): Ditto.
(delete_allocno_from_bucket): Ditto.
(allocno_copy_cost_saving): Ditto.
(color_allocnos): Ditto.
(color_pass): Ditto.
(update_curr_costs): Ditto.
(coalesce_allocnos): Ditto.
(ira_reuse_stack_slot): Ditto.
(ira_initiate_assign): Ditto.
(ira_finish_assign): Ditto.
* ira-conflicts.cc (allocnos_conflict_for_copy_p): Ditto.
(REG_SUBREG_P): Ditto.
(subreg_move_p): New.
(regs_non_conflict_for_copy_p): New.
(subreg_reg_align_and_times_p): New.
(process_regs_for_copy): Ditto.
(add_insn_allocno_copies): Ditto.
(propagate_copies): Ditto.
* ira-emit.cc (add_range_and_copies_from_move_list): Ditto.
* ira-int.h (struct ira_allocno_copy): Ditto.
(ira_add_allocno_copy): Ditto.
(find_object): Exported.
(subreg_move_p): Exported.
* ira.cc (print_redundant_copies): Exported.

---
 gcc/ira-build.cc | 154 +++-
 gcc/ira-color.cc | 541 +++
 gcc/ira-conflicts.cc | 173 +++---
 gcc/ira-emit.cc  |  10 +-
 gcc/ira-int.h|  10 +-
 gcc/ira.cc   |   5 +-
 6 files changed, 646 insertions(+), 247 deletions(-)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index a32693e69e4..13f0f7336ed 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -36,9 +36,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "subreg-live-range.h"
 
-static ira_copy_t find_allocno_copy (ira_allocno_t, ira_allocno_t, rtx_insn *,
-ira_loop_tree_node_t);
-
 /* The root of the loop tree corresponding to the all function.  */
 ira_loop_tree_node_t ira_loop_tree_root;
 
@@ -520,6 +517,16 @@ find_object (ira_allocno_t a, poly_int64 offset, 
poly_int64 size)
   return find_object (a, subreg_start, subreg_nregs);
 }
 
+/* Return object in allocno A for REG.  */
+ira_object_t
+find_object (ira_allocno_t a, rtx reg)
+{
+  if (has_subreg_object_p (a) && read_modify_subreg_p (reg))
+return find_object (a, SUBREG_BYTE (reg), GET_MODE_SIZE (GET_MODE (reg)));
+  else
+return find_object (a, 0, ALLOCNO_NREGS (a));
+}
+
 /* Return the object in allocno A which match START & NREGS.  Create when not
found.  */
 ira_object_t
@@ -1503,27 +1510,36 @@ initiate_copies (void)
 /* Return copy connecting A1 and A2 and originated from INSN of
LOOP_TREE_NODE if any.  */
 static ira_copy_t
-find_allocno_copy (ira_allocno_t a1, ira_allocno_t a2, rtx_insn *insn,
+find_allocno_copy (ira_object_t obj1, ira_object_t obj2, rtx_insn *insn,
   ira_loop_tree_node_t loop_tree_node)
 {
   ira_copy_t cp, next_cp;
-  ira_allocno_t another_a;
+  ira_object_t another_obj;
 
+  ira_allocno_t a1 = OBJECT_ALLOCNO (obj1);
   for (cp = ALLOCNO_COPIES (a1); cp != NULL; cp = next_cp)
 {
-  if (cp->first == a1)
+  ira_allocno_t first_a = OBJECT_ALLOCNO (cp->first);
+  ira_allocno_t second_a = OBJECT_ALLOCNO (cp->second);
+  if (first_a == a1)
{
  next_cp = cp->next_first_allocno_copy;
- another_a = cp->second;
+ if (cp->first == obj1)
+   another_obj = cp->second;
+ else
+   continue;
}
-  else if (cp->second == a1)
+  else if (second_a == a1)
{
  next_cp = cp->next_second_allocno_copy;
- another_a = cp->first;
+ if (cp->second == obj1)
+   another_obj = cp->first;
+ else
+   continue;
}
   else
gcc_unreachable ();
-  if (another_a == a2 && c

[PATCH V3 1/7] df: Add DF_LIVE_SUBREG problem

2023-11-12 Thread Lehua Ding
This patch adds a live_subreg problem that extends the original live_reg
problem to track the liveness of subregs. We only try to track pseudo
registers whose mode size is a multiple of the natural size, and only a small
portion of these will end up being accessed through a subreg. In addition to
what the live_reg problem provides, the live_subreg problem has the following
output. full_in/out mean the entire pseudo is live in/out, partial_in/out mean
some subregs of the pseudo are live in/out, and range_in/out indicate which
parts of the pseudo are live. all_in/out is the union of full_in/out and
partial_in/out:

  bitmap_head all_in, full_in;
  bitmap_head all_out, full_out;
  bitmap_head partial_in;
  bitmap_head partial_out;
  subregs_live *range_in = NULL;
  subregs_live *range_out = NULL;
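
The relationship between these sets can be sketched as follows. This is only an illustration under simplifying assumptions: `LiveSets`, `all_live` and `need_track` are hypothetical names, `std::set` stands in for `bitmap_head`, and plain ints stand in for `poly_int64`.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the per-block sets described above; the real
// patch uses bitmap_head and a subregs_live class instead of std::set.
struct LiveSets {
  std::set<int> full;     // pseudos whose whole register is live
  std::set<int> partial;  // pseudos with only some subregs live
};

// all_in/all_out is described as the union of the full and partial sets.
std::set<int> all_live(const LiveSets &s) {
  std::set<int> all = s.full;
  all.insert(s.partial.begin(), s.partial.end());
  return all;
}

// Simplified form of the tracking condition: a pseudo whose mode is larger
// than, and a whole multiple of, the natural register size.  Sizes are
// assumed positive; first_pseudo is an assumed parameter for this sketch.
bool need_track(int regno, int total_size, int natural_size, int first_pseudo) {
  return total_size > natural_size
         && total_size % natural_size == 0
         && regno >= first_pseudo;
}
```

With a 16-byte natural size, a 32-byte pseudo would be tracked under this sketch, while a 16-byte or 24-byte one would not.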

gcc/ChangeLog:

* Makefile.in: Add new object file.
* df-problems.cc (struct df_live_subreg_problem_data):
The data of the new live_subreg problem.
(need_track_subreg): New function.
(get_range): Ditto.
(remove_subreg_range): Ditto.
(add_subreg_range): Ditto.
(df_live_subreg_free_bb_info): Ditto.
(df_live_subreg_alloc): Ditto.
(df_live_subreg_reset): Ditto.
(df_live_subreg_bb_local_compute): Ditto.
(df_live_subreg_local_compute): Ditto.
(df_live_subreg_init): Ditto.
(df_live_subreg_check_result): Ditto.
(df_live_subreg_confluence_0): Ditto.
(df_live_subreg_confluence_n): Ditto.
(df_live_subreg_transfer_function): Ditto.
(df_live_subreg_finalize): Ditto.
(df_live_subreg_free): Ditto.
(df_live_subreg_top_dump): Ditto.
(df_live_subreg_bottom_dump): Ditto.
(df_live_subreg_add_problem): Ditto.
* df.h (enum df_problem_id): Add live_subreg id.
(DF_LIVE_SUBREG_INFO): Data accessor.
(DF_LIVE_SUBREG_IN): Ditto.
(DF_LIVE_SUBREG_OUT): Ditto.
(DF_LIVE_SUBREG_FULL_IN): Ditto.
(DF_LIVE_SUBREG_FULL_OUT): Ditto.
(DF_LIVE_SUBREG_PARTIAL_IN): Ditto.
(DF_LIVE_SUBREG_PARTIAL_OUT): Ditto.
(DF_LIVE_SUBREG_RANGE_IN): Ditto.
(DF_LIVE_SUBREG_RANGE_OUT): Ditto.
(class subregs_live): New class.
(class basic_block_subreg_live_info): Ditto.
(class df_live_subreg_bb_info): Ditto.
(df_live_subreg): Ditto.
(df_live_subreg_add_problem): Ditto.
(df_live_subreg_finalize): Ditto.
(class subreg_range): Ditto.
(need_track_subreg): Ditto.
(remove_subreg_range): Ditto.
(add_subreg_range): Ditto.
(df_live_subreg_get_bb_info): Ditto.
* regs.h (get_nblocks): Helper function.
* timevar.def (TV_DF_LIVE_SUBREG): New timevar.
* subreg-live-range.cc: New file.
* subreg-live-range.h: New file.

---
 gcc/Makefile.in  |   1 +
 gcc/df-problems.cc   | 889 ++-
 gcc/df.h |  67 +++
 gcc/regs.h   |   7 +
 gcc/subreg-live-range.cc | 628 +++
 gcc/subreg-live-range.h  | 333 +++
 gcc/timevar.def  |   1 +
 7 files changed, 1925 insertions(+), 1 deletion(-)
 create mode 100644 gcc/subreg-live-range.cc
 create mode 100644 gcc/subreg-live-range.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 29cec21c825..e4403b5a30c 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1675,6 +1675,7 @@ OBJS = \
store-motion.o \
streamer-hooks.o \
stringpool.o \
+subreg-live-range.o \
substring-locations.o \
target-globals.o \
targhooks.o \
diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
index d2cfaf7f50f..2585c762fd1 100644
--- a/gcc/df-problems.cc
+++ b/gcc/df-problems.cc
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "rtl.h"
 #include "df.h"
+#include "subreg-live-range.h"
 #include "memmodel.h"
 #include "tm_p.h"
 #include "insn-config.h"
@@ -1344,8 +1345,894 @@ df_lr_verify_transfer_functions (void)
   bitmap_clear (&all_blocks);
 }
 
+/*
+   REGISTER AND SUBREG LIVES
+   Like DF_RL, but fine-grained tracking of subreg lifecycle.
+   
*/
+
+/* Private data used to verify the solution for this problem.  */
+struct df_live_subreg_problem_data
+{
+  /* An obstack for the bitmaps we need for this problem.  */
+  bitmap_obstack live_subreg_bitmaps;
+  bool has_subreg_live_p;
+};
+
+/* Helper functions */
+
+/* Return true if REGNO is a pseudo and MODE is a multil regs size.  */
+bool
+need_track_subreg (int regno, machine_mode reg_mode)
+{
+  poly_int64 total_size = GET_MODE_SIZE (reg_mode);
+  poly_int64 natural_size = REGMODE_NATURAL_SIZE (reg_mode);
+  return maybe_gt (total_size, natural_size)
+&& multiple_p (total_size, natural_size)
+&& regno >= FIRST_PSEUDO_REGISTER;
+

[PATCH V3 2/7] ira: Switch to live_subreg data

2023-11-12 Thread Lehua Ding
This patch switches the use of live_reg data to live_subreg data.

gcc/ChangeLog:

* ira-build.cc (create_bb_allocnos): Switch.
(create_loop_allocnos): Ditto.
* ira-color.cc (ira_loop_edge_freq): Ditto.
* ira-emit.cc (generate_edge_moves): Ditto.
(add_ranges_and_copies): Ditto.
* ira-lives.cc (process_out_of_region_eh_regs): Ditto.
(add_conflict_from_region_landing_pads): Ditto.
(process_bb_node_lives): Ditto.
* ira.cc (find_moveable_pseudos): Ditto.
(interesting_dest_for_shprep_1): Ditto.
(allocate_initial_values): Ditto.
(ira): Ditto.

---
 gcc/ira-build.cc |  7 ---
 gcc/ira-color.cc |  8 
 gcc/ira-emit.cc  | 12 ++--
 gcc/ira-lives.cc |  7 ---
 gcc/ira.cc   | 16 +---
 5 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index 93e46033170..f931c6e304c 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -1919,7 +1919,8 @@ create_bb_allocnos (ira_loop_tree_node_t bb_node)
   create_insn_allocnos (PATTERN (insn), NULL, false);
   /* It might be a allocno living through from one subloop to
  another.  */
-  EXECUTE_IF_SET_IN_REG_SET (df_get_live_in (bb), FIRST_PSEUDO_REGISTER, i, bi)
+  EXECUTE_IF_SET_IN_REG_SET (DF_LIVE_SUBREG_IN (bb), FIRST_PSEUDO_REGISTER,
+i, bi)
 if (ira_curr_regno_allocno_map[i] == NULL)
   ira_create_allocno (i, false, ira_curr_loop_tree_node);
 }
@@ -1935,9 +1936,9 @@ create_loop_allocnos (edge e)
   bitmap_iterator bi;
   ira_loop_tree_node_t parent;
 
-  live_in_regs = df_get_live_in (e->dest);
+  live_in_regs = DF_LIVE_SUBREG_IN (e->dest);
   border_allocnos = ira_curr_loop_tree_node->border_allocnos;
-  EXECUTE_IF_SET_IN_REG_SET (df_get_live_out (e->src),
+  EXECUTE_IF_SET_IN_REG_SET (DF_LIVE_SUBREG_OUT (e->src),
 FIRST_PSEUDO_REGISTER, i, bi)
 if (bitmap_bit_p (live_in_regs, i))
   {
diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
index f2e8ea34152..4aa3e316282 100644
--- a/gcc/ira-color.cc
+++ b/gcc/ira-color.cc
@@ -2783,8 +2783,8 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int 
regno, bool exit_p)
   FOR_EACH_EDGE (e, ei, loop_node->loop->header->preds)
if (e->src != loop_node->loop->latch
&& (regno < 0
-   || (bitmap_bit_p (df_get_live_out (e->src), regno)
-   && bitmap_bit_p (df_get_live_in (e->dest), regno
+   || (bitmap_bit_p (DF_LIVE_SUBREG_OUT (e->src), regno)
+   && bitmap_bit_p (DF_LIVE_SUBREG_IN (e->dest), regno
  freq += EDGE_FREQUENCY (e);
 }
   else
@@ -2792,8 +2792,8 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int 
regno, bool exit_p)
   auto_vec edges = get_loop_exit_edges (loop_node->loop);
   FOR_EACH_VEC_ELT (edges, i, e)
if (regno < 0
-   || (bitmap_bit_p (df_get_live_out (e->src), regno)
-   && bitmap_bit_p (df_get_live_in (e->dest), regno)))
+   || (bitmap_bit_p (DF_LIVE_SUBREG_OUT (e->src), regno)
+   && bitmap_bit_p (DF_LIVE_SUBREG_IN (e->dest), regno)))
  freq += EDGE_FREQUENCY (e);
 }
 
diff --git a/gcc/ira-emit.cc b/gcc/ira-emit.cc
index bcc4f09f7c4..84ed482e568 100644
--- a/gcc/ira-emit.cc
+++ b/gcc/ira-emit.cc
@@ -510,8 +510,8 @@ generate_edge_moves (edge e)
 return;
   src_map = src_loop_node->regno_allocno_map;
   dest_map = dest_loop_node->regno_allocno_map;
-  regs_live_in_dest = df_get_live_in (e->dest);
-  regs_live_out_src = df_get_live_out (e->src);
+  regs_live_in_dest = DF_LIVE_SUBREG_IN (e->dest);
+  regs_live_out_src = DF_LIVE_SUBREG_OUT (e->src);
   EXECUTE_IF_SET_IN_REG_SET (regs_live_in_dest,
 FIRST_PSEUDO_REGISTER, regno, bi)
 if (bitmap_bit_p (regs_live_out_src, regno))
@@ -1229,16 +1229,16 @@ add_ranges_and_copies (void)
 destination block) to use for searching allocnos by their
 regnos because of subsequent IR flattening.  */
   node = IRA_BB_NODE (bb)->parent;
-  bitmap_copy (live_through, df_get_live_in (bb));
+  bitmap_copy (live_through, DF_LIVE_SUBREG_IN (bb));
   add_range_and_copies_from_move_list
(at_bb_start[bb->index], node, live_through, REG_FREQ_FROM_BB (bb));
-  bitmap_copy (live_through, df_get_live_out (bb));
+  bitmap_copy (live_through, DF_LIVE_SUBREG_OUT (bb));
   add_range_and_copies_from_move_list
(at_bb_end[bb->index], node, live_through, REG_FREQ_FROM_BB (bb));
   FOR_EACH_EDGE (e, ei, bb->succs)
{
- bitmap_and (live_through,
- df_get_live_in (e->dest), df_get_live_out (bb));
+ bitmap_and (live_through, DF_LIVE_SUBREG_IN (e->dest),
+ DF_LIVE_SUBREG_OUT (bb));
  add_range_and_copies_from_move_list
((move_t) e->aux, node, live_through,
 REG_FREQ_FROM_EDGE_FREQ (EDGE_FREQUE

[PATCH V3 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread Lehua Ding
V3 Changes:
  1. fix three ICE.
  2. rebase

Hi,

These patches try to support the subreg coalesce feature in
register allocation passes (ira and lra).

Let's consider a RISC-V program (https://godbolt.org/z/ec51d91aT):

```
#include 

void
foo (int32_t *in, int32_t *out, size_t m)
{
  vint32m2_t result = __riscv_vle32_v_i32m2 (in, 32);
  vint32m1_t v0 = __riscv_vget_v_i32m2_i32m1 (result, 0);
  vint32m1_t v1 = __riscv_vget_v_i32m2_i32m1 (result, 1);
  for (size_t i = 0; i < m; i++)
{
  v0 = __riscv_vadd_vv_i32m1(v0, v0, 4);
  v1 = __riscv_vmul_vv_i32m1(v1, v1, 4);
}
  *(vint32m1_t*)(out+4*0) = v0;
  *(vint32m1_t*)(out+4*1) = v1;
}
```

Before these patches:

```
foo:
li  a5,32
vsetvli zero,a5,e32,m2,ta,ma
vle32.v v4,0(a0)
vmv1r.v v2,v4
vmv1r.v v1,v5
beq a2,zero,.L2
li  a5,0
vsetivli zero,4,e32,m1,ta,ma
.L3:
addi a5,a5,1
vadd.vv v2,v2,v2
vmul.vv v1,v1,v1
bne a2,a5,.L3
.L2:
vs1r.v  v2,0(a1)
addi a1,a1,16
vs1r.v  v1,0(a1)
ret
```

After these patches:

```
foo:
li  a5,32
vsetvli zero,a5,e32,m2,ta,ma
vle32.v v2,0(a0)
beq a2,zero,.L2
li  a5,0
vsetivli zero,4,e32,m1,ta,ma
.L3:
addi a5,a5,1
vadd.vv v2,v2,v2
vmul.vv v3,v3,v3
bne a2,a5,.L3
.L2:
vs1r.v  v2,0(a1)
addi a1,a1,16
vs1r.v  v3,0(a1)
ret
```

As you can see, the two redundant vmv1r.v instructions are removed.
They appear because the current ira pass is conservative when it
calculates the live range of pseudo registers that occupy multiple
hard registers. Consider the two RTL instructions below, where r134
occupies two physical registers and r135 and r136 each occupy one
physical register. At insn 12, ira considers the entire r134 pseudo
register to be live, so r135 conflicts with r134, as shown in the ira
dump info. Then, when the physical registers are allocated, r135 and
r134 are allocated first because they are inside the loop body and
have higher priority. This makes it difficult to assign r136 to
overlap with r134, i.e., to assign r136 to hr100, which would eliminate
the need for the vmv1r.v instruction. Thus two vmv1r.v instructions
appear.

If we refine the live information of r134 down to each subreg,
we can remove this conflict. We can then create copies for sets
with a subreg reference, which raises the priority of the r134 allocation
and lets registers with bigger alignment requirements be allocated
physical registers first. In RVV, pseudo registers occupying
two physical registers need to be aligned to a multiple of 2.

```
(insn 11 10 12 2 (set (reg/v:RVVM1SI 135 [ v0 ])
(subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) 0)) 
"/app/example.c":7:19 998 {*movrvvm1si_whole}
 (nil))
(insn 12 11 13 2 (set (reg/v:RVVM1SI 136 [ v1 ])
(subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) [16, 16])) 
"/app/example.c":8:19 998 {*movrvvm1si_whole}
 (expr_list:REG_DEAD (reg/v:RVVM2SI 134 [ result ])
(nil)))
```
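
To make the conflict argument concrete, here is a small sketch of the two views of r134's liveness. It is not code from the series: the `Range` type, the `overlaps` helper and the program-point numbering are assumptions chosen for illustration (a use at insn N ends a range at N, a definition at insn N starts one at N).

```cpp
#include <cassert>

// Half-open live range [start, end) over program points.  All names here
// are illustrative and do not come from the patch series.
struct Range {
  int start;
  int end;
};

// Two ranges conflict when they share at least one program point.
constexpr bool overlaps(Range a, Range b) {
  return a.start < b.end && b.start < a.end;
}

// Assumed numbering: insn 11 defines r135, insn 12 defines r136.
constexpr Range r134_whole{0, 12};  // untracked: live until its last use
constexpr Range r134_low{0, 11};    // subreg at offset 0, last read by insn 11
constexpr Range r134_high{0, 12};   // second subreg, last read by insn 12
constexpr Range r135{11, 20};       // v0, live into the loop (20 is arbitrary)
constexpr Range r136{12, 20};       // v1, live into the loop

// Whole-register view: r135 conflicts with all of r134.
static_assert(overlaps(r134_whole, r135), "conservative conflict");
// Subreg view: each copy destination may reuse the half it copies from.
static_assert(!overlaps(r134_low, r135), "r135 may share the low half");
static_assert(!overlaps(r134_high, r136), "r136 may share the high half");
```

Under these assumptions r135 still conflicts with the high half of r134, but no longer with the low half it is copied from, which is what allows the copy to be coalesced away.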

ira dump:

;; a1(r136,l0) conflicts: a3(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a3(r135,l0) conflicts: a1(r136,l0) a6(r134,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a6(r134,l0) conflicts: a3(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:
;;
;; ...
  Popping a1(r135,l0)  -- assign reg 97
  Popping a3(r136,l0)  -- assign reg 98
  Popping a4(r137,l0)  -- assign reg 15
  Popping a5(r140,l0)  -- assign reg 12
  Popping a10(r145,l0)  -- assign reg 12
  Popping a2(r139,l0)  -- assign reg 11
  Popping a9(r144,l0)  -- assign reg 11
  Popping a0(r142,l0)  -- assign reg 11
  Popping a6(r134,l0)  -- assign reg 100
  Popping a7(r143,l0)  -- assign reg 10
  Popping a8(r141,l0)  -- assign reg 15

AArch64 SVE has the same problem. Consider the following
code (https://godbolt.org/z/MYrK7Ghaj):

```
#include 

int bar (svbool_t pg, int64_t* base, int n, int64_t *in1, int64_t *in2, 
int64_t*out)
{
  svint64x4_t result = svld4_s64 (pg, base);
  svint64_t v0 = svget4_s64(result, 0);
  svint64_t v1 = svget4_s64(result, 1);
  svint64_t v2 = svget4_s64(result, 2);
  svint64_t v3 = svget4_s64(result, 3);

  for (int i = 0; i < n; i += 1)
{
svint64_t v18 = svld1_s64(pg, in1);
svint64_t v19 = svld1_s64(pg, in2);
v0 = svmad_s64_z(pg, v0, v18, v19);
v1 = svmad_s64_z(pg, v1, v18, v19);
v2 = svmad_s64_z(pg, v2, v18, v19);
v3 = svmad_s64_z(pg, v3, v18, v19);
}
  svst1_s64(pg, out+0,v0);
  svst1_s64(pg, out+1,v1);
  svst1_s64(pg, out+2,v2);
  svst1_s64(pg, out+3,v3);
}
```

Before these patches:

```
bar:
ld4d{z4.d - z7.d}, p0

Re: [PATCH 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread Lehua Ding

Hi Vladimir,

On 2023/11/10 4:24, Vladimir Makarov wrote:


On 11/7/23 22:47, Lehua Ding wrote:


Lehua Ding (7):
   ira: Refactor the handling of register conflicts to make it more
 general
   ira: Add live_subreg problem and apply to ira pass
   ira: Support subreg live range track
   ira: Support subreg copy
   ira: Add all nregs >= 2 pseudos to tracke subreg list
   lra: Apply live_subreg df_problem to lra pass
   lra: Support subreg live range track and conflict detect

Thank you very much for addressing subreg RA.  It is a big work.  I 
wanted to address this long time ago but have no time to do this by myself.


I tried to evaluate your patches on x86-64 (i7-9700k) release mode GCC. 
I used -O3 for SPEC2017 compilation.


Here are the results:

    baseline baseline(+patches)
specint2017:  8.51 vs 8.58 (+0.8%)
specfp2017:   21.1 vs 21.1 (+0%)
compile time: 2426.41s vs 2580.58s (+6.4%)

Spec2017 average code size change: -0.07%

Improving specint by 0.8% is impressive for me.
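
For reference, the percentages above follow from the usual relative-change formula. A minimal sketch, with `percent_change` being a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Relative change from the baseline, in percent.
double percent_change(double baseline, double patched) {
  return (patched - baseline) / baseline * 100.0;
}
```

Applied to the figures quoted here, `percent_change(2426.41, 2580.58)` is roughly 6.35 and `percent_change(8.51, 8.58)` is roughly 0.82, consistent with the rounded +6.4% and +0.8%.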

Unfortunately, it is achieved by decreasing compilation speed by 6.4% 
(although on a smaller benchmark I saw only a 3% slowdown). I don't know how, 
but we should mitigate this speed degradation.  Maybe we can find a hot 
spot in the new code (but I think it is not the linear search pointed out by 
Richard Biener, as the object vectors most probably contain 1-2 elements) 
and improve that spot, or we could use this only for 
-O3/fast, or the code could be function or target dependent.


I also find that GCC consumes more memory with the patches. Maybe it can be 
improved too (although I am not sure about this).


Thanks for the specint performance data. I'll do my best to get the 
compile time and memory issues fixed. I'm very curious to know if the 
way used to solve the subreg coalesce problem makes sense to you?


I'll start to review the patches next week.  I don't expect that 
I'll find something serious enough to reject the patches, but again we should 
work on mitigating the compilation speed problem.  We can file a new 
PR for this and resolve the problem during the release cycle.


--
Best,
Lehua (RiVAI)
lehua.d...@rivai.ai



[PATCH] c++/modules: check mismatching exports for class tags [PR98885]

2023-11-12 Thread Nathaniel Shead
I think the error message is still a little bit unclear but I couldn't
come up with something clearer that was similarly concise and matching
the existing style.

(Also I noticed that the linked PR was assigned to Nathan but there
hadn't been activity for a while, and I've been looking into these kinds
of issues recently anyway so I thought I'd give it a go.)

Bootstrapped and regtested on x86_64-pc-linux-gnu. I don't have write
access.

-- >8 --

Checks for exporting a declaration that was previously declared as not
exported are implemented in 'duplicate_decls', but this doesn't handle
declarations of classes. This patch adds these checks and slightly
adjusts the associated error messages for clarity.

PR c++/98885

gcc/cp/ChangeLog:

* decl.cc (duplicate_decls): Adjust error message.
(xref_tag): Adjust error message. Check exporting decl that is
already declared as non-exporting.

gcc/testsuite/ChangeLog:

* g++.dg/modules/export-1.C: Adjust error messages. Remove
xfails for working case. Add new test case.

Signed-off-by: Nathaniel Shead 
---
 gcc/cp/decl.cc  | 21 ++---
 gcc/testsuite/g++.dg/modules/export-1.C | 16 +---
 2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index 4a07c7e879b..bde9bd79d58 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -2236,8 +2236,10 @@ duplicate_decls (tree newdecl, tree olddecl, bool 
hiding, bool was_hidden)
  if (DECL_MODULE_EXPORT_P (STRIP_TEMPLATE (newdecl))
  && !DECL_MODULE_EXPORT_P (not_tmpl))
{
- error ("conflicting exporting declaration %qD", newdecl);
- inform (olddecl_loc, "previous declaration %q#D here", olddecl);
+ auto_diagnostic_group d;
+ error ("conflicting exporting for declaration %qD", newdecl);
+ inform (olddecl_loc,
+ "previously declared here without exporting");
}
}
   else if (DECL_MODULE_EXPORT_P (newdecl))
@@ -16249,11 +16251,24 @@ xref_tag (enum tag_types tag_code, tree name,
  tree decl = TYPE_NAME (t);
  if (!module_may_redeclare (decl))
{
+ auto_diagnostic_group d;
  error ("cannot declare %qD in a different module", decl);
- inform (DECL_SOURCE_LOCATION (decl), "declared here");
+ inform (DECL_SOURCE_LOCATION (decl), "previously declared here");
  return error_mark_node;
}
 
+ tree not_tmpl = STRIP_TEMPLATE (decl);
+ if (DECL_LANG_SPECIFIC (not_tmpl)
+ && DECL_MODULE_ATTACH_P (not_tmpl)
+ && !DECL_MODULE_EXPORT_P (not_tmpl)
+ && module_exporting_p ())
+   {
+ auto_diagnostic_group d;
+ error ("conflicting exporting for declaration %qD", decl);
+ inform (DECL_SOURCE_LOCATION (decl),
+ "previously declared here without exporting");
+   }
+
  tree maybe_tmpl = decl;
  if (CLASS_TYPE_P (t) && CLASSTYPE_IS_TEMPLATE (t))
maybe_tmpl = CLASSTYPE_TI_TEMPLATE (t);
diff --git a/gcc/testsuite/g++.dg/modules/export-1.C 
b/gcc/testsuite/g++.dg/modules/export-1.C
index 8ca696ebee0..3f93814d270 100644
--- a/gcc/testsuite/g++.dg/modules/export-1.C
+++ b/gcc/testsuite/g++.dg/modules/export-1.C
@@ -4,19 +4,21 @@ export module frob;
 // { dg-module-cmi !frob }
 
 int x ();
-export int x (); // { dg-error "conflicting exporting declaration" }
+export int x (); // { dg-error "conflicting exporting for declaration" }
 
 int y;
-export extern int y; // { dg-error "conflicting exporting declaration" }
+export extern int y; // { dg-error "conflicting exporting for declaration" }
 
 typedef int z;
-export typedef int z; // { dg-error "conflicting exporting declaration" }
+export typedef int z; // { dg-error "conflicting exporting for declaration" }
 
 template  int f (T);
-export template  int f (T); // { dg-error "conflicting exporting 
declaration" }
+export template  int f (T); // { dg-error "conflicting exporting 
for declaration" }
 
-// doesn't go via duplicate_decls so we miss this for now
 class A;
-export class A; // { dg-error "conflicting exporting declaration" "" { xfail 
*-*-* } }
+export class A; // { dg-error "conflicting exporting for declaration" }
 
-// { dg-warning  "due to errors" "" { target *-*-* } 0 }
+template  struct B;
+export template  struct B {};  // { dg-error "conflicting 
exporting for declaration" }
+
+// { dg-warning "due to errors" "" { target *-*-* } 0 }
-- 
2.42.0



Re: [PATCH 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread Richard Sandiford
钟居哲  writes:
> Hi, Richard.
>
>>> Maybe dead lanes are better tracked at the gimple level though, not sure.
>>> (But AArch64 might need to lower lane operations more than it does now if
>>> we want gimple to handle it.)
>
> We were trying to address this issue at the GIMPLE level at the beginning.
> Tracking subreg-lanes of tuple types may be enough for aarch64 since aarch64 
> only has tuple types.
> However, for RVV, that's not enough to address all issues.
> Consider the following situation:
> https://godbolt.org/z/fhTvEjvr8 
>
> You can see that, compared with LLVM, GCC has many redundant mov instructions 
> "vmv1r.v",
> since GCC is not able to track subreg liveness, whereas LLVM can.
>
> The reasons why tracking sub-lanes in GIMPLE cannot address these redundant 
> move issues for RVV:
>
> 1. RVV has tuple types like "vint8m1x2_t", which is exactly the same as 
> aarch64 "svint8x1_t".
> It is used by segment load/store, which is similar to the "ld2r" 
> instruction in ARM SVE (vec_load_lanes/vec_store_lanes).
> Supporting sub-lane tracking in GIMPLE can fix this situation for both RVV 
> and ARM SVE.
> 
> 2. However, we do not only have "vint8m1x2_t"; we also have "vint8m2_t" (LMUL 
> = 2), which also occupies 2 registers
> but is not a tuple type; instead, it is a simple vector type. Such types are 
> used by all simple operations.
> For example, "vadd" with vint8m1_t does a PLUS operation on a single 
> vector register, whereas the same
> instruction "vadd" with vint8m2_t does a PLUS operation on 2 vector 
> registers.  We can't
> define such types as tuple types for the following reasons:
> 1). We also have tuple types for LMUL > 1; for example, 
> "vint8m2x2_t" is a tuple type.
>  If we define "vint8m2_t" as a tuple type, what about "vint8m2x2_t"? A 
> tuple type of tuples, or
>  an array of arrays? It makes the type very strange.
> 2). The RVV intrinsic doc defines vint8m2x2_t as a tuple type, but vint8m2_t 
> is not a tuple type. We are not able
>  to change the documents.
> 3). Clang has supported RVV intrinsics for 3 years; vint8m2_t has not been a 
> tuple type for 3 years and is widely
>  used, so changing the type definition would break the ecosystem.  So for 
> compatibility, we are not able to define
>  LMUL > 1 types as tuple types.
>
> For these reasons, we should be able to access the high part and 
> the low part of vint8m2_t; we provide
> vget to generate subreg accesses of the vector mode.
>
> So, at the discussion stage, we decided to address subpart access of vector 
> modes in a more generic way,
> which is to support subreg liveness tracking at the RTL level, so that it can not 
> only address the issues that happen on ARM SVE
> but also address the issues for LMUL > 1.
>
> 3. After we decided to support subreg liveness tracking in RTL, we studied LLVM.
> LLVM has a standalone pass, called the register coalescer, right before its 
> linear scan RA (greedy).
> So the first draft of our solution supported register coalescing 
> before RA, simulating the LLVM solution; it is open source:
> riscv-gcc/gcc/ira-coalesce.cc at riscv-gcc-rvv-next · 
> riscv-collab/riscv-gcc (github.com)
> However, we didn't think that solution was 
> elegant, and we consulted
> Vlad.  Vlad suggested we should enhance IRA/LRA with subreg liveness 
> tracking, which turns out to be
> a more reasonable and elegant approach. 
>
> So, after several experiments and investigations by Lehua, he dedicated himself 
> to producing this series of patches.
> And we think Lehua's approach should be a generic and optimal solution to fix 
> these generic subreg problems.

Ah, sorry, I caused a misunderstanding.  In the message quoted above,
I'd moved on from talking about tracking liveness of vectors in a tuple.
I was instead talking about tracking the liveness of individual lanes
in a single vector.

I was responding to Jeff's description of the bit-level liveness tracking
pass.  That pass solves a generic issue: redundant sign and zero extensions.
But it sounded like it could also be reused for tracking lanes of a vector
(by using different bit ranges from the ones that Jeff listed).

The thing that I was saying might be better done on gimple was tracking
lanes of an individual vector.  In other words, I was arguing against
my own question.

I should have changed the subject line when responding, sorry.

I wasn't suggesting that we should avoid subreg tracking in the RA.
That's definitely needed for AArch64, and in general.

Thanks,
Richard


Re: [PATCH 2/3] Add generated .opt.urls files

2023-11-12 Thread Iain Buclaw
Excerpts from David Malcolm's message of November 10, 2023 10:42 pm:
> gcc/d/ChangeLog:
>   * lang.opt.urls: New file, autogenerated by
>   regenerate-opt-urls.py.
> ---
>  gcc/d/lang.opt.urls  |   95 +
>  create mode 100644 gcc/d/lang.opt.urls
> 

[abridged view of patch]

> diff --git a/gcc/d/lang.opt.urls b/gcc/d/lang.opt.urls
> new file mode 100644
> index ..57c14ecc459a
> --- /dev/null
> +++ b/gcc/d/lang.opt.urls
> @@ -0,0 +1,95 @@
> +; Autogenerated by regenerate-opt-urls.py from gcc/d/lang.opt and generated 
> HTML
> +
> +H
> +UrlSuffix(gcc/Preprocessor-Options.html#index-H)
> +
> +I
> +UrlSuffix(gcc/Directory-Options.html#index-I)
> +
> +M
> +UrlSuffix(gcc/Preprocessor-Options.html#index-M)
> +
> +MD
> +UrlSuffix(gcc/Preprocessor-Options.html#index-MD)
> +
> +MF
> +UrlSuffix(gcc/Preprocessor-Options.html#index-MF)
> +
> +MG
> +UrlSuffix(gcc/Preprocessor-Options.html#index-MG)
> +
> +MM
> +UrlSuffix(gcc/Preprocessor-Options.html#index-MM)
> +
> +MMD
> +UrlSuffix(gcc/Preprocessor-Options.html#index-MMD)
> +
> +MP
> +UrlSuffix(gcc/Preprocessor-Options.html#index-MP)
> +
> +MT
> +UrlSuffix(gcc/Preprocessor-Options.html#index-MT)
> +
> +MQ
> +UrlSuffix(gcc/Preprocessor-Options.html#index-MQ)
> +
> +Waddress
> +UrlSuffix(gcc/Warning-Options.html#index-Waddress)
> +
> +; skipping 'Wall' due to multiple URLs:
> +;   duplicate: 'gcc/Standard-Libraries.html#index-Wall-1'
> +;   duplicate: 'gcc/Warning-Options.html#index-Wall'
> +
> +Walloca
> +UrlSuffix(gcc/Warning-Options.html#index-Walloca)
> +
> +Walloca-larger-than=
> +UrlSuffix(gcc/Warning-Options.html#index-Walloca-larger-than_003d)
> +
> +Wbuiltin-declaration-mismatch
> +UrlSuffix(gcc/Warning-Options.html#index-Wbuiltin-declaration-mismatch)
> +
> +Wdeprecated
> +UrlSuffix(gcc/Warning-Options.html#index-Wdeprecated)
> +
> +Werror
> +UrlSuffix(gcc/Warning-Options.html#index-Werror)
> +
> +Wextra
> +UrlSuffix(gcc/Warning-Options.html#index-Wextra)
> +
> +Wunknown-pragmas
> +UrlSuffix(gcc/Warning-Options.html#index-Wno-unknown-pragmas)
> +
> +Wvarargs
> +UrlSuffix(gcc/Warning-Options.html#index-Wno-varargs)
> +
> +; skipping 'fbuiltin' due to multiple URLs:
> +;   duplicate: 'gcc/C-Dialect-Options.html#index-fbuiltin'
> +;   duplicate: 'gcc/Other-Builtins.html#index-fno-builtin-3'
> +;   duplicate: 'gcc/Warning-Options.html#index-fno-builtin-1'
> +
> +fexceptions
> +UrlSuffix(gcc/Code-Gen-Options.html#index-fexceptions)
> +
> +frtti
> +UrlSuffix(gcc/C_002b_002b-Dialect-Options.html#index-fno-rtti)
> +
> +imultilib
> +UrlSuffix(gcc/Directory-Options.html#index-imultilib)
> +
> +iprefix
> +UrlSuffix(gcc/Directory-Options.html#index-iprefix)
> +
> +isysroot
> +UrlSuffix(gcc/Directory-Options.html#index-isysroot)
> +
> +isystem
> +UrlSuffix(gcc/Directory-Options.html#index-isystem)
> +
> +nostdinc
> +UrlSuffix(gcc/Directory-Options.html#index-nostdinc)
> +
> +v
> +UrlSuffix(gcc/Overall-Options.html#index-v)
> +
> -- 
> 2.26.3
> 
> 

So I see this focuses on only adding URLs for common options, or options
that relate to C/C++ family, but may be handled by other front-ends too?

To pick out one, you have:

frtti
UrlSuffix(gcc/C_002b_002b-Dialect-Options.html#index-fno-rtti)

It looks like it could alternatively be

frtti
UrlSuffix(gdc/Runtime-Options.html#index-frtti)

Or are other front-ends having URLs to their language-specific
documentation pages not supported for the same reason as why they can't
add self-documentation to their own options if another front-end
(typically C/C++) also makes claim to the option?

frtti
D 
; Documented in C


I'm OK with the D parts regardless of this observation.

Thanks,
Iain.


Re: [PATCH 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread Lehua Ding

Hi Dimitar,

I solved the problem you reported in V2 patch 
(https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636166.html), 
is it possible for you to help confirm this? Thank you very much.


On 2023/11/9 0:56, Dimitar Dimitrov wrote:

On Wed, Nov 08, 2023 at 11:47:33AM +0800, Lehua Ding wrote:

Hi,

These patches try to support the subreg coalesce feature in
register allocation passes (ira and lra).


Hi Lehua,

This patch set breaks the build for at least three embedded targets. See
below.

For avr the GCC build fails with:
/mnt/nvme/dinux/local-workspace/gcc/gcc/ira-lives.cc:149:39: error: call of overloaded 
‘set_subreg_conflict_hard_regs(ira_allocno*&, int&)’ is ambiguous
   149 | set_subreg_conflict_hard_regs (OBJECT_ALLOCNO (obj), regno);


For arm-none-eabi the newlib build fails with:
/mnt/nvme/dinux/local-workspace/newlib/newlib/libm/math/e_jn.c:279:1: internal 
compiler error: Floating point exception
   279 | }
   | ^
0x1176e0f crash_signal
 /mnt/nvme/dinux/local-workspace/gcc/gcc/toplev.cc:316
0xf6008d get_range_hard_regs(int, subreg_range const&)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:609
0xf6008d get_range_hard_regs(int, subreg_range const&)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:601
0xf60312 new_insn_reg
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:658
0xf6064d add_regs_to_insn_regno_info
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1623
0xf62909 lra_update_insn_regno_info(rtx_insn*)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1769
0xf62e46 lra_update_insn_regno_info(rtx_insn*)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1762
0xf62e46 lra_push_insn_1
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1919
0xf62f2d lra_push_insn(rtx_insn*)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1927
0xf62f2d push_insns
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1970
0xf63302 push_insns
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1966
0xf63302 lra(_IO_FILE*)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2511
0xf0e399 do_reload
 /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
0xf0e399 execute
 /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148


For pru-elf the GCC build fails with:
/mnt/nvme/dinux/local-workspace/gcc/libgcc/unwind-dw2-fde.c: In function 
'linear_search_fdes':
/mnt/nvme/dinux/local-workspace/gcc/libgcc/unwind-dw2-fde.c:1035:1: internal 
compiler error: Floating point exception
  1035 | }
   | ^
0x1694f2e crash_signal
 /mnt/nvme/dinux/local-workspace/gcc/gcc/toplev.cc:316
0x1313178 get_range_hard_regs(int, subreg_range const&)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:609
0x131343a new_insn_reg
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:658
0x13174f0 add_regs_to_insn_regno_info
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1608
0x1318479 lra_update_insn_regno_info(rtx_insn*)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1769
0x13196ab lra_push_insn_1
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1919
0x13196de lra_push_insn(rtx_insn*)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1927
0x13197da push_insns
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:1970
0x131b6dc lra(_IO_FILE*)
 /mnt/nvme/dinux/local-workspace/gcc/gcc/lra.cc:2511
0x129f237 do_reload
 /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:5960
0x129f6c6 execute
 /mnt/nvme/dinux/local-workspace/gcc/gcc/ira.cc:6148


The divide by zero error above is interesting. I'm not sure why 
ira_reg_class_max_nregs[] yields 0 for the pseudo register 168 in the following 
rtx:
(debug_insn 168 167 169 19 (var_location:SI encoding (reg/v:SI 168 [ encoding 
])) -1
  (nil))

Regards,
Dimitar



--
Best,
Lehua (RiVAI)
lehua.d...@rivai.ai



[PATCH V2 6/7] lra: Switch to live_subreg data flow

2023-11-12 Thread Lehua Ding
This patch switches lra from live_reg data to live_subreg data.
The situation is more complicated than in ira because lra also
modifies this data, so the live_subreg data has to be recalculated.

gcc/ChangeLog:

* lra-coalesce.cc (update_live_info):
Adjust to new live subreg data.
(lra_coalesce): Ditto.
* lra-constraints.cc (update_ebb_live_info): Ditto.
(get_live_on_other_edges): Ditto.
(inherit_in_ebb): Ditto.
(lra_inheritance): Ditto.
(fix_bb_live_info): Ditto.
(remove_inheritance_pseudos): Ditto.
* lra-int.h (GCC_LRA_INT_H): Ditto.
* lra-lives.cc (class bb_data_pseudos): Ditto.
(make_hard_regno_live): Ditto.
(make_hard_regno_dead): Ditto.
(mark_regno_live): Ditto.
(mark_regno_dead): Ditto.
(live_trans_fun): Ditto.
(live_con_fun_0): Ditto.
(live_con_fun_n): Ditto.
(initiate_live_solver): Ditto.
(finish_live_solver): Ditto.
(process_bb_lives): Ditto.
(lra_create_live_ranges_1): Ditto.
* lra-remat.cc (dump_candidates_and_remat_bb_data): Ditto.
(calculate_livein_cands): Ditto.
(do_remat): Ditto.
* lra-spills.cc (spill_pseudos): Ditto.

---
 gcc/lra-coalesce.cc|  20 ++-
 gcc/lra-constraints.cc |  93 +---
 gcc/lra-int.h  |   2 +
 gcc/lra-lives.cc   | 328 -
 gcc/lra-remat.cc   |  13 +-
 gcc/lra-spills.cc  |  22 ++-
 6 files changed, 374 insertions(+), 104 deletions(-)

diff --git a/gcc/lra-coalesce.cc b/gcc/lra-coalesce.cc
index 04a5bbd714b..abfc54f1cc2 100644
--- a/gcc/lra-coalesce.cc
+++ b/gcc/lra-coalesce.cc
@@ -188,19 +188,25 @@ static bitmap_head used_pseudos_bitmap;
 /* Set up USED_PSEUDOS_BITMAP, and update LR_BITMAP (a BB live info
bitmap).  */
 static void
-update_live_info (bitmap lr_bitmap)
+update_live_info (bitmap all, bitmap full, bitmap partial)
 {
   unsigned int j;
   bitmap_iterator bi;
 
   bitmap_clear (&used_pseudos_bitmap);
-  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, lr_bitmap,
+  EXECUTE_IF_AND_IN_BITMAP (&coalesced_pseudos_bitmap, all,
FIRST_PSEUDO_REGISTER, j, bi)
 bitmap_set_bit (&used_pseudos_bitmap, first_coalesced_pseudo[j]);
   if (! bitmap_empty_p (&used_pseudos_bitmap))
 {
-  bitmap_and_compl_into (lr_bitmap, &coalesced_pseudos_bitmap);
-  bitmap_ior_into (lr_bitmap, &used_pseudos_bitmap);
+  bitmap_and_compl_into (all, &coalesced_pseudos_bitmap);
+  bitmap_ior_into (all, &used_pseudos_bitmap);
+
+  bitmap_and_compl_into (full, &coalesced_pseudos_bitmap);
+  bitmap_ior_and_compl_into (full, &used_pseudos_bitmap, partial);
+
+  bitmap_and_compl_into (partial, &coalesced_pseudos_bitmap);
+  bitmap_ior_and_compl_into (partial, &used_pseudos_bitmap, full);
 }
 }
 
@@ -303,8 +309,10 @@ lra_coalesce (void)
   bitmap_initialize (&used_pseudos_bitmap, &reg_obstack);
   FOR_EACH_BB_FN (bb, cfun)
 {
-  update_live_info (df_get_live_in (bb));
-  update_live_info (df_get_live_out (bb));
+  update_live_info (DF_LIVE_SUBREG_IN (bb), DF_LIVE_SUBREG_FULL_IN (bb),
+   DF_LIVE_SUBREG_PARTIAL_IN (bb));
+  update_live_info (DF_LIVE_SUBREG_OUT (bb), DF_LIVE_SUBREG_FULL_OUT (bb),
+   DF_LIVE_SUBREG_PARTIAL_OUT (bb));
   FOR_BB_INSNS_SAFE (bb, insn, next)
if (INSN_P (insn)
&& bitmap_bit_p (&involved_insns_bitmap, INSN_UID (insn)))
diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 0607c8be7cb..c3ad846b97b 100644
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -6571,34 +6571,75 @@ update_ebb_live_info (rtx_insn *head, rtx_insn *tail)
{
  if (prev_bb != NULL)
{
- /* Update df_get_live_in (prev_bb):  */
+ /* Update subreg live (prev_bb):  */
+ bitmap subreg_all_in = DF_LIVE_SUBREG_IN (prev_bb);
+ bitmap subreg_full_in = DF_LIVE_SUBREG_FULL_IN (prev_bb);
+ bitmap subreg_partial_in = DF_LIVE_SUBREG_PARTIAL_IN (prev_bb);
+ subregs_live *range_in = DF_LIVE_SUBREG_RANGE_IN (prev_bb);
  EXECUTE_IF_SET_IN_BITMAP (&check_only_regs, 0, j, bi)
if (bitmap_bit_p (&live_regs, j))
- bitmap_set_bit (df_get_live_in (prev_bb), j);
-   else
- bitmap_clear_bit (df_get_live_in (prev_bb), j);
+ {
+   bitmap_set_bit (subreg_all_in, j);
+   bitmap_set_bit (subreg_full_in, j);
+   if (bitmap_bit_p (subreg_partial_in, j))
+ {
+   bitmap_clear_bit (subreg_partial_in, j);
+   range_in->remove_live (j);
+ }
+ }
+   else if (bitmap_bit_p (subreg_all_in, j))
+ {
+   bi

[PATCH V2 1/7] df: Add DF_LIVE_SUBREG problem

2023-11-12 Thread Lehua Ding
This patch adds a live_subreg problem that extends the original live_reg
problem to track the liveness of subregs. We only track pseudo registers
whose mode size is a multiple of the natural size and of which some part is
eventually accessed through a subreg. Compared with the live_reg problem, the
live_subreg problem has the following output. full_in/out means the entire
pseudo is live in/out, partial_in/out means some subregs of the pseudo are
live in/out, and range_in/out indicates which part of the pseudo is live.
all_in/out is the union of full_in/out and partial_in/out:

  bitmap_head all_in, full_in;
  bitmap_head all_out, full_out;
  bitmap_head partial_in;
  bitmap_head partial_out;
  subregs_live *range_in = NULL;
  subregs_live *range_out = NULL;
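
As a rough illustration of how these sets relate, here is a minimal sketch.
It is not code from the patch: RegSet, LiveSubregSets, all_of and
make_fully_live are hypothetical names, std::set stands in for bitmap_head,
and it assumes that full and partial are kept disjoint.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the per-block bitmaps (GCC uses bitmap_head).
using RegSet = std::set<int>;

struct LiveSubregSets
{
  RegSet full;     // pseudos that are live in their entirety
  RegSet partial;  // pseudos of which only some subregs are live
};

// all = full ∪ partial, as described in the commit message.
RegSet
all_of (const LiveSubregSets &s)
{
  RegSet all = s.full;
  all.insert (s.partial.begin (), s.partial.end ());
  return all;
}

// Mark REGNO as fully live: it leaves partial (if present) and joins full.
// Assumption: a pseudo is never in both sets at once.
void
make_fully_live (LiveSubregSets &s, int regno)
{
  s.partial.erase (regno);
  s.full.insert (regno);
}
```

In this model the range_in/out data would only be needed for members of
partial, since a member of full is live everywhere.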

gcc/ChangeLog:

* Makefile.in: Add new object file.
* df-problems.cc (struct df_live_subreg_problem_data):
The data of the new live_subreg problem.
(need_track_subreg): New function.
(get_range): Ditto.
(remove_subreg_range): Ditto.
(add_subreg_range): Ditto.
(df_live_subreg_free_bb_info): Ditto.
(df_live_subreg_alloc): Ditto.
(df_live_subreg_reset): Ditto.
(df_live_subreg_bb_local_compute): Ditto.
(df_live_subreg_local_compute): Ditto.
(df_live_subreg_init): Ditto.
(df_live_subreg_check_result): Ditto.
(df_live_subreg_confluence_0): Ditto.
(df_live_subreg_confluence_n): Ditto.
(df_live_subreg_transfer_function): Ditto.
(df_live_subreg_finalize): Ditto.
(df_live_subreg_free): Ditto.
(df_live_subreg_top_dump): Ditto.
(df_live_subreg_bottom_dump): Ditto.
(df_live_subreg_add_problem): Ditto.
* df.h (enum df_problem_id): Add live_subreg id.
(DF_LIVE_SUBREG_INFO): Data accessor.
(DF_LIVE_SUBREG_IN): Ditto.
(DF_LIVE_SUBREG_OUT): Ditto.
(DF_LIVE_SUBREG_FULL_IN): Ditto.
(DF_LIVE_SUBREG_FULL_OUT): Ditto.
(DF_LIVE_SUBREG_PARTIAL_IN): Ditto.
(DF_LIVE_SUBREG_PARTIAL_OUT): Ditto.
(DF_LIVE_SUBREG_RANGE_IN): Ditto.
(DF_LIVE_SUBREG_RANGE_OUT): Ditto.
(class subregs_live): New class.
(class basic_block_subreg_live_info): Ditto.
(class df_live_subreg_bb_info): Ditto.
(df_live_subreg): Ditto.
(df_live_subreg_add_problem): Ditto.
(df_live_subreg_finalize): Ditto.
(class subreg_range): Ditto.
(need_track_subreg): Ditto.
(remove_subreg_range): Ditto.
(add_subreg_range): Ditto.
(df_live_subreg_get_bb_info): Ditto.
* regs.h (get_nblocks): Helper function.
* timevar.def (TV_DF_LIVE_SUBREG): New timevar.
* subreg-live-range.cc: New file.
* subreg-live-range.h: New file.

---
 gcc/Makefile.in  |   1 +
 gcc/df-problems.cc   | 889 ++-
 gcc/df.h |  67 +++
 gcc/regs.h   |   7 +
 gcc/subreg-live-range.cc | 628 +++
 gcc/subreg-live-range.h  | 333 +++
 gcc/timevar.def  |   1 +
 7 files changed, 1925 insertions(+), 1 deletion(-)
 create mode 100644 gcc/subreg-live-range.cc
 create mode 100644 gcc/subreg-live-range.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 29cec21c825..e4403b5a30c 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1675,6 +1675,7 @@ OBJS = \
store-motion.o \
streamer-hooks.o \
stringpool.o \
+subreg-live-range.o \
substring-locations.o \
target-globals.o \
targhooks.o \
diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
index d2cfaf7f50f..2585c762fd1 100644
--- a/gcc/df-problems.cc
+++ b/gcc/df-problems.cc
@@ -28,6 +28,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "rtl.h"
 #include "df.h"
+#include "subreg-live-range.h"
 #include "memmodel.h"
 #include "tm_p.h"
 #include "insn-config.h"
@@ -1344,8 +1345,894 @@ df_lr_verify_transfer_functions (void)
   bitmap_clear (&all_blocks);
 }
 
+/*
+   REGISTER AND SUBREG LIVES
+   Like DF_RL, but fine-grained tracking of subreg lifecycle.
+   
*/
+
+/* Private data used to verify the solution for this problem.  */
+struct df_live_subreg_problem_data
+{
+  /* An obstack for the bitmaps we need for this problem.  */
+  bitmap_obstack live_subreg_bitmaps;
+  bool has_subreg_live_p;
+};
+
+/* Helper functions */
+
+/* Return true if REGNO is a pseudo and MODE is a multil regs size.  */
+bool
+need_track_subreg (int regno, machine_mode reg_mode)
+{
+  poly_int64 total_size = GET_MODE_SIZE (reg_mode);
+  poly_int64 natural_size = REGMODE_NATURAL_SIZE (reg_mode);
+  return maybe_gt (total_size, natural_size)
+&& multiple_p (total_size, natural_size)
+&& regno >= FIRST_PSEUDO_REGISTER;
+

[PATCH V2 5/7] ira: Add all nregs >= 2 pseudos to tracke subreg list

2023-11-12 Thread Lehua Ding
This patch relaxes the subreg tracking restriction so that all pseudos that
occupy two or more hard registers are tracked.
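
The unit size computed for such a pseudo can be sketched with plain integers.
This is only an illustration: PseudoInfo and unit_size are hypothetical, the
real code works on poly_int64 values, and the number of blocks is assumed to
be the mode size divided by the natural size. Where the patch asserts, this
sketch returns 0 to mean "not tracked".

```cpp
#include <cassert>

// Hypothetical, simplified description of a pseudo; sizes are in bytes.
struct PseudoInfo
{
  int mode_size;     // total size of the pseudo's mode
  int natural_size;  // size of one natural block of that mode
  int nregs;         // number of hard registers the pseudo occupies
};

// Return the number of bytes covered by one hard register of the pseudo,
// or 0 if the pseudo would not be tracked in this simplified model.
int
unit_size (const PseudoInfo &p)
{
  // Only pseudos that need more than one hard register are tracked.
  if (p.nregs <= 1 || p.natural_size <= 0)
    return 0;
  if (p.mode_size % p.natural_size != 0)
    return 0;
  int nblocks = p.mode_size / p.natural_size;
  // The blocks must be divided evenly among the hard registers.
  if (nblocks % p.nregs != 0)
    return 0;
  return p.natural_size * (nblocks / p.nregs);
}
```

For example, a 32-byte mode made of 8-byte blocks and held in two hard
registers gives a unit size of 16 bytes in this model.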

gcc/ChangeLog:

* ira-build.cc (get_reg_unit_size): New.
(has_same_nregs): New.
(ira_set_allocno_class): Adjust.

---
 gcc/ira-build.cc | 41 -
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index 13f0f7336ed..f88aaef 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -607,6 +607,37 @@ ira_create_allocno (int regno, bool cap_p,
   return a;
 }
 
+/* Return single register size of allocno A.  */
+static poly_int64
+get_reg_unit_size (ira_allocno_t a)
+{
+  enum reg_class aclass = ALLOCNO_CLASS (a);
+  gcc_assert (aclass != NO_REGS);
+  machine_mode mode = ALLOCNO_MODE (a);
+  int nregs = ALLOCNO_NREGS (a);
+  poly_int64 block_size = REGMODE_NATURAL_SIZE (mode);
+  int nblocks = get_nblocks (mode);
+  gcc_assert (nblocks % nregs == 0);
+  return block_size * (nblocks / nregs);
+}
+
+/* Return true if TARGET_CLASS_MAX_NREGS and TARGET_HARD_REGNO_NREGS results is
+   same. It should be noted that some targets may not implement these two very
+   uniformly, and need to be debugged step by step. For example, in V3x1DI mode
+   in AArch64, TARGET_CLASS_MAX_NREGS returns 2 but TARGET_HARD_REGNO_NREGS
+   returns 3. They are in conflict and need to be repaired in the Hook of
+   AArch64.  */
+static bool
+has_same_nregs (ira_allocno_t a)
+{
+  for (int i = 0; i < FIRST_PSEUDO_REGISTER; i++)
+if (REGNO_REG_CLASS (i) != NO_REGS
+   && reg_class_subset_p (REGNO_REG_CLASS (i), ALLOCNO_CLASS (a))
+   && ALLOCNO_NREGS (a) != hard_regno_nregs (i, ALLOCNO_MODE (a)))
+  return false;
+  return true;
+}
+
 /* Set up register class for A and update its conflict hard
registers.  */
 void
@@ -624,12 +655,12 @@ ira_set_allocno_class (ira_allocno_t a, enum reg_class 
aclass)
 
   if (aclass == NO_REGS)
 return;
-  /* SET the unit_size of one register.  */
-  machine_mode mode = ALLOCNO_MODE (a);
-  int nregs = ira_reg_class_max_nregs[aclass][mode];
-  if (nregs == 2 && maybe_eq (GET_MODE_SIZE (mode), nregs * UNITS_PER_WORD))
+  gcc_assert (!ALLOCNO_TRACK_SUBREG_P (a));
+  /* Set unit size and track_subreg_p flag for pseudo which need occupied multi
+ hard regs.  */
+  if (ALLOCNO_NREGS (a) > 1 && has_same_nregs (a))
 {
-  ALLOCNO_UNIT_SIZE (a) = UNITS_PER_WORD;
+  ALLOCNO_UNIT_SIZE (a) = get_reg_unit_size (a);
   ALLOCNO_TRACK_SUBREG_P (a) = true;
   return;
 }
-- 
2.36.3



[PATCH V2 7/7] lra: Support subreg live range track and conflict detect

2023-11-12 Thread Lehua Ding
This patch supports tracking the liveness of subregs in the lra pass, with the
goal of making it agree with ira's register allocation scheme. There is some
duplication; this part of the logic could perhaps be unified in the future.

gcc/ChangeLog:

* ira-build.cc (setup_pseudos_has_subreg_object):
Collect new data for lra to use.
(ira_build): Ditto.
* lra-assigns.cc (set_offset_conflicts): New function.
(setup_live_pseudos_and_spill_after_risky_transforms): Adjust.
(lra_assign): Ditto.
* lra-constraints.cc (process_alt_operands): Ditto.
* lra-int.h (GCC_LRA_INT_H): Ditto.
(struct lra_live_range): Ditto.
(struct lra_insn_reg): Ditto.
(get_range_hard_regs): New.
(get_nregs): New.
(has_subreg_object_p): New.
* lra-lives.cc (INCLUDE_VECTOR): Adjust.
(lra_live_range_pool): Ditto.
(create_live_range): Ditto.
(lra_merge_live_ranges): Ditto.
(update_pseudo_point): Ditto.
(mark_regno_live): Ditto.
(mark_regno_dead): Ditto.
(process_bb_lives): Ditto.
(remove_some_program_points_and_update_live_ranges): Ditto.
(lra_print_live_range_list): Ditto.
(class subreg_live_item): New.
(create_subregs_live_ranges): New.
(lra_create_live_ranges_1): Ditto.
* lra.cc (get_range_blocks): Ditto.
(get_range_hard_regs): Ditto.
(new_insn_reg): Ditto.
(collect_non_operand_hard_regs): Ditto.
(initialize_lra_reg_info_element): Ditto.
(reg_same_range_p): New.
(add_regs_to_insn_regno_info): Adjust.

---
 gcc/ira-build.cc   |  31 
 gcc/lra-assigns.cc | 111 --
 gcc/lra-constraints.cc |  18 ++-
 gcc/lra-int.h  |  31 
 gcc/lra-lives.cc   | 340 ++---
 gcc/lra.cc | 139 +++--
 6 files changed, 585 insertions(+), 85 deletions(-)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index f88aaef..bb29627d375 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -95,6 +95,9 @@ int ira_copies_num;
basic block.  */
 static int last_basic_block_before_change;
 
+/* Record these pseudos which has subreg object. Used by LRA pass.  */
+bitmap_head pseudos_has_subreg_object;
+
 /* Initialize some members in loop tree node NODE.  Use LOOP_NUM for
the member loop_num.  */
 static void
@@ -3711,6 +3714,33 @@ update_conflict_hard_reg_costs (void)
 }
 }
 
+/* Setup speudos_has_subreg_object.  */
+static void
+setup_pseudos_has_subreg_object ()
+{
+  bitmap_initialize (&pseudos_has_subreg_object, &reg_obstack);
+  ira_allocno_t a;
+  ira_allocno_iterator ai;
+  FOR_EACH_ALLOCNO (a, ai)
+if (has_subreg_object_p (a))
+  {
+   bitmap_set_bit (&pseudos_has_subreg_object, ALLOCNO_REGNO (a));
+   if (ira_dump_file != NULL)
+ {
+   fprintf (ira_dump_file,
+"  a%d(r%d, nregs: %d) has subreg objects:\n",
+ALLOCNO_NUM (a), ALLOCNO_REGNO (a), ALLOCNO_NREGS (a));
+   ira_allocno_object_iterator oi;
+   ira_object_t obj;
+   FOR_EACH_ALLOCNO_OBJECT (a, obj, oi)
+ fprintf (ira_dump_file, "object %d: start: %d, nregs: %d\n",
+  OBJECT_INDEX (obj), OBJECT_START (obj),
+  OBJECT_NREGS (obj));
+   fprintf (ira_dump_file, "\n");
+ }
+  }
+}
+
 /* Create a internal representation (IR) for IRA (allocnos, copies,
loop tree nodes).  The function returns TRUE if we generate loop
structure (besides nodes representing all function and the basic
@@ -3731,6 +3761,7 @@ ira_build (void)
   create_allocnos ();
   ira_costs ();
   create_allocno_objects ();
+  setup_pseudos_has_subreg_object ();
   ira_create_allocno_live_ranges ();
   remove_unnecessary_regions (false);
   ira_compress_allocno_live_ranges ();
diff --git a/gcc/lra-assigns.cc b/gcc/lra-assigns.cc
index d2ebcfd5056..6588a740162 100644
--- a/gcc/lra-assigns.cc
+++ b/gcc/lra-assigns.cc
@@ -1131,6 +1131,52 @@ assign_hard_regno (int hard_regno, int regno)
 /* Array used for sorting different pseudos.  */
 static int *sorted_pseudos;
 
+/* The detail conflict offsets If two live ranges conflict. Use to record
+   partail conflict.  */
+static bitmap_head live_range_conflicts;
+
+/* Set the conflict offset of the two registers REGNO1 and REGNO2. Use the
+   regno with bigger nregs as the base.  */
+static void
+set_offset_conflicts (int regno1, int regno2)
+{
+  gcc_assert (reg_renumber[regno1] >= 0 && reg_renumber[regno2] >= 0);
+  int nregs1 = get_nregs (regno1);
+  int nregs2 = get_nregs (regno2);
+  if (nregs1 < nregs2)
+{
+  std::swap (nregs1, nregs2);
+  std::swap (regno1, regno2);
+}
+
+  lra_live_range_t r1 = lra_reg_info[regno1].live_ranges;
+  lra_live_range_t r2 = lra_reg_info[regno2].live_ranges;
+  int total = nregs1;
+
+  bitmap_clear (&live_range_confli

[PATCH V2 0/7] ira/lra: Support subreg coalesce

2023-11-12 Thread Lehua Ding
Hi,

These patches try to support the subreg coalesce feature in the
register allocation passes (ira and lra).

Let's consider a RISC-V program (https://godbolt.org/z/ec51d91aT):

```
#include <riscv_vector.h>

void
foo (int32_t *in, int32_t *out, size_t m)
{
  vint32m2_t result = __riscv_vle32_v_i32m2 (in, 32);
  vint32m1_t v0 = __riscv_vget_v_i32m2_i32m1 (result, 0);
  vint32m1_t v1 = __riscv_vget_v_i32m2_i32m1 (result, 1);
  for (size_t i = 0; i < m; i++)
{
  v0 = __riscv_vadd_vv_i32m1(v0, v0, 4);
  v1 = __riscv_vmul_vv_i32m1(v1, v1, 4);
}
  *(vint32m1_t*)(out+4*0) = v0;
  *(vint32m1_t*)(out+4*1) = v1;
}
```

Before these patches:

```
foo:
        li      a5,32
        vsetvli zero,a5,e32,m2,ta,ma
        vle32.v v4,0(a0)
        vmv1r.v v2,v4
        vmv1r.v v1,v5
        beq     a2,zero,.L2
        li      a5,0
        vsetivli        zero,4,e32,m1,ta,ma
.L3:
        addi    a5,a5,1
        vadd.vv v2,v2,v2
        vmul.vv v1,v1,v1
        bne     a2,a5,.L3
.L2:
        vs1r.v  v2,0(a1)
        addi    a1,a1,16
        vs1r.v  v1,0(a1)
        ret
```

After these patches:

```
foo:
        li      a5,32
        vsetvli zero,a5,e32,m2,ta,ma
        vle32.v v2,0(a0)
        beq     a2,zero,.L2
        li      a5,0
        vsetivli        zero,4,e32,m1,ta,ma
.L3:
        addi    a5,a5,1
        vadd.vv v2,v2,v2
        vmul.vv v3,v3,v3
        bne     a2,a5,.L3
.L2:
        vs1r.v  v2,0(a1)
        addi    a1,a1,16
        vs1r.v  v3,0(a1)
        ret
ret
```

As you can see, the two redundant vmv1r.v instructions are removed.
They appear because the current ira pass is conservative when
calculating the live range of pseudo registers that occupy multiple
hard registers. Consider the two RTL instructions below, where r134
occupies two physical registers and r135 and r136 each occupy one.
At insn 12, ira considers the entire r134 pseudo register to be live,
so r135 conflicts with r134, as shown in the ira dump info. Then, when
the physical registers are allocated, r135 and r136 are allocated
first because they are inside the loop body and have higher priority.
This makes it difficult to assign r136 so that it overlaps with r134,
i.e., to assign r136 to hr100, which would eliminate the need for the
vmv1r.v instruction. Thus two vmv1r.v instructions appear.

If we refine the liveness information of r134 down to each subreg,
we can remove this conflict. We can then create copies for sets that
reference a subreg, which raises the allocation priority of r134 and
lets registers with larger alignment requirements be allocated
physical registers first. In RVV, a pseudo register occupying two
physical registers must be aligned to a multiple of 2.
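
To illustrate why the allocation order matters under such an alignment
constraint, here is a small sketch. It is not taken from the patches:
first_aligned_free is a hypothetical helper, the register file is modelled
as a vector of "free" flags, and the alignment is assumed to equal the
number of registers occupied.

```cpp
#include <cassert>
#include <vector>

// Return the first index START such that START is a multiple of NREGS and
// registers START .. START + NREGS - 1 are all free, or -1 if none exists.
int
first_aligned_free (const std::vector<bool> &is_free, int nregs)
{
  if (nregs <= 0)
    return -1;
  int n = static_cast<int> (is_free.size ());
  // Only aligned start positions are considered.
  for (int start = 0; start + nregs <= n; start += nregs)
    {
      bool all_free = true;
      for (int i = 0; i < nregs; i++)
        if (!is_free[start + i])
          all_free = false;
      if (all_free)
        return start;
    }
  return -1;
}
```

If two single-register pseudos have already taken registers 1 and 2 of a
four-register file, no aligned pair is left for a two-register pseudo, even
though two registers are still free. Allocating the wider pseudo first
avoids this.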

```
(insn 11 10 12 2 (set (reg/v:RVVM1SI 135 [ v0 ])
(subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) 0)) 
"/app/example.c":7:19 998 {*movrvvm1si_whole}
 (nil))
(insn 12 11 13 2 (set (reg/v:RVVM1SI 136 [ v1 ])
(subreg:RVVM1SI (reg/v:RVVM2SI 134 [ result ]) [16, 16])) 
"/app/example.c":8:19 998 {*movrvvm1si_whole}
 (expr_list:REG_DEAD (reg/v:RVVM2SI 134 [ result ])
(nil)))
```

ira dump:

;; a1(r136,l0) conflicts: a3(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a3(r135,l0) conflicts: a1(r136,l0) a6(r134,l0)
;; total conflict hard regs:
;; conflict hard regs:
;; a6(r134,l0) conflicts: a3(r135,l0)
;; total conflict hard regs:
;; conflict hard regs:
;;
;; ...
  Popping a1(r135,l0)  -- assign reg 97
  Popping a3(r136,l0)  -- assign reg 98
  Popping a4(r137,l0)  -- assign reg 15
  Popping a5(r140,l0)  -- assign reg 12
  Popping a10(r145,l0)  -- assign reg 12
  Popping a2(r139,l0)  -- assign reg 11
  Popping a9(r144,l0)  -- assign reg 11
  Popping a0(r142,l0)  -- assign reg 11
  Popping a6(r134,l0)  -- assign reg 100
  Popping a7(r143,l0)  -- assign reg 10
  Popping a8(r141,l0)  -- assign reg 15

The AArch64 SVE has the same problem. Consider the following
code (https://godbolt.org/z/MYrK7Ghaj):

```
#include <arm_sve.h>

int bar (svbool_t pg, int64_t* base, int n, int64_t *in1, int64_t *in2, 
int64_t*out)
{
  svint64x4_t result = svld4_s64 (pg, base);
  svint64_t v0 = svget4_s64(result, 0);
  svint64_t v1 = svget4_s64(result, 1);
  svint64_t v2 = svget4_s64(result, 2);
  svint64_t v3 = svget4_s64(result, 3);

  for (int i = 0; i < n; i += 1)
{
svint64_t v18 = svld1_s64(pg, in1);
svint64_t v19 = svld1_s64(pg, in2);
v0 = svmad_s64_z(pg, v0, v18, v19);
v1 = svmad_s64_z(pg, v1, v18, v19);
v2 = svmad_s64_z(pg, v2, v18, v19);
v3 = svmad_s64_z(pg, v3, v18, v19);
}
  svst1_s64(pg, out+0,v0);
  svst1_s64(pg, out+1,v1);
  svst1_s64(pg, out+2,v2);
  svst1_s64(pg, out+3,v3);
}
```

Before these patches:

```
bar:
ld4d{z4.d - z7.d}, p0/z, [x0]
mov z26.d, z4.d

[PATCH V2 4/7] ira: Support subreg copy

2023-11-12 Thread Lehua Ding
This patch changes copies so that they are created between objects rather
than between allocnos.

gcc/ChangeLog:

* ira-build.cc (find_allocno_copy): Removed.
(find_object): New.
(ira_create_copy): Adjust.
(add_allocno_copy_to_list): Adjust.
(swap_allocno_copy_ends_if_necessary): Adjust.
(ira_add_allocno_copy): Adjust.
(print_copy): Adjust.
(print_allocno_copies): Adjust.
(ira_flattening): Adjust.
* ira-color.cc (INCLUDE_VECTOR): Include vector.
(struct allocno_color_data): Adjust.
(struct allocno_hard_regs_subnode): Adjust.
(form_allocno_hard_regs_nodes_forest): Adjust.
(update_left_conflict_sizes_p): Adjust.
(struct update_cost_queue_elem): Adjust.
(queue_update_cost): Adjust.
(get_next_update_cost): Adjust.
(update_costs_from_allocno): Adjust.
(update_conflict_hard_regno_costs): Adjust.
(assign_hard_reg): Adjust.
(objects_conflict_by_live_ranges_p): New.
(allocno_thread_conflict_p): Adjust.
(object_thread_conflict_p): Ditto.
(merge_threads): Ditto.
(form_threads_from_copies): Ditto.
(form_threads_from_bucket): Ditto.
(form_threads_from_colorable_allocno): Ditto.
(init_allocno_threads): Ditto.
(add_allocno_to_bucket): Ditto.
(delete_allocno_from_bucket): Ditto.
(allocno_copy_cost_saving): Ditto.
(color_allocnos): Ditto.
(color_pass): Ditto.
(update_curr_costs): Ditto.
(coalesce_allocnos): Ditto.
(ira_reuse_stack_slot): Ditto.
(ira_initiate_assign): Ditto.
(ira_finish_assign): Ditto.
* ira-conflicts.cc (allocnos_conflict_for_copy_p): Ditto.
(REG_SUBREG_P): Ditto.
(subreg_move_p): New.
(regs_non_conflict_for_copy_p): New.
(subreg_reg_align_and_times_p): New.
(process_regs_for_copy): Ditto.
(add_insn_allocno_copies): Ditto.
(propagate_copies): Ditto.
* ira-emit.cc (add_range_and_copies_from_move_list): Ditto.
* ira-int.h (struct ira_allocno_copy): Ditto.
(ira_add_allocno_copy): Ditto.
(find_object): Exported.
(subreg_move_p): Exported.
* ira.cc (print_redundant_copies): Exported.

---
 gcc/ira-build.cc | 154 +++-
 gcc/ira-color.cc | 541 +++
 gcc/ira-conflicts.cc | 173 +++---
 gcc/ira-emit.cc  |  10 +-
 gcc/ira-int.h|  10 +-
 gcc/ira.cc   |   5 +-
 6 files changed, 646 insertions(+), 247 deletions(-)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index a32693e69e4..13f0f7336ed 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -36,9 +36,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "subreg-live-range.h"
 
-static ira_copy_t find_allocno_copy (ira_allocno_t, ira_allocno_t, rtx_insn *,
-ira_loop_tree_node_t);
-
 /* The root of the loop tree corresponding to the all function.  */
 ira_loop_tree_node_t ira_loop_tree_root;
 
@@ -520,6 +517,16 @@ find_object (ira_allocno_t a, poly_int64 offset, 
poly_int64 size)
   return find_object (a, subreg_start, subreg_nregs);
 }
 
+/* Return object in allocno A for REG.  */
+ira_object_t
+find_object (ira_allocno_t a, rtx reg)
+{
+  if (has_subreg_object_p (a) && read_modify_subreg_p (reg))
+return find_object (a, SUBREG_BYTE (reg), GET_MODE_SIZE (GET_MODE (reg)));
+  else
+return find_object (a, 0, ALLOCNO_NREGS (a));
+}
+
 /* Return the object in allocno A which match START & NREGS.  Create when not
found.  */
 ira_object_t
@@ -1503,27 +1510,36 @@ initiate_copies (void)
 /* Return copy connecting A1 and A2 and originated from INSN of
LOOP_TREE_NODE if any.  */
 static ira_copy_t
-find_allocno_copy (ira_allocno_t a1, ira_allocno_t a2, rtx_insn *insn,
+find_allocno_copy (ira_object_t obj1, ira_object_t obj2, rtx_insn *insn,
   ira_loop_tree_node_t loop_tree_node)
 {
   ira_copy_t cp, next_cp;
-  ira_allocno_t another_a;
+  ira_object_t another_obj;
 
+  ira_allocno_t a1 = OBJECT_ALLOCNO (obj1);
   for (cp = ALLOCNO_COPIES (a1); cp != NULL; cp = next_cp)
 {
-  if (cp->first == a1)
+  ira_allocno_t first_a = OBJECT_ALLOCNO (cp->first);
+  ira_allocno_t second_a = OBJECT_ALLOCNO (cp->second);
+  if (first_a == a1)
{
  next_cp = cp->next_first_allocno_copy;
- another_a = cp->second;
+ if (cp->first == obj1)
+   another_obj = cp->second;
+ else
+   continue;
}
-  else if (cp->second == a1)
+  else if (second_a == a1)
{
  next_cp = cp->next_second_allocno_copy;
- another_a = cp->first;
+ if (cp->second == obj1)
+   another_obj = cp->first;
+ else
+   continue;
}
   else
gcc_unreachable ();
-  if (another_a == a2 && c

[PATCH V2 3/7] ira: Support subreg live range track

2023-11-12 Thread Lehua Ding
This patch supports tracking subreg liveness. It first changes
ira_object_t objects[2] to std::vector<ira_object_t> objects,
which can hold any number of objects. The vector is used to collect
every access via subreg in the program, as well as the partial_in and
partial_out sets of each basic block's live in/out.

It then changes the way conflicts between registers are detected.
For example, if object a conflicts with object b, the offset and size
of each object relative to the allocno it belongs to must be taken
into account to compute the conflicting registers between the two
allocnos.
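
The offset-aware conflict computation can be sketched as follows. This is
only an illustration under simplifying assumptions, not the implementation
in the patch: ObjectRange, ranges_overlap and forbidden_bases are
hypothetical names, and hard registers are modelled as plain integers.

```cpp
#include <cassert>
#include <set>

// Hypothetical description of an object: the hard registers it covers,
// counted relative to the first hard register of its allocno.
struct ObjectRange
{
  int start;  // offset of the first covered register within the allocno
  int nregs;  // number of covered registers
};

// True if [lo1, lo1 + n1) and [lo2, lo2 + n2) share a register.
static bool
ranges_overlap (int lo1, int n1, int lo2, int n2)
{
  return lo1 < lo2 + n2 && lo2 < lo1 + n1;
}

// Allocno B is already placed at hard register B_BASE, and its object B_OBJ
// conflicts with object A_OBJ of allocno A, which occupies A_NREGS registers.
// Return every base register for A that would make the two objects collide.
std::set<int>
forbidden_bases (ObjectRange a_obj, ObjectRange b_obj, int b_base,
                 int a_nregs, int num_hard_regs)
{
  std::set<int> forbidden;
  for (int base = 0; base + a_nregs <= num_hard_regs; base++)
    if (ranges_overlap (base + a_obj.start, a_obj.nregs,
                        b_base + b_obj.start, b_obj.nregs))
      forbidden.insert (base);
  return forbidden;
}
```

With a two-register allocno A and a single-register allocno B placed at
register 5, a conflict on only the second half of A rules out base 4 alone,
whereas a conflict on the whole of A rules out bases 4 and 5.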

gcc/ChangeLog:

* hard-reg-set.h (struct HARD_REG_SET): New shift operator.
* ira-build.cc (ira_create_object): Adjust.
(find_object): New.
(find_object_anyway): New.
(ira_create_allocno): Adjust.
(get_range): New.
(ira_copy_allocno_objects): New.
(merge_hard_reg_conflicts): Adjust copy.
(create_cap_allocno): Adjust.
(find_subreg_p): New.
(add_subregs): New.
(create_insn_allocnos): Collect subreg.
(create_bb_allocnos): Ditto.
(move_allocno_live_ranges): Adjust.
(copy_allocno_live_ranges): Adjust.
(setup_min_max_allocno_live_range_point): Adjust.
* ira-color.cc (INCLUDE_MAP): include map.
(setup_left_conflict_sizes_p): Adjust conflict size.
(setup_profitable_hard_regs): Adjust.
(get_conflict_and_start_profitable_regs): Adjust.
(check_hard_reg_p): Adjust conflict check.
(assign_hard_reg): Adjust.
(push_allocno_to_stack): Adjust conflict size.
(improve_allocation): Adjust.
* ira-conflicts.cc (record_object_conflict): Simplify.
(build_object_conflicts): Adjust.
(build_conflicts): Adjust.
(print_allocno_conflicts): Adjust.
* ira-emit.cc (modify_move_list): Adjust.
* ira-int.h (struct ira_object): Adjust struct.
(struct ira_allocno): Adjust struct.
(ALLOCNO_NUM_OBJECTS): New accessor.
(ALLOCNO_UNIT_SIZE): Ditto.
(ALLOCNO_TRACK_SUBREG_P): Ditto.
(ALLOCNO_NREGS): Ditto.
(OBJECT_SUBWORD): Ditto.
(OBJECT_INDEX): Ditto.
(OBJECT_START): Ditto.
(OBJECT_NREGS): Ditto.
(find_object): Exported.
(find_object_anyway): Ditto.
(ira_copy_allocno_objects): Ditto.
(has_subreg_object_p): Ditto.
(get_full_object): Ditto.
* ira-lives.cc (INCLUDE_VECTOR): Include vector.
(add_onflict_hard_regs): New.
(add_onflict_hard_reg): New.
(make_hard_regno_dead): Adjust.
(make_object_live): Adjust.
(update_allocno_pressure_excess_length): Adjust.
(make_object_dead): Adjust.
(mark_pseudo_regno_live): Adjust.
(add_subreg_point): New.
(mark_pseudo_object_live): Adjust.
(mark_pseudo_regno_subword_live): Adjust.
(mark_pseudo_regno_subreg_live): Adjust.
(mark_pseudo_regno_subregs_live): Adjust.
(mark_pseudo_reg_live): Adjust.
(mark_pseudo_regno_dead): Adjust.
(mark_pseudo_object_dead): Adjust.
(mark_pseudo_regno_subword_dead): Adjust.
(mark_pseudo_regno_subreg_dead): Adjust.
(mark_pseudo_reg_dead): Adjust.
(process_single_reg_class_operands): Adjust.
(process_out_of_region_eh_regs): Adjust.
(add_conflict_from_region_landing_pads): Adjust.
(process_bb_node_lives): Adjust.
(class subreg_live_item): New class.
(create_subregs_live_ranges): New function.
(ira_create_allocno_live_ranges): Adjust.
* ira.cc (check_allocation): Adjust.

---
 gcc/hard-reg-set.h   |  33 +++
 gcc/ira-build.cc | 235 +---
 gcc/ira-color.cc | 302 +-
 gcc/ira-conflicts.cc |  48 ++---
 gcc/ira-emit.cc  |   2 +-
 gcc/ira-int.h|  57 -
 gcc/ira-lives.cc | 500 ---
 gcc/ira.cc   |  52 ++---
 8 files changed, 907 insertions(+), 322 deletions(-)

diff --git a/gcc/hard-reg-set.h b/gcc/hard-reg-set.h
index b0bb9bce074..760eadba186 100644
--- a/gcc/hard-reg-set.h
+++ b/gcc/hard-reg-set.h
@@ -113,6 +113,39 @@ struct HARD_REG_SET
 return !operator== (other);
   }
 
+  HARD_REG_SET
+  operator>> (unsigned int shift_amount) const
+  {
+if (shift_amount == 0)
+  return *this;
+
+HARD_REG_SET res;
+unsigned int total_bits = sizeof (HARD_REG_ELT_TYPE) * 8;
+if (shift_amount >= total_bits)
+  {
+   unsigned int n_elt = shift_amount % total_bits;
+   shift_amount -= n_elt * total_bits;
+   for (unsigned int i = 0; i < ARRAY_SIZE (elts) - n_elt - 1; i += 1)
+ res.elts[i] = elts[i + n_elt];
+   /* clear upper n_elt elements.  */
+   for (unsigned int i = 0; i < n_elt; i += 1)
+ res.elts[ARRAY_SIZE (elts) - 1 - i] = 0;
+  }
+
+if (shift_amount > 0)
+  {
+   /* The left bits of an element be

[PATCH V2 2/7] ira: Switch to live_subreg data

2023-11-12 Thread Lehua Ding
This patch switches ira from live_reg data to live_subreg data.

gcc/ChangeLog:

* ira-build.cc (create_bb_allocnos): Switch.
(create_loop_allocnos): Ditto.
* ira-color.cc (ira_loop_edge_freq): Ditto.
* ira-emit.cc (generate_edge_moves): Ditto.
(add_ranges_and_copies): Ditto.
* ira-lives.cc (process_out_of_region_eh_regs): Ditto.
(add_conflict_from_region_landing_pads): Ditto.
(process_bb_node_lives): Ditto.
* ira.cc (find_moveable_pseudos): Ditto.
(interesting_dest_for_shprep_1): Ditto.
(allocate_initial_values): Ditto.
(ira): Ditto.

---
 gcc/ira-build.cc |  7 ---
 gcc/ira-color.cc |  8 
 gcc/ira-emit.cc  | 12 ++--
 gcc/ira-lives.cc |  7 ---
 gcc/ira.cc   | 16 +---
 5 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/gcc/ira-build.cc b/gcc/ira-build.cc
index 93e46033170..f931c6e304c 100644
--- a/gcc/ira-build.cc
+++ b/gcc/ira-build.cc
@@ -1919,7 +1919,8 @@ create_bb_allocnos (ira_loop_tree_node_t bb_node)
   create_insn_allocnos (PATTERN (insn), NULL, false);
   /* It might be a allocno living through from one subloop to
  another.  */
-  EXECUTE_IF_SET_IN_REG_SET (df_get_live_in (bb), FIRST_PSEUDO_REGISTER, i, bi)
+  EXECUTE_IF_SET_IN_REG_SET (DF_LIVE_SUBREG_IN (bb), FIRST_PSEUDO_REGISTER,
+i, bi)
 if (ira_curr_regno_allocno_map[i] == NULL)
   ira_create_allocno (i, false, ira_curr_loop_tree_node);
 }
@@ -1935,9 +1936,9 @@ create_loop_allocnos (edge e)
   bitmap_iterator bi;
   ira_loop_tree_node_t parent;
 
-  live_in_regs = df_get_live_in (e->dest);
+  live_in_regs = DF_LIVE_SUBREG_IN (e->dest);
   border_allocnos = ira_curr_loop_tree_node->border_allocnos;
-  EXECUTE_IF_SET_IN_REG_SET (df_get_live_out (e->src),
+  EXECUTE_IF_SET_IN_REG_SET (DF_LIVE_SUBREG_OUT (e->src),
 FIRST_PSEUDO_REGISTER, i, bi)
 if (bitmap_bit_p (live_in_regs, i))
   {
diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
index f2e8ea34152..4aa3e316282 100644
--- a/gcc/ira-color.cc
+++ b/gcc/ira-color.cc
@@ -2783,8 +2783,8 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int regno, bool exit_p)
   FOR_EACH_EDGE (e, ei, loop_node->loop->header->preds)
if (e->src != loop_node->loop->latch
&& (regno < 0
-   || (bitmap_bit_p (df_get_live_out (e->src), regno)
-   && bitmap_bit_p (df_get_live_in (e->dest), regno))))
+   || (bitmap_bit_p (DF_LIVE_SUBREG_OUT (e->src), regno)
+   && bitmap_bit_p (DF_LIVE_SUBREG_IN (e->dest), regno))))
  freq += EDGE_FREQUENCY (e);
 }
   else
@@ -2792,8 +2792,8 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int regno, bool exit_p)
   auto_vec<edge> edges = get_loop_exit_edges (loop_node->loop);
   FOR_EACH_VEC_ELT (edges, i, e)
if (regno < 0
-   || (bitmap_bit_p (df_get_live_out (e->src), regno)
-   && bitmap_bit_p (df_get_live_in (e->dest), regno)))
+   || (bitmap_bit_p (DF_LIVE_SUBREG_OUT (e->src), regno)
+   && bitmap_bit_p (DF_LIVE_SUBREG_IN (e->dest), regno)))
  freq += EDGE_FREQUENCY (e);
 }
 
diff --git a/gcc/ira-emit.cc b/gcc/ira-emit.cc
index bcc4f09f7c4..84ed482e568 100644
--- a/gcc/ira-emit.cc
+++ b/gcc/ira-emit.cc
@@ -510,8 +510,8 @@ generate_edge_moves (edge e)
 return;
   src_map = src_loop_node->regno_allocno_map;
   dest_map = dest_loop_node->regno_allocno_map;
-  regs_live_in_dest = df_get_live_in (e->dest);
-  regs_live_out_src = df_get_live_out (e->src);
+  regs_live_in_dest = DF_LIVE_SUBREG_IN (e->dest);
+  regs_live_out_src = DF_LIVE_SUBREG_OUT (e->src);
   EXECUTE_IF_SET_IN_REG_SET (regs_live_in_dest,
 FIRST_PSEUDO_REGISTER, regno, bi)
 if (bitmap_bit_p (regs_live_out_src, regno))
@@ -1229,16 +1229,16 @@ add_ranges_and_copies (void)
 destination block) to use for searching allocnos by their
 regnos because of subsequent IR flattening.  */
   node = IRA_BB_NODE (bb)->parent;
-  bitmap_copy (live_through, df_get_live_in (bb));
+  bitmap_copy (live_through, DF_LIVE_SUBREG_IN (bb));
   add_range_and_copies_from_move_list
(at_bb_start[bb->index], node, live_through, REG_FREQ_FROM_BB (bb));
-  bitmap_copy (live_through, df_get_live_out (bb));
+  bitmap_copy (live_through, DF_LIVE_SUBREG_OUT (bb));
   add_range_and_copies_from_move_list
(at_bb_end[bb->index], node, live_through, REG_FREQ_FROM_BB (bb));
   FOR_EACH_EDGE (e, ei, bb->succs)
{
- bitmap_and (live_through,
- df_get_live_in (e->dest), df_get_live_out (bb));
+ bitmap_and (live_through, DF_LIVE_SUBREG_IN (e->dest),
+ DF_LIVE_SUBREG_OUT (bb));
  add_range_and_copies_from_move_list
((move_t) e->aux, node, live_through,
 REG_FREQ_FROM_EDGE_FREQ (EDGE_FREQUE

Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-11-12 Thread Sam James


Alexander Monakov writes:

> On Sat, 11 Nov 2023, Sam James wrote:
>
>> > Valgrind client requests are offered as macros that emit inline asm.  For use
>> > in code generation, we need to wrap it in a built-in.  We know that implementing
>> > such a built-in in libgcc is undesirable, [...].
>> 
>> Perhaps less objectionable than you think, at least given the new CFR
>> stuff from oliva from the other week that landed.
>
> Yeah; we haven't found any better solution anyway.
>
>> This is a really neat idea (it also makes me wonder if there's any other
>> opportunities for Valgrind integration like this?).
>
> To (attempt to) answer the parenthetical question, note that the patch is not
> limited to instrumenting C++ cdtors, it annotates all lifetime CLOBBER marks,
> so Valgrind should see lifetime boundaries of various on-stack arrays too.

Oh, right!

>
> (I hope positioning the new pass after build_ssa is sufficient to avoid
> annotating too much, like non-address-taken local scalars)
>
>> LLVM was the most recent example but it wasn't the first, and this has
>> come up with LLVM in a few other places too (same root cause, wasn't
>> obvious at all).
>
> I'm very curious what you mean by "this has come up with LLVM [] too": ttbomk,
> LLVM doesn't do such lifetime-based optimization yet, which is why compiling
> LLVM with LLVM doesn't break it. Can you share some examples? Or do you mean
> instances when libLLVM-miscompiled-with-GCC was linked elsewhere, and that
> program crashed mysteriously as a result?
>
> Indeed this work is inspired by the LLVM incident in PR 106943.

For that part, I meant that we had a _lot_ of different variations of
the LLVM miscompilation over the years where people would report issues
when building LLVM with GCC but trying to investigate it didn't get very
far.

i.e. It was very hard to debug and something like this would've made
things substantially easier (it takes us from the realm of "uhh maybe
compiler bug" to provable UB, which we're very used to handling).

It doesn't help that ubsan and friends can't find this, which means the
cause often goes undetermined and gets worked around by either
disabling LTO or disabling various passes as a heavy hammer.

I didn't think to try toggling it when I hit issues until PR 106943.

> we don't see many other instances with -flifetime-dse workarounds in public.
> Grepping Gentoo Portage reveals only openjade. Arch applies the workaround to
> a jvm package too, and we know that Firefox and LLVM apply it on their own.
>

I had some vague memories in the back of my head so I went digging
because I enjoy this:

* Qt has hit this in the past, and I actually wonder if it's the cause
of a few other nebulous LTO issues (which we've never had a solid report
for) as well:
https://codereview.qt-project.org/c/qt/qtdeclarative/+/176272
https://bugs.gentoo.org/584818
https://bugs.gentoo.org/626070

We've given up for Qt 5 and I plan on revisiting any possible issues
with LTO with Qt 6 instead.

* Firefox as you noted: https://bugzilla.mozilla.org/show_bug.cgi?id=1232696.
* TBB (no real details available, unfortunately):
  https://github.com/oneapi-src/oneTBB/commit/51c0b2f742920535178560f31c6e91065bf87b41
* crypto++ has had a series of mysterious miscompilations and I suspect
  it may be a victim here. See
  https://github.com/weidai11/cryptopp/issues/1141#issuecomment-1208169530
  onwards but it's not the first bug like this crypto++ has had.
* codeblocks: https://bugs.gentoo.org/625696
* coin: https://bugs.gentoo.org/619378

(I would not be surprised if other wxwidgets applications are victims,
e.g. older kicad or audacity.)

Some of the results for GCC 6 at
https://bugs.gentoo.org/showdependencytree.cgi?id=582084&hide_resolved=0
were from other optimisation changes, but some are DSE (and some might
have been DSE but masked by another flag as I mentioned earlier).

If someone is extremely bored, they may want to look through our LTO tracker /
various -fno-lto/filter-lto grep results in gentoo.git to see if any
of them seem fishy. In the past, people would filter LTO without
examining the real problem. I am trying to fix this but you can't do that
overnight.

Another good heuristic, I bet, is anything passing -O1, filtering -O3,
or if you want a general sense of crustiness, where the ebuild has to pass
-std=c++98 or so. But I'm just trying to give you fodder if you want
more examples with that last suggestion.

With less detail, I have mentions for the following in some git checkouts
from other distributions. I could not find any bug references:
* injeqt (Void Linux)
* opencollada (Void Linux)

> This patch finds the issue in LLVM and openjade; testing it on Spidermonkey
> is TODO. Suggestions for other interesting tests would be welcome.
>
>> > --- a/libgcc/configure.ac
>> > +++ b/libgcc/configure.ac
>> > @@ -269,6 +269,54 @@ GCC_CHECK_SJLJ_EXCEPTIONS
>> >  GCC_CET_FLAGS(CET_FLAGS)
>> >  AC_SUBST(CET_FLAGS)
>> >  
>> > +AC_CHECK_HEAD

Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-11-12 Thread Alexander Monakov


On Sat, 11 Nov 2023, Sam James wrote:

> > Valgrind client requests are offered as macros that emit inline asm.  For use
> > in code generation, we need to wrap it in a built-in.  We know that implementing
> > such a built-in in libgcc is undesirable, [...].
> 
> Perhaps less objectionable than you think, at least given the new CFR
> stuff from oliva from the other week that landed.

Yeah; we haven't found any better solution anyway.

> This is a really neat idea (it also makes me wonder if there's any other
> opportunities for Valgrind integration like this?).

To (attempt to) answer the parenthetical question, note that the patch is not
limited to instrumenting C++ cdtors, it annotates all lifetime CLOBBER marks,
so Valgrind should see lifetime boundaries of various on-stack arrays too.

(I hope positioning the new pass after build_ssa is sufficient to avoid
annotating too much, like non-address-taken local scalars)
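
As a minimal sketch of the kind of code this affects: the class below is hypothetical and only illustrates the pattern. Its `operator new` zero-fills the storage before the constructor runs. The variant shown is well-defined because the constructor initialises the member itself; the comment describes the problematic variant, which is the one the lifetime annotations are meant to make visible to Valgrind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>

// Hypothetical class, for illustration only.
struct Node
{
  int tag;

  // Zero-fills the storage before the constructor runs.
  static void *operator new (std::size_t sz)
  {
    void *p = ::operator new (sz);
    std::memset (p, 0, sz);
    return p;
  }
  static void operator delete (void *p) { ::operator delete (p); }

  // Well-defined: the member is initialised inside the object's lifetime.
  // A constructor that left `tag` untouched and relied on the memset above
  // would read an indeterminate value, and with -flifetime-dse GCC may
  // treat that memset as a dead store and remove it.
  Node () : tag (0) {}
};

// Allocates a Node, reads the member the constructor set, and frees it.
int make_and_read_tag ()
{
  Node *n = new Node;
  int t = n->tag;
  delete n;
  return t;
}
```

Only the well-defined form is exercised here; the broken variant is left as a comment because its behaviour is not something an example can promise.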

> LLVM was the most recent example but it wasn't the first, and this has
> come up with LLVM in a few other places too (same root cause, wasn't
> obvious at all).

I'm very curious what you mean by "this has come up with LLVM [] too": ttbomk,
LLVM doesn't do such lifetime-based optimization yet, which is why compiling
LLVM with LLVM doesn't break it. Can you share some examples? Or do you mean
instances when libLLVM-miscompiled-with-GCC was linked elsewhere, and that
program crashed mysteriously as a result?

Indeed this work is inspired by the LLVM incident in PR 106943. Unfortunately
we don't see many other instances with -flifetime-dse workarounds in public.
Grepping Gentoo Portage reveals only openjade. Arch applies the workaround to
a jvm package too, and we know that Firefox and LLVM apply it on their own.

This patch finds the issue in LLVM and openjade; testing it on Spidermonkey
is TODO. Suggestions for other interesting tests would be welcome.

> > --- a/libgcc/configure.ac
> > +++ b/libgcc/configure.ac
> > @@ -269,6 +269,54 @@ GCC_CHECK_SJLJ_EXCEPTIONS
> >  GCC_CET_FLAGS(CET_FLAGS)
> >  AC_SUBST(CET_FLAGS)
> >  
> > +AC_CHECK_HEADER(valgrind.h, have_valgrind_h=yes, have_valgrind_h=no)
> 
> Consider using PKG_CHECK_MODULES and falling back to a manual search.

Thanks. The autotools bits in this patch are a one-to-one copy of the pre-existing
Valgrind detection in the 'gcc' subdirectory where it's necessary for
running the compiler under Valgrind without false positives.

I guess the right solution is to move Valgrind detection into the top-level
'config' directory (and apply the cleanups you mention), but as we are not
familiar with autotools we just made the copy-paste for this RFC.

With the patch, --enable-valgrind-annotations becomes "overloaded" to
simultaneously instrument the compiler and enhance libgcc to support
-fvalgrind-emit-annotations, but those are independent and in practice
people may need the latter without the former.

Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-12 Thread Alexander Monakov

On Sat, 11 Nov 2023, Arsen Arsenović wrote:

> > +#else
> > +# define VALGRIND_MAKE_MEM_UNDEFINED(ptr, sz) __builtin_trap ()
> > +#endif
> > +
> > +void __valgrind_make_mem_undefined (void *ptr, unsigned long sz)
> > +{
> > +  VALGRIND_MAKE_MEM_UNDEFINED (ptr, sz);
> > +}
> 
> Would it be preferable to have a link-time error here if missing?

Indeed, thank you for the suggestion, will keep that in mind for resending.
That will make the problem noticeable earlier, and the user will be able
to drop this snippet into their project to resolve the issue.
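
A rough sketch of what such a drop-in could look like in a user's project, under assumptions: the function name and signature are taken from the quoted hunk, the header path `valgrind/memcheck.h` is the usual install location, and the no-op fallback exists only so this sketch builds on machines without the Valgrind headers (a real drop-in would presumably require them).

```cpp
#include <cassert>

#if defined(__has_include)
# if __has_include(<valgrind/memcheck.h>)
#  include <valgrind/memcheck.h>
# endif
#endif

#ifndef VALGRIND_MAKE_MEM_UNDEFINED
// Assumption for this sketch only: without the Valgrind headers, fall back
// to a no-op so the file still compiles.
# define VALGRIND_MAKE_MEM_UNDEFINED(ptr, sz) ((void) (ptr), (void) (sz))
#endif

// User-provided definition of the helper the instrumented code calls.
// Outside Valgrind the client request does nothing; under Memcheck it marks
// the range as undefined without changing the bytes themselves.
extern "C" void
__valgrind_make_mem_undefined (void *ptr, unsigned long sz)
{
  VALGRIND_MAKE_MEM_UNDEFINED (ptr, sz);
}
```

With a link-time error instead of a runtime trap, a missing helper shows up as an undefined reference to this symbol, and adding a file like the one above is the local fix.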

Alexander