Re: [PATCH] x86: Disable stack protector for naked functions

2024-10-04 Thread Uros Bizjak
On Fri, Oct 4, 2024 at 2:11 PM H.J. Lu  wrote:
>
> Since naked functions should not enable stack protector, define
> TARGET_STACK_PROTECT_RUNTIME_ENABLED_P to disable stack protector
> for naked functions.
>
> gcc/
>
> PR target/116962
> * config/i386/i386.cc (ix86_stack_protect_runtime_enabled_p): New
> function.
> (TARGET_STACK_PROTECT_RUNTIME_ENABLED_P): New.
>
> gcc/testsuite/
>
> PR target/116962
> * gcc.target/i386/pr116962.c: New file.
>
> OK for master?

OK, also for backports.

Thanks,
Uros.
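
For illustration, the case being fixed, as a minimal sketch along the lines
of the new pr116962.c test (the actual testcase is not quoted in this
thread): a naked function has no prologue or epilogue, so no stack-protector
canary code may be emitted for it.

    /* Hypothetical reproducer; compile with -O2 -fstack-protector-all.  */
    __attribute__((naked))
    void
    foo (void)
    {
      __asm__ ("ret");
    }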


Re: [PATCH] ssa-math-opts, i386: Improve spaceship expansion [PR116896]

2024-10-04 Thread Uros Bizjak
On Fri, Oct 4, 2024 at 11:58 AM Jakub Jelinek  wrote:
>
> Hi!
>
> The PR notes that we don't emit optimal code for C++ spaceship
> operator if the result is returned as an integer rather than the
> result just being compared against different values and different
> code executed based on that.
> So e.g. for
> template <typename T>
> auto foo (T x, T y) { return x <=> y; }
> for both floating point types, signed integer types and unsigned integer
> types.  auto in that case is std::strong_ordering or std::partial_ordering,
> which are fancy C++ abstractions around struct with signed char member
> which is -1, 0, 1 for the strong ordering and -1, 0, 1, 2 for the partial
> ordering (but for -ffast-math 2 is never the case).
> I'm afraid functions like that are fairly common and unless they are
> inlined, we really need to map the comparison to those -1, 0, 1 or
> -1, 0, 1, 2 values.
>
> Now, for floating point spaceship I've in the past already added an
> optimization (with tree-ssa-math-opts.cc discovery and named optab, the
> optab only defined on x86 though right now), which ensures there is just
> a single comparison instruction and then just tests based on flags.
> Now, if we have code like:
>   auto a = x <=> y;
>   if (a == std::partial_ordering::less)
> bar ();
>   else if (a == std::partial_ordering::greater)
> baz ();
>   else if (a == std::partial_ordering::equivalent)
> qux ();
>   else if (a == std::partial_ordering::unordered)
> corge ();
> etc., that results in decent code generation, the spaceship named pattern
> on x86 optimizes for the jumps, so emits comparisons on the flags, followed
> by setting the result to -1, 0, 1, 2 and subsequent jump pass optimizes that
> well.  But if the result needs to be stored into an integer and just
> returned that way or there are no immediate jumps based on it (or turned
> into some non-standard integer values like -42, 0, 36, 75 etc.), then CE
> doesn't do a good job for that, we end up with say
> comiss  %xmm1, %xmm0
> jp      .L4
> seta    %al
> movl    $0, %edx
> leal    -1(%rax,%rax), %eax
> cmove   %edx, %eax
> ret
> .L4:
> movl    $2, %eax
> ret
> The jp is good, that is the unlikely case and can't be easily handled in
> straight line code due to the layout of the flags, but the rest uses cmov
> which often isn't a win, and weird math.
> With the patch below we can get instead
> xorl    %eax, %eax
> comiss  %xmm1, %xmm0
> jp      .L2
> seta    %al
> sbbl    $0, %eax
> ret
> .L2:
> movl    $2, %eax
> ret
>
> The patch changes the discovery in the generic code, by detecting if
> the future .SPACESHIP result is just used in a PHI with -1, 0, 1 or
> -1, 0, 1, 2 values (the latter for HONOR_NANS) and passes that as a flag in
> a new argument to .SPACESHIP ifn, so that the named pattern is told whether
> it should optimize for branches or for loading the result into a -1, 0, 1
> (, 2) integer.  Additionally, it doesn't detect just floating point <=>
> anymore, but also integer and unsigned integer, but in those cases only
> if an integer -1, 0, 1 is wanted (otherwise == and > or similar comparisons
> result in good code).
> The backend then can for those integer or unsigned integer <=>s return
> effectively (x > y) - (x < y) in a way that is efficient on the target
> (so for x86 with ensuring zero initialization first when needed before
> setcc; one for floating point and unsigned, where there is just one setcc
> and the second one optimized into sbb instruction, two for the signed int
> case).  So e.g. for signed int we now emit
> xorl    %edx, %edx
> xorl    %eax, %eax
> cmpl    %esi, %edi
> setl    %dl
> setg    %al
> subl    %edx, %eax
> ret
> and for unsigned
> xorl    %eax, %eax
> cmpl    %esi, %edi
> seta    %al
> sbbb    $0, %al
> ret
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> Note, I wonder if other targets wouldn't benefit from defining the
> named optab too...
>
> 2024-10-04  Jakub Jelinek  
>
> PR middle-end/116896
> * optabs.def (spaceship_optab): Use spaceship$a4 rather than
> spaceship$a3.
> * internal-fn.cc (expand_SPACESHIP): Expect 3 call arguments
> rather than 2, expand the last one, expect 4 operands of
> spaceship_optab.
> * tree-ssa-math-opts.cc: Include cfghooks.h.
> (optimize_spaceship): Check if a single PHI is initialized to
> -1, 0, 1, 2 or -1, 0, 1 values, in that case pass 1 as last (new)
> argument to .SPACESHIP and optimize away the comparisons,
> otherwise pass 0.  Also check for integer comparisons rather than
> floating point, in that case do it only if there is a single PHI
> with -1, 0, 1 values and pass 1 to last argument of .SPACESHIP
> if t
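
For reference, the integer mapping described above (the backend expanding
integer <=> as effectively (x > y) - (x < y)) in scalar form, as an
illustrative sketch rather than code from the patch:

    /* Signed: two setcc and a subtract, as in the sequence shown above.  */
    int cmp (int x, int y) { return (x > y) - (x < y); }

    /* Unsigned: one setcc and an sbb suffice, as shown above.  */
    int ucmp (unsigned x, unsigned y) { return (x > y) - (x < y); }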

Re: [PATCH] i386: Fix up ix86_expand_int_compare with TImode comparisons of SUBREGs from V8{H,B}Fmode against zero [PR116921]

2024-10-04 Thread Uros Bizjak
On Fri, Oct 4, 2024 at 12:12 PM Jakub Jelinek  wrote:
>
> Hi!
>
> The following testcase ICEs, because the ix86_expand_int_compare
> optimization to use {,v}ptest assumes there are instructions for all
> 16-byte vector modes.  That isn't the case, we only have one for
> V16QI, V8HI, V4SI, V2DI, V1TI, V4SF and V2DF, not for
> V8HF nor V8BF.
>
> The following patch fixes that by using the V8HI instruction instead
> for those 2 modes.  tmp can't be a SUBREG, because it is SUBREG_REG
> of another SUBREG, so we don't need to worry about gen_lowpart
> failing.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2024-10-04  Jakub Jelinek  
>
> PR target/116921
> * config/i386/i386-expand.cc (ix86_expand_int_compare): Add a SUBREG
> to V8HImode from V8HFmode or V8BFmode before generating a ptest.
>
> * gcc.target/i386/pr116921.c: New test.

OK.

Thanks,
Uros.
>
> --- gcc/config/i386/i386-expand.cc.jj   2024-10-03 17:27:28.328227793 +0200
> +++ gcc/config/i386/i386-expand.cc  2024-10-03 18:11:18.514076904 +0200
> @@ -3095,6 +3095,9 @@ ix86_expand_int_compare (enum rtx_code c
>&& GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
>  {
>tmp = SUBREG_REG (op0);
> +  if (GET_MODE_INNER (GET_MODE (tmp)) == HFmode
> + || GET_MODE_INNER (GET_MODE (tmp)) == BFmode)
> +   tmp = gen_lowpart (V8HImode, tmp);
>tmp = gen_rtx_UNSPEC (CCZmode, gen_rtvec (2, tmp, tmp), UNSPEC_PTEST);
>  }
>else
> --- gcc/testsuite/gcc.target/i386/pr116921.c.jj 2024-10-03 18:16:36.368711747 
> +0200
> +++ gcc/testsuite/gcc.target/i386/pr116921.c2024-10-03 18:17:25.702034243 
> +0200
> @@ -0,0 +1,12 @@
> +/* PR target/116921 */
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O2 -msse4" } */
> +
> +long x;
> +_Float16 __attribute__((__vector_size__ (16))) f;
> +
> +void
> +foo (void)
> +{
> +  x -= !(__int128) (f / 2);
> +}
>
> Jakub
>


Re: [PATCH] i386: Fix up *minmax3_2 splitter [PR116925]

2024-10-04 Thread Uros Bizjak
On Fri, Oct 4, 2024 at 12:07 PM Jakub Jelinek  wrote:
>
> Hi!
>
> While *minmax3_1 correctly uses
>if (MEM_P (operands[1]))
>  operands[1] = force_reg (mode, operands[1]);
> to ensure operands[1] is not a MEM, *minmax3_2 does it wrongly
> by calling force_reg but ignoring its return value.
>
> The following borderline obvious patch fixes that, bootstrapped/regtested
> on x86_64-linux and i686-linux, ok for trunk?
>
> Didn't find similar other errors in the backend with force_reg calls.
>
> 2024-10-04  Jakub Jelinek  
>
> PR target/116925
> * config/i386/sse.md (*minmax3_2): Assign force_reg result
> back to operands[2] instead of throwing it away.
>
> * g++.target/i386/avx-pr116925.C: New test.

OK, even obvious.

Thanks,
Uros.

>
> --- gcc/config/i386/sse.md.jj   2024-10-01 09:38:57.0 +0200
> +++ gcc/config/i386/sse.md  2024-10-03 17:33:12.071507421 +0200
> @@ -3269,7 +3269,7 @@ (define_insn_and_split "*minmax3_2
>   u = UNSPEC_IEEE_MAX;
>
> if (MEM_P (operands[2]))
> - force_reg (mode, operands[2]);
> + operands[2] = force_reg (mode, operands[2]);
> rtvec v = gen_rtvec (2, operands[2], operands[1]);
> rtx tmp = gen_rtx_UNSPEC (mode, v, u);
> emit_move_insn (operands[0], tmp);
> --- gcc/testsuite/g++.target/i386/avx-pr116925.C.jj 2024-10-03 
> 17:36:10.12406 +0200
> +++ gcc/testsuite/g++.target/i386/avx-pr116925.C2024-10-03 
> 17:35:26.805656671 +0200
> @@ -0,0 +1,12 @@
> +// PR target/116925
> +// { dg-do compile }
> +// { dg-options "-O2 -mavx -ffloat-store" }
> +
> +typedef float V __attribute__((vector_size (16)));
> +V a, b, c;
> +
> +void
> +foo ()
> +{
> +  c = a > b ? a : b;
> +}
>
> Jakub
>


Re: [RFC PATCH] Allow limited extended asm at toplevel

2024-10-02 Thread Uros Bizjak
On Wed, Oct 2, 2024 at 1:03 PM Jakub Jelinek  wrote:
>
> Hi!
>
> In the Cauldron IPA/LTO BoF we've discussed toplevel asms and it was
> discussed it would be nice to tell the compiler something about what
> the toplevel asm does.  Sure, I'm aware the kernel people said they
> aren't willing to use something like that, but perhaps other projects
> do.  And for kernel perhaps we should add some new option which allows
> some dumb parsing of the toplevel asms and gather something from that
> parsing.
>
> The following patch is just a small step towards that, namely, allow
> some subset of extended inline asm outside of functions.
> The patch is unfinished, LTO streaming (out/in) of the ASM_EXPRs isn't
> implemented, nor any cgraph/varpool changes to find out references etc.
>
> The patch allows something like:
>
> int a[2], b;
> enum { E1, E2, E3, E4, E5 };
> struct S { int a; char b; long long c; };
> asm (".section blah; .quad %P0, %P1, %P2, %P3, %P4; .previous"
>  : : "m" (a), "m" (b), "i" (42), "i" (E4), "i" (sizeof (struct S)));
>
> Even for non-LTO, that could be useful e.g. for getting enumerators from
> C/C++ as integers into the toplevel asm, or sizeof/offsetof etc.
>
> The restrictions I've implemented are:
> 1) asm qualifiers still aren't allowed, so asm goto or asm inline can't be
>specified at toplevel, asm volatile has the volatile ignored for C++ with
>a warning and is an error in C like before
> 2) I see good use for mainly input operands, output maybe to make it clear
>that the inline asm may write some memory, I don't see a good use for
>clobbers, so the patch doesn't allow those (and of course labels because
>asm goto can't be specified)
> 3) the patch allows only constraints which don't allow registers, so
>typically "m" or "i" or other memory or immediate constraints; for
>memory, it requires that the operand is addressable and its address
>could be used in static var initializer (so that no code actually
>needs to be emitted for it), for others that they are constants usable
>in the static var initializers
> 4) the patch disallows + (there is no reload of the operands, so I don't
>see benefits of tying some operands together), nor % (who cares if
>something is commutative in this case), or & (again, no code is emitted
>around the asm), nor the 0-9 constraints
>
> Right now there is no way to tell the compiler that the inline asm defines
> some symbol, I wonder if we can find some unused constraint letter or
> sequence or other syntax to denote that.  Note, I think we want to be
> able to specify that an inline asm defines a function or variable and
> be able to provide the type etc. thereof.  So
> extern void foo (void);
> extern int var;
> asm ("%P0: ret" : : "defines" (foo));
> asm ("%P0: .quad 0" : : "defines" (var));
> where the exact "defines" part is TBD.
>
> Another question is whether all targets have something like x86 P print
> modifier which doesn't add any stuff around the printed expressions
> (perhaps there are targets which don't do that just for %0 etc.), or
> whether we want to add something that will be usable on all targets.

%P is very x86 specific, perhaps you should use %c instead?

The %c modifier is described in:

6.48.2.8 Generic Operand Modifiers
..

The following table shows the modifiers supported by all targets and
their effects:

Modifier   Description                                  Example
-----------------------------------------------------------------
‘c’        Require a constant operand and print the    ‘%c0’
           constant expression with no punctuation.
...

E.g.:

void bar (void);
void foo (void)
{
  asm ("%c0" :  : "i"(bar));
}

generates:

#APP
# 5 "c.c" 1
   bar
# 0 "" 2
#NO_APP

Uros.
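
For example, with the generic %c modifier the constant operands of the RFC's
toplevel example could be written portably (a sketch, assuming the RFC's
proposed toplevel-asm semantics; memory operands would still need a separate
modifier):

    enum { E1, E2, E3, E4, E5 };
    struct S { int a; char b; long long c; };
    asm (".section blah; .quad %c0, %c1; .previous"
         : : "i" (E4), "i" (sizeof (struct S)));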


Re: [PATCH] x86: Don't use address override with segment register

2024-09-28 Thread Uros Bizjak
On Sat, Sep 28, 2024 at 5:23 AM H.J. Lu  wrote:
>
> On Wed, Sep 25, 2024, 7:06 PM Uros Bizjak  wrote:
>>
>> On Wed, Sep 25, 2024 at 11:42 AM H.J. Lu  wrote:
>> >
>> > Address override only applies to the (reg32) part in the thread address
>> > fs:(reg32).  Don't rewrite thread address like
>> >
>> > (set (reg:CCZ 17 flags)
>> > (compare:CCZ (reg:SI 98 [ __gmpfr_emax.0_1 ])
>> > (mem/c:SI (plus:SI (plus:SI (unspec:SI [
>> > (const_int 0 [0])
>> > ] UNSPEC_TP)
>> > (reg:SI 107))
>> > (const:SI (unspec:SI [
>> > (symbol_ref:SI ("previous_emax") [flags 0x1a] 
>> > )
>> > ] UNSPEC_DTPOFF))) [1 previous_emax+0 S4 A32])))
>> >
>> > if address override is used to avoid the invalid memory operand like
>> >
>> > cmpl    %fs:previous_emax@dtpoff(%eax), %r12
>> >
>> > gcc/
>> >
>> > PR target/116839
>> > * config/i386/i386.cc (ix86_rewrite_tls_address_1): Make it
>> > static.  Return if TLS address is thread register plus an integer
>> > register.
>> >
>> > gcc/testsuite/
>> >
>> > PR target/116839
>> > * gcc.target/i386/pr116839.c: New file.
>>
>> OK.
>
>
> OK to backport?

Also OK.

Thanks,
Uros.

>
> Thanks.
>
> H.J.
>>
>>
>> Thanks,
>> Uros.
>>
>> >
>> > Signed-off-by: H.J. Lu 
>> > ---
>> >  gcc/config/i386/i386.cc  |  9 -
>> >  gcc/testsuite/gcc.target/i386/pr116839.c | 48 
>> >  2 files changed, 56 insertions(+), 1 deletion(-)
>> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr116839.c
>> >
>> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
>> > index 2f736a3b346..cfa84ed013d 100644
>> > --- a/gcc/config/i386/i386.cc
>> > +++ b/gcc/config/i386/i386.cc
>> > @@ -12469,7 +12469,7 @@ ix86_tls_address_pattern_p (rtx op)
>> >  }
>> >
>> >  /* Rewrite *LOC so that it refers to a default TLS address space.  */
>> > -void
>> > +static void
>> >  ix86_rewrite_tls_address_1 (rtx *loc)
>> >  {
>> >subrtx_ptr_iterator::array_type array;
>> > @@ -12491,6 +12491,13 @@ ix86_rewrite_tls_address_1 (rtx *loc)
>> >   if (GET_CODE (u) == UNSPEC
>> >   && XINT (u, 1) == UNSPEC_TP)
>> > {
>> > + /* NB: Since address override only applies to the
>> > +(reg32) part in fs:(reg32), return if address
>> > +override is used.  */
>> > + if (Pmode != word_mode
>> > + && REG_P (XEXP (*x, 1 - i)))
>> > +   return;
>> > +
>> >   addr_space_t as = DEFAULT_TLS_SEG_REG;
>> >
>> >   *x = XEXP (*x, 1 - i);
>> > diff --git a/gcc/testsuite/gcc.target/i386/pr116839.c 
>> > b/gcc/testsuite/gcc.target/i386/pr116839.c
>> > new file mode 100644
>> > index 000..e5df8256251
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/i386/pr116839.c
>> > @@ -0,0 +1,48 @@
>> > +/* { dg-do compile { target { ! ia32 } } } */
>> > +/* { dg-require-effective-target maybe_x32 } */
>> > +/* { dg-options "-mx32 -O2 -fPIC -mtls-dialect=gnu2" } */
>> > +/* { dg-final { scan-assembler-not "cmpl\[ 
>> > \t\]+%fs:previous_emax@dtpoff\\(%eax\\)" } } */
>> > +
>> > +typedef long mpfr_prec_t;
>> > +typedef long mpfr_exp_t;
>> > +typedef struct {
>> > +  mpfr_prec_t _mpfr_prec;
>> > +} __mpfr_struct;
>> > +typedef __mpfr_struct mpfr_t[1];
>> > +extern _Thread_local mpfr_exp_t __gmpfr_emax;
>> > +static _Thread_local mpfr_exp_t previous_emax;
>> > +static _Thread_local mpfr_t bound_emax;
>> > +extern const mpfr_t __gmpfr_const_log2_RNDD;
>> > +extern const mpfr_t __gmpfr_const_log2_RNDU;
>> > +
>> > +typedef enum {
>> > +  MPFR_RNDN=0,
>> > +  MPFR_RNDZ,
>> > +  MPFR_RNDU,
>> > +  MPFR_RNDD,
>> > +  MPFR_RNDA,
>> > +  MPFR_RNDF,
>> > +  MPFR_RNDNA=-1
>> > +} mpfr_rnd_t;
>> > +typedef __mpfr_struct *mpfr_ptr;
>> > +typedef const __mpfr_struct *mpfr_srcptr;
>> > +void mpfr_mul (mpfr_ptr, mpfr_srcptr, mpfr_rnd_t);
>> > +
>> > +void
>> > +foo (void)
>> > +{
>> > +  mpfr_exp_t saved_emax;
>> > +
>> > +  if (__gmpfr_emax != previous_emax)
>> > +{
>> > +  saved_emax = __gmpfr_emax;
>> > +
>> > +  bound_emax->_mpfr_prec = 32;
>> > +
>> > +  mpfr_mul (bound_emax, saved_emax < 0 ?
>> > +__gmpfr_const_log2_RNDD : __gmpfr_const_log2_RNDU,
>> > +MPFR_RNDU);
>> > +  previous_emax = saved_emax;
>> > +  __gmpfr_emax = saved_emax;
>> > +}
>> > +}
>> > --
>> > 2.46.1
>> >
>>


[committed] i386: Modernize AMD processor types

2024-09-27 Thread Uros Bizjak
Use iterative PTA definitions for members of the same AMD processor family.

Also, fix a couple of related M_CPU_TYPE/M_CPU_SUBTYPE inconsistencies.

No functional changes intended.

gcc/ChangeLog:

* config/i386/i386.h: Add PTA_BDVER1, PTA_BDVER2, PTA_BDVER3,
PTA_BDVER4, PTA_BTVER1 and PTA_BTVER2.
* common/config/i386/i386-common.cc (processor_alias_table)
<"bdver1">: Use PTA_BDVER1.
<"bdver2">: Use PTA_BDVER2.
<"bdver3">: Use PTA_BDVER3.
<"bdver4">: Use PTA_BDVER4.
<"btver1">: Use PTA_BTVER1.  Use M_CPU_TYPE (AMD_BTVER1).
<"btver2">: Use PTA_BTVER2.
<"shanghai>: Use M_CPU_SUBTYPE (AMDFAM10H_SHANGHAI).
<"istanbul>: Use M_CPU_SUBTYPE (AMDFAM10H_ISTANBUL).

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/common/config/i386/i386-common.cc 
b/gcc/common/config/i386/i386-common.cc
index fb744319b05..3f2fc599009 100644
--- a/gcc/common/config/i386/i386-common.cc
+++ b/gcc/common/config/i386/i386-common.cc
@@ -2348,34 +2348,16 @@ const pta processor_alias_table[] =
   | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_PRFCHW | PTA_FXSR,
 M_CPU_SUBTYPE (AMDFAM10H_BARCELONA), P_PROC_DYNAMIC},
   {"bdver1", PROCESSOR_BDVER1, CPU_BDVER1,
-PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
-  | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
-  | PTA_XOP | PTA_LWP | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE,
-M_CPU_TYPE (AMDFAM15H_BDVER1), P_PROC_XOP},
+PTA_BDVER1,
+M_CPU_SUBTYPE (AMDFAM15H_BDVER1), P_PROC_XOP},
   {"bdver2", PROCESSOR_BDVER2, CPU_BDVER2,
-PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
-  | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
-  | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
-  | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE,
-M_CPU_TYPE (AMDFAM15H_BDVER2), P_PROC_FMA},
+PTA_BDVER2,
+M_CPU_SUBTYPE (AMDFAM15H_BDVER2), P_PROC_FMA},
   {"bdver3", PROCESSOR_BDVER3, CPU_BDVER3,
-PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
-  | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
-  | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
-  | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE
-  | PTA_XSAVEOPT | PTA_FSGSBASE,
+PTA_BDVER3,
 M_CPU_SUBTYPE (AMDFAM15H_BDVER3), P_PROC_FMA},
   {"bdver4", PROCESSOR_BDVER4, CPU_BDVER4,
-PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
-  | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2
-  | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_BMI2
-  | PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR
-  | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE | PTA_RDRND
-  | PTA_MOVBE | PTA_MWAITX,
+PTA_BDVER4,
 M_CPU_SUBTYPE (AMDFAM15H_BDVER4), P_PROC_AVX2},
   {"znver1", PROCESSOR_ZNVER1, CPU_ZNVER1,
 PTA_ZNVER1,
@@ -2393,16 +2375,10 @@ const pta processor_alias_table[] =
 PTA_ZNVER5,
 M_CPU_SUBTYPE (AMDFAM1AH_ZNVER5), P_PROC_AVX512F},
   {"btver1", PROCESSOR_BTVER1, CPU_GENERIC,
-PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
-  | PTA_SSSE3 | PTA_SSE4A | PTA_ABM | PTA_CX16 | PTA_PRFCHW
-  | PTA_FXSR | PTA_XSAVE,
-   M_CPU_SUBTYPE (AMDFAM15H_BDVER1), P_PROC_SSE4_A},
+PTA_BTVER1,
+M_CPU_TYPE (AMD_BTVER1), P_PROC_SSE4_A},
   {"btver2", PROCESSOR_BTVER2, CPU_BTVER2,
-PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
-  | PTA_SSSE3 | PTA_SSE4A | PTA_ABM | PTA_CX16 | PTA_SSE4_1
-  | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
-  | PTA_BMI | PTA_F16C | PTA_MOVBE | PTA_PRFCHW
-  | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT,
+PTA_BTVER2,
 M_CPU_TYPE (AMD_BTVER2), P_PROC_BMI},
 
   {"generic", PROCESSOR_GENERIC, CPU_GENERIC,
@@ -2421,9 +2397,9 @@ const pta processor_alias_table[] =
   {"amdfam19h", PROCESSOR_GENERIC, CPU_GENERIC, 0,
 M_CPU_TYPE (AMDFAM19H), P_NONE},
   {"shanghai", PROCESSOR_GENERIC, CPU_GENERIC, 0,
-M_CPU_TYPE (AMDFAM10H_SHANGHAI), P_NONE},
+M_CPU_SUBTYPE (AMDFAM10H_SHANGHAI), P_NONE},
   {"istanbul", PROCESSOR_GENERIC, CPU_GENERIC, 0,
-M_CPU_TYPE (AMDFAM10H_ISTANBUL), P_NONE},
+M_CPU_SUBTYPE (AMDFAM10H_ISTANBUL), P_NONE},
 };
 
 /* NB: processor_alias_table stops at the "generic" entry.  */
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 58e6f2826bf..82177b9d383 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2429,6 +2429,18 @@ constexpr wide_int_bitmask PTA_CLEARWATERFOREST = 
PTA_SIERRAFOREST
   | PTA_AVXVNNIINT16 | PTA_SHA512 | PTA_SM3 | PTA_SM4 | PTA_USER_MSR
   | PTA_PREFETCHI;
 constexpr wide_int_bitmask PTA_PANTHERLAKE = PTA_ARROWLAKE_S | PTA_PREFETCHI;
+
+constexpr wide_int_bitmask PTA_BDVER1 = PTA_64BIT | PTA_MMX | PTA_SSE
+  | PTA_SSE2 | PTA_SSE3 | PTA_

Re: [PATCH] x86: Don't use address override with segment register

2024-09-25 Thread Uros Bizjak
On Wed, Sep 25, 2024 at 11:42 AM H.J. Lu  wrote:
>
> Address override only applies to the (reg32) part in the thread address
> fs:(reg32).  Don't rewrite thread address like
>
> (set (reg:CCZ 17 flags)
> (compare:CCZ (reg:SI 98 [ __gmpfr_emax.0_1 ])
> (mem/c:SI (plus:SI (plus:SI (unspec:SI [
> (const_int 0 [0])
> ] UNSPEC_TP)
> (reg:SI 107))
> (const:SI (unspec:SI [
> (symbol_ref:SI ("previous_emax") [flags 0x1a] 
> )
> ] UNSPEC_DTPOFF))) [1 previous_emax+0 S4 A32])))
>
> if address override is used to avoid the invalid memory operand like
>
> cmpl    %fs:previous_emax@dtpoff(%eax), %r12
>
> gcc/
>
> PR target/116839
> * config/i386/i386.cc (ix86_rewrite_tls_address_1): Make it
> static.  Return if TLS address is thread register plus an integer
> register.
>
> gcc/testsuite/
>
> PR target/116839
> * gcc.target/i386/pr116839.c: New file.

OK.

Thanks,
Uros.

>
> Signed-off-by: H.J. Lu 
> ---
>  gcc/config/i386/i386.cc  |  9 -
>  gcc/testsuite/gcc.target/i386/pr116839.c | 48 
>  2 files changed, 56 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116839.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 2f736a3b346..cfa84ed013d 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -12469,7 +12469,7 @@ ix86_tls_address_pattern_p (rtx op)
>  }
>
>  /* Rewrite *LOC so that it refers to a default TLS address space.  */
> -void
> +static void
>  ix86_rewrite_tls_address_1 (rtx *loc)
>  {
>subrtx_ptr_iterator::array_type array;
> @@ -12491,6 +12491,13 @@ ix86_rewrite_tls_address_1 (rtx *loc)
>   if (GET_CODE (u) == UNSPEC
>   && XINT (u, 1) == UNSPEC_TP)
> {
> + /* NB: Since address override only applies to the
> +(reg32) part in fs:(reg32), return if address
> +override is used.  */
> + if (Pmode != word_mode
> + && REG_P (XEXP (*x, 1 - i)))
> +   return;
> +
>   addr_space_t as = DEFAULT_TLS_SEG_REG;
>
>   *x = XEXP (*x, 1 - i);
> diff --git a/gcc/testsuite/gcc.target/i386/pr116839.c 
> b/gcc/testsuite/gcc.target/i386/pr116839.c
> new file mode 100644
> index 000..e5df8256251
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116839.c
> @@ -0,0 +1,48 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-require-effective-target maybe_x32 } */
> +/* { dg-options "-mx32 -O2 -fPIC -mtls-dialect=gnu2" } */
> +/* { dg-final { scan-assembler-not "cmpl\[ 
> \t\]+%fs:previous_emax@dtpoff\\(%eax\\)" } } */
> +
> +typedef long mpfr_prec_t;
> +typedef long mpfr_exp_t;
> +typedef struct {
> +  mpfr_prec_t _mpfr_prec;
> +} __mpfr_struct;
> +typedef __mpfr_struct mpfr_t[1];
> +extern _Thread_local mpfr_exp_t __gmpfr_emax;
> +static _Thread_local mpfr_exp_t previous_emax;
> +static _Thread_local mpfr_t bound_emax;
> +extern const mpfr_t __gmpfr_const_log2_RNDD;
> +extern const mpfr_t __gmpfr_const_log2_RNDU;
> +
> +typedef enum {
> +  MPFR_RNDN=0,
> +  MPFR_RNDZ,
> +  MPFR_RNDU,
> +  MPFR_RNDD,
> +  MPFR_RNDA,
> +  MPFR_RNDF,
> +  MPFR_RNDNA=-1
> +} mpfr_rnd_t;
> +typedef __mpfr_struct *mpfr_ptr;
> +typedef const __mpfr_struct *mpfr_srcptr;
> +void mpfr_mul (mpfr_ptr, mpfr_srcptr, mpfr_rnd_t);
> +
> +void
> +foo (void)
> +{
> +  mpfr_exp_t saved_emax;
> +
> +  if (__gmpfr_emax != previous_emax)
> +{
> +  saved_emax = __gmpfr_emax;
> +
> +  bound_emax->_mpfr_prec = 32;
> +
> +  mpfr_mul (bound_emax, saved_emax < 0 ?
> +__gmpfr_const_log2_RNDD : __gmpfr_const_log2_RNDU,
> +MPFR_RNDU);
> +  previous_emax = saved_emax;
> +  __gmpfr_emax = saved_emax;
> +}
> +}
> --
> 2.46.1
>


Re: [PATCH] [x86] Define VECTOR_STORE_FLAG_VALUE

2024-09-24 Thread Uros Bizjak
On Tue, Sep 24, 2024 at 11:23 AM liuhongt  wrote:
>
> Return constm1_rtx when GET_MODE_CLASS (MODE) == MODE_VECTOR_INT.
> Otherwise NULL_RTX.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ready push to trunk.
>
> gcc/ChangeLog:
>
> * config/i386/i386.h (VECTOR_STORE_FLAG_VALUE): New macro.
>
> gcc/testsuite/ChangeLog:
> * gcc.dg/rtl/x86_64/vector_eq.c: New test.
> ---
>  gcc/config/i386/i386.h  |  5 +++-
>  gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c | 26 +
>  2 files changed, 30 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index c1ec92ffb15..b12be41424f 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -899,7 +899,10 @@ extern const char *host_detect_local_cpu (int argc, 
> const char **argv);
> and give entire struct the alignment of an int.  */
>  /* Required on the 386 since it doesn't have bit-field insns.  */
>  #define PCC_BITFIELD_TYPE_MATTERS 1
> -
> +
> +#define VECTOR_STORE_FLAG_VALUE(MODE) \
> +  (GET_MODE_CLASS (MODE) == MODE_VECTOR_INT ? constm1_rtx : NULL_RTX)
> +
>  /* Standard register usage.  */
>
>  /* This processor has special stack-like registers.  See reg-stack.cc
> diff --git a/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c 
> b/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c
> new file mode 100644
> index 000..b82603d0b64
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile { target x86_64-*-* } } */

target { { i?86-*-* x86_64-*-* } && lp64 }

Uros.

> +/* { dg-additional-options "-O2 -march=x86-64-v3" } */
> +
> +typedef int v4si __attribute__((vector_size(16)));
> +
> +v4si __RTL (startwith ("vregs")) foo (void)
> +{
> +(function "foo"
> +  (insn-chain
> +(block 2
> +  (edge-from entry (flags "FALLTHRU"))
> +  (cnote 1 [bb 2] NOTE_INSN_BASIC_BLOCK)
> +  (cnote 2 NOTE_INSN_FUNCTION_BEG)
> +  (cinsn 3 (set (reg:V4SI <0>) (const_vector:V4SI [(const_int 0) 
> (const_int 0) (const_int 0) (const_int 0)])))
> +  (cinsn 5 (set (reg:V4SI <2>)
> +   (eq:V4SI (reg:V4SI <0>) (reg:V4SI <1>
> +  (cinsn 6 (set (reg:V4SI <3>) (reg:V4SI <2>)))
> +  (cinsn 7 (set (reg:V4SI xmm0) (reg:V4SI <3>)))
> +  (edge-to exit (flags "FALLTHRU"))
> +)
> +  )
> + (crtl (return_rtx (reg/i:V4SI xmm0)))
> +)
> +}
> +
> +/* { dg-final { scan-assembler-not "vpxor" } } */
> --
> 2.31.1
>
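
In source terms, the property VECTOR_STORE_FLAG_VALUE encodes here is that a
vector comparison yields all-ones (-1) in each true lane, e.g. (an
illustrative source-level view, not part of the patch):

    typedef int v4si __attribute__ ((vector_size (16)));

    v4si
    veq (v4si a, v4si b)
    {
      return a == b;   /* each lane where a == b holds becomes -1 (all bits set) */
    }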


Re: [PATCH v1] Widening-Mul: Fix one ICE for SAT_SUB matching operand promotion

2024-09-24 Thread Uros Bizjak
On Tue, Sep 24, 2024 at 8:53 AM Li, Pan2  wrote:
>
> Got it and thanks, let me rerun to make sure it works well as expected.

For reference, this is documented in:

https://gcc.gnu.org/wiki/Testing_GCC
https://gcc-newbies-guide.readthedocs.io/en/latest/working-with-the-testsuite.html
https://gcc.gnu.org/install/test.html

Uros.


Re: [PATCH v1] Widening-Mul: Fix one ICE for SAT_SUB matching operand promotion

2024-09-23 Thread Uros Bizjak
On Tue, Sep 24, 2024 at 8:24 AM Li, Pan2  wrote:
>
> Thanks Uros for comments.
>
> > This is not "target", but "middle-end" component. Even though the bug
> > is exposed on x86_64 target, the fix is in the middle-end code, not in
> > the target code.
>
> Sure, will rename to middle-end.
>
> > Please remove -m32 and use "{ dg-do compile { target ia32 } }" instead.
>
> Is there any suggestion on how to run the "ia32" tests when configuring the gcc build?
> I first tried ia32, but the test was reported as UNSUPPORTED in this case.

You can add the following to your testsuite run:

RUNTESTFLAGS="--target-board=unix\{,-m32\}"

e.g:

make -j N -k check RUNTESTFLAGS=...

(where N is the number of make threads)

You can also add "dg.exp" or "dg.exp=pr12345.c" (or any other exp file
or testcase name) to RUNTESTFLAGS to run only one exp file or a single
test.
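
For example, an illustrative invocation combining the pieces above (adjust
the .exp file and test name as needed):

    make -j 8 -k check RUNTESTFLAGS="--target-board=unix\{,-m32\} dg.exp=pr12345.c"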

Uros.

> Pan
>
> -Original Message-
> From: Uros Bizjak 
> Sent: Tuesday, September 24, 2024 2:17 PM
> To: Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; richard.guent...@gmail.com; 
> tamar.christ...@arm.com; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
> jeffreya...@gmail.com; rdapp@gmail.com
> Subject: Re: [PATCH v1] Widening-Mul: Fix one ICE for SAT_SUB matching 
> operand promotion
>
> On Mon, Sep 23, 2024 at 4:58 PM  wrote:
> >
> > From: Pan Li 
> >
> > This patch would like to fix the following ICE for -O2 -m32 of x86_64.
> >
> > during RTL pass: expand
> > JackMidiAsyncWaitQueue.cpp.cpp: In function 'void DequeueEvent(unsigned
> > int)':
> > JackMidiAsyncWaitQueue.cpp.cpp:3:6: internal compiler error: in
> > expand_fn_using_insn, at internal-fn.cc:263
> > 3 | void DequeueEvent(unsigned frame) {
> >   |  ^~~~
> > 0x27b580d diagnostic_context::diagnostic_impl(rich_location*,
> > diagnostic_metadata const*, diagnostic_option_id, char const*,
> > __va_list_tag (*) [1], diagnostic_t)
> > ???:0
> > 0x27c4a3f internal_error(char const*, ...)
> > ???:0
> > 0x27b3994 fancy_abort(char const*, int, char const*)
> > ???:0
> > 0xf25ae5 expand_fn_using_insn(gcall*, insn_code, unsigned int, unsigned int)
> > ???:0
> > 0xf2a124 expand_direct_optab_fn(internal_fn, gcall*, optab_tag, unsigned 
> > int)
> > ???:0
> > 0xf2c87c expand_SAT_SUB(internal_fn, gcall*)
> > ???:0
> >
> > We allowed the operand convert when matching SAT_SUB in match.pd, to support
> > the zip benchmark SAT_SUB pattern.  Aka,
> >
> > (convert? (minus (convert1? @0) (convert1? @1))) for below sample code.
> >
> > void test (uint16_t *x, unsigned b, unsigned n)
> > {
> >   unsigned a = 0;
> >   register uint16_t *p = x;
> >
> >   do {
> > a = *--p;
> > *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
> >   } while (--n);
> > }
> >
> > The pattern match for SAT_SUB itself may also act on below scalar sample
> > code too.
> >
> > unsigned long long GetTimeFromFrames(int);
> > unsigned long long GetMicroSeconds();
> >
> > void DequeueEvent(unsigned frame) {
> >   long long frame_time = GetTimeFromFrames(frame);
> >   unsigned long long current_time = GetMicroSeconds();
> >   DequeueEvent(frame_time < current_time ? 0 : frame_time - current_time);
> > }
> >
> > Aka:
> >
> > uint32_t a = (uint32_t)SAT_SUB(uint64_t, uint64_t);
> >
> > Then there will be a problem when ia32 or -m32 is given when compiling.
> > Because we only check the lhs (aka uint32_t) type is supported by ifn
> > and missed the operand (aka uint64_t).  Mostly DImode is disabled for
> > 32 bits target like ia32 or rv32gcv, and then trigger ICE when expanding.
> >
> > The below test suites are passed for this patch.
> > * The rv64gcv fully regression test.
> > * The x86 bootstrap test.
> > * The x86 fully regression test.
> >
> > PR target/116814
>
> This is not "target", but "middle-end" component. Even though the bug
> is exposed on x86_64 target, the fix is in the middle-end code, not in
> the target code.
>
> > gcc/ChangeLog:
> >
> > * tree-ssa-math-opts.cc (build_saturation_binary_arith_call): Add
> > ifn is_supported check for operand TREE type.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * g++.dg/torture/pr116814-1.C: New test.
> >
> > Signed-off-by: Pan Li 
> > ---
> >  gcc/testsuite/g++.dg/torture/pr116814-1.C | 12 

Re: [PATCH v1] Widening-Mul: Fix one ICE for SAT_SUB matching operand promotion

2024-09-23 Thread Uros Bizjak
On Mon, Sep 23, 2024 at 4:58 PM  wrote:
>
> From: Pan Li 
>
> This patch would like to fix the following ICE for -O2 -m32 of x86_64.
>
> during RTL pass: expand
> JackMidiAsyncWaitQueue.cpp.cpp: In function 'void DequeueEvent(unsigned
> int)':
> JackMidiAsyncWaitQueue.cpp.cpp:3:6: internal compiler error: in
> expand_fn_using_insn, at internal-fn.cc:263
> 3 | void DequeueEvent(unsigned frame) {
>   |  ^~~~
> 0x27b580d diagnostic_context::diagnostic_impl(rich_location*,
> diagnostic_metadata const*, diagnostic_option_id, char const*,
> __va_list_tag (*) [1], diagnostic_t)
> ???:0
> 0x27c4a3f internal_error(char const*, ...)
> ???:0
> 0x27b3994 fancy_abort(char const*, int, char const*)
> ???:0
> 0xf25ae5 expand_fn_using_insn(gcall*, insn_code, unsigned int, unsigned int)
> ???:0
> 0xf2a124 expand_direct_optab_fn(internal_fn, gcall*, optab_tag, unsigned int)
> ???:0
> 0xf2c87c expand_SAT_SUB(internal_fn, gcall*)
> ???:0
>
> We allowed the operand convert when matching SAT_SUB in match.pd, to support
> the zip benchmark SAT_SUB pattern.  Aka,
>
> (convert? (minus (convert1? @0) (convert1? @1))) for below sample code.
>
> void test (uint16_t *x, unsigned b, unsigned n)
> {
>   unsigned a = 0;
>   register uint16_t *p = x;
>
>   do {
> a = *--p;
> *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
>   } while (--n);
> }
>
> The pattern match for SAT_SUB itself may also act on below scalar sample
> code too.
>
> unsigned long long GetTimeFromFrames(int);
> unsigned long long GetMicroSeconds();
>
> void DequeueEvent(unsigned frame) {
>   long long frame_time = GetTimeFromFrames(frame);
>   unsigned long long current_time = GetMicroSeconds();
>   DequeueEvent(frame_time < current_time ? 0 : frame_time - current_time);
> }
>
> Aka:
>
> uint32_t a = (uint32_t)SAT_SUB(uint64_t, uint64_t);
>
> Then there will be a problem when ia32 or -m32 is given when compiling.
> Because we only check the lhs (aka uint32_t) type is supported by ifn
> and missed the operand (aka uint64_t).  Mostly DImode is disabled for
> 32 bits target like ia32 or rv32gcv, and then trigger ICE when expanding.
>
> The below test suites are passed for this patch.
> * The rv64gcv fully regression test.
> * The x86 bootstrap test.
> * The x86 fully regression test.
>
> PR target/116814

This is not "target", but "middle-end" component. Even though the bug
is exposed on x86_64 target, the fix is in the middle-end code, not in
the target code.

> gcc/ChangeLog:
>
> * tree-ssa-math-opts.cc (build_saturation_binary_arith_call): Add
> ifn is_supported check for operand TREE type.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/torture/pr116814-1.C: New test.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/testsuite/g++.dg/torture/pr116814-1.C | 12 
>  gcc/tree-ssa-math-opts.cc | 23 +++
>  2 files changed, 27 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/torture/pr116814-1.C
>
> diff --git a/gcc/testsuite/g++.dg/torture/pr116814-1.C 
> b/gcc/testsuite/g++.dg/torture/pr116814-1.C
> new file mode 100644
> index 000..8db5b020cfd
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/torture/pr116814-1.C
> @@ -0,0 +1,12 @@
> +/* { dg-do compile { target { i?86-*-* x86_64-*-* } } } */
> +/* { dg-options "-O2 -m32" } */

Please remove -m32 and use "{ dg-do compile { target ia32 } }" instead.
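
That is, the test's header lines would become (illustrative):

    /* { dg-do compile { target ia32 } } */
    /* { dg-options "-O2" } */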

Uros.

> +
> +unsigned long long GetTimeFromFrames(int);
> +unsigned long long GetMicroSeconds();
> +
> +void DequeueEvent(unsigned frame) {
> +  long long frame_time = GetTimeFromFrames(frame);
> +  unsigned long long current_time = GetMicroSeconds();
> +
> +  DequeueEvent(frame_time < current_time ? 0 : frame_time - current_time);
> +}
> diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
> index d61668aacfc..361761cedef 100644
> --- a/gcc/tree-ssa-math-opts.cc
> +++ b/gcc/tree-ssa-math-opts.cc
> @@ -4042,15 +4042,22 @@ build_saturation_binary_arith_call 
> (gimple_stmt_iterator *gsi, gphi *phi,
> internal_fn fn, tree lhs, tree op_0,
> tree op_1)
>  {
> -  if (direct_internal_fn_supported_p (fn, TREE_TYPE (lhs), 
> OPTIMIZE_FOR_BOTH))
> -{
> -  gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
> -  gimple_call_set_lhs (call, lhs);
> -  gsi_insert_before (gsi, call, GSI_SAME_STMT);
> +  tree lhs_type = TREE_TYPE (lhs);
> +  tree op_type = TREE_TYPE (op_0);
>
> -  gimple_stmt_iterator psi = gsi_for_stmt (phi);
> -  remove_phi_node (&psi, /* release_lhs_p */ false);
> -}
> +  if (!direct_internal_fn_supported_p (fn, lhs_type, OPTIMIZE_FOR_BOTH))
> +return;
> +
> +  if (lhs_type != op_type
> +  && !direct_internal_fn_supported_p (fn, op_type, OPTIMIZE_FOR_BOTH))
> +return;
> +
> +  gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
> +  gimple_call_set_

Re: [PATCH] i386: Fix up _mm_min_ss etc. handling of zeros and NaNs [PR116738]

2024-09-19 Thread Uros Bizjak
On Thu, Sep 19, 2024 at 10:49 PM Jakub Jelinek  wrote:
>
> Hi!
>
> min/max patterns for intrinsics which on x86 result in the second
> input operand if the two operands are both zeros or one or both of them
> are a NaN shouldn't use SMIN/SMAX RTL, because that is similarly to
> MIN_EXPR/MAX_EXPR undefined what will be the result in those cases.
>
> The following patch adds an expander which uses either a new pattern with
> UNSPEC_IEEE_M{AX,IN} or use the S{MIN,MAX} representation of the same.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> P.S. I have a patch to replace UNSPEC_IEEE_M{AX,IN} with IF_THEN_ELSE
> (except for the 3dNOW! PFMIN/MAX, those actually are documented to behave
> differently), but it actually doesn't improve anything much, as
> neither simplify_const_relational_operation nor simplify_ternary_operation is
> able to fold comparisons with two CONST_VECTOR operands or IF_THEN_ELSE
> with 3 CONST_VECTOR operands.
> So, maybe better approach will be to generic fold the builtins with constant
> arguments (maybe leaving NaNs to runtime).

I think it is still worth it to implement insn patterns with generic
RTXes instead of unspecs. Maybe some future improvement to generic RTX
simplification will be able to handle them.

>
> 2024-09-19  Uros Bizjak  
> Jakub Jelinek  
>
> PR target/116738
> * config/i386/subst.md (mask_scalar_operand_arg34,
> mask_scalar_expand_op3, round_saeonly_scalar_mask_arg3): New
> subst attributes.
> * config/i386/sse.md
> (_vm3):
> Change from define_insn to define_expand, rename the old define_insn
> to ...
> (*_vm3):
> ... this.
> 
> (_ieee_vm3):
> New define_insn.
>
> * gcc.target/i386/sse-pr116738.c: New test.

OK, also for backports.

Thanks,
Uros.
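
To illustrate the semantics at issue (a sketch, not part of the patch):
MINSS and MAXSS return the second source operand when both inputs are zeros
of either sign or when an input is a NaN, which is why plain SMIN/SMAX RTL,
undefined in exactly those cases, cannot represent the intrinsics:

    #include <xmmintrin.h>

    float
    min_ss (float a, float b)
    {
      /* min_ss (0.0f, -0.0f) yields -0.0f (the second operand);
         min_ss (__builtin_nanf (""), 1.0f) yields 1.0f.  */
      return _mm_cvtss_f32 (_mm_min_ss (_mm_set_ss (a), _mm_set_ss (b)));
    }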

>
> --- gcc/config/i386/subst.md.jj 2024-09-18 15:49:42.200791315 +0200
> +++ gcc/config/i386/subst.md2024-09-19 12:32:51.048626421 +0200
> @@ -366,6 +366,8 @@ (define_subst_attr "mask_scalar_operand4
>  (define_subst_attr "mask_scalarcz_operand4" "mask_scalarcz" "" "%{%5%}%N4")
>  (define_subst_attr "mask_scalar4_dest_false_dep_for_glc_cond" "mask_scalar" 
> "1" "operands[4] == CONST0_RTX(mode)")
>  (define_subst_attr "mask_scalarc_dest_false_dep_for_glc_cond" "mask_scalarc" 
> "1" "operands[3] == CONST0_RTX(V8HFmode)")
> +(define_subst_attr "mask_scalar_operand_arg34" "mask_scalar" "" ", 
> operands[3], operands[4]")
> +(define_subst_attr "mask_scalar_expand_op3" "mask_scalar" "3" "5")
>
>  (define_subst "mask_scalar"
>[(set (match_operand:SUBST_V 0)
> @@ -473,6 +475,7 @@ (define_subst_attr "round_saeonly_scalar
>  (define_subst_attr "round_saeonly_scalar_constraint" "round_saeonly_scalar" 
> "vm" "v")
>  (define_subst_attr "round_saeonly_scalar_prefix" "round_saeonly_scalar" 
> "vex" "evex")
>  (define_subst_attr "round_saeonly_scalar_nimm_predicate" 
> "round_saeonly_scalar" "nonimmediate_operand" "register_operand")
> +(define_subst_attr "round_saeonly_scalar_mask_arg3" "round_saeonly_scalar" 
> "" ", operands[]")
>
>  (define_subst "round_saeonly_scalar"
>[(set (match_operand:SUBST_V 0)
> --- gcc/config/i386/sse.md.jj   2024-09-10 16:26:02.875151133 +0200
> +++ gcc/config/i386/sse.md  2024-09-19 12:43:31.693030695 +0200
> @@ -,7 +,27 @@ (define_insn "*ieee_3
>(const_string "*")))
> (set_attr "mode" "")])
>
> -(define_insn 
> "_vm3"
> +(define_expand 
> "_vm3"
> +  [(set (match_operand:VFH_128 0 "register_operand")
> +   (vec_merge:VFH_128
> + (smaxmin:VFH_128
> +   (match_operand:VFH_128 1 "register_operand")
> +   (match_operand:VFH_128 2 "nonimmediate_operand"))
> +(match_dup 1)
> +(const_int 1)))]
> +  "TARGET_SSE"
> +{
> +  if (!flag_finite_math_only || flag_signed_zeros)
> +{
> +  emit_insn 
> (gen__ieee_vm3
> +(operands[0], operands[1], operands[2]
> + 
> + ));
> +  DONE;
> +}
> +})
> +
> +(define_insn 
> "*_vm3"
>[(set (match_operand:VFH_128 0 "register_operand" "=x,v")
> (vec_merge:VFH_128

Re: [PATCH] x86-64: Don't use temp for argument in a TImode register

2024-09-15 Thread Uros Bizjak
On Sat, Sep 14, 2024 at 12:58 PM H.J. Lu  wrote:
>
> On Sun, Sep 8, 2024 at 12:10 AM Uros Bizjak  wrote:
> >
> > On Fri, Sep 6, 2024 at 2:24 PM H.J. Lu  wrote:
> > >
> > > Don't use temp for a PARALLEL BLKmode argument of an EXPR_LIST expression
> > > in a TImode register.  Otherwise, the TImode variable will be put in
> > > the GPR save area which guarantees only 8-byte alignment.
> > >
> > > gcc/
> > >
> > > PR target/116621
> > > * config/i386/i386.cc (ix86_gimplify_va_arg): Don't use temp for
> > > a PARALLEL BLKmode container of an EXPR_LIST expression in a
> > > TImode register.
> > >
> > > gcc/testsuite/
> > >
> > > PR target/116621
> > > * gcc.target/i386/pr116621.c: New test.
> >
> > LGTM.
>
> OK to backport to release branches?

OK.

Thanks,
Uros.

>
> > Thanks,
> > Uros.
> >
> > >
> > > Signed-off-by: H.J. Lu 
> > > ---
> > >  gcc/config/i386/i386.cc  | 22 ++--
> > >  gcc/testsuite/gcc.target/i386/pr116621.c | 43 
> > >  2 files changed, 63 insertions(+), 2 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr116621.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 707b75a6d5d..45320124b91 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -4908,13 +4908,31 @@ ix86_gimplify_va_arg (tree valist, tree type, 
> > > gimple_seq *pre_p,
> > >
> > >examine_argument (nat_mode, type, 0, &needed_intregs, 
> > > &needed_sseregs);
> > >
> > > -  need_temp = (!REG_P (container)
> > > +  bool container_in_reg = false;
> > > +  if (REG_P (container))
> > > +   container_in_reg = true;
> > > +  else if (GET_CODE (container) == PARALLEL
> > > +  && GET_MODE (container) == BLKmode
> > > +  && XVECLEN (container, 0) == 1)
> > > +   {
> > > + /* Check if it is a PARALLEL BLKmode container of an EXPR_LIST
> > > +expression in a TImode register.  In this case, temp isn't
> > > +needed.  Otherwise, the TImode variable will be put in the
> > > +GPR save area which guarantees only 8-byte alignment.   */
> > > + rtx x = XVECEXP (container, 0, 0);
> > > + if (GET_CODE (x) == EXPR_LIST
> > > + && REG_P (XEXP (x, 0))
> > > + && XEXP (x, 1) == const0_rtx)
> > > +   container_in_reg = true;
> > > +   }
> > > +
> > > +  need_temp = (!container_in_reg
> > >&& ((needed_intregs && TYPE_ALIGN (type) > 64)
> > >|| TYPE_ALIGN (type) > 128));
> > >
> > >/* In case we are passing structure, verify that it is consecutive 
> > > block
> > >   on the register save area.  If not we need to do moves.  */
> > > -  if (!need_temp && !REG_P (container))
> > > +  if (!need_temp && !container_in_reg)
> > > {
> > >   /* Verify that all registers are strictly consecutive  */
> > >   if (SSE_REGNO_P (REGNO (XEXP (XVECEXP (container, 0, 0), 0
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr116621.c 
> > > b/gcc/testsuite/gcc.target/i386/pr116621.c
> > > new file mode 100644
> > > index 000..704266458a8
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr116621.c
> > > @@ -0,0 +1,43 @@
> > > +/* { dg-do run } */
> > > +/* { dg-options "-O2" } */
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +union S8302
> > > +{
> > > +  union
> > > +  {
> > > +double b;
> > > +int c;
> > > +  } a;
> > > +  long double d;
> > > +  unsigned short int f[5];
> > > +};
> > > +
> > > +union S8302 s8302;
> > > +extern void check8302va (int i, ...);
> > > +
> > > +int
> > > +main (void)
> > > +{
> > > +  memset (&s8302, '\0', sizeof (s8302));
> > > +  s8302.a.b = -221438.25;
> > > +  check8302va (1, s8302);
> > > +  return 0;
> > > +}
> > > +
> > > +__attribute__((noinline, noclone))
> > > +void
> > > +check8302va (int z, ...)
> > > +{
> > > +  union S8302 arg, *p;
> > > +  va_list ap;
> > > +
> > > +  __builtin_va_start (ap, z);
> > > +  p = &s8302;
> > > +  arg = __builtin_va_arg (ap, union S8302);
> > > +  if (p->a.b != arg.a.b)
> > > +__builtin_abort ();
> > > +  __builtin_va_end (ap);
> > > +}
> > > --
> > > 2.46.0
> > >
>
>
>
> --
> H.J.
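
Schematically, the argument container shape being special-cased is
(illustrative RTL, restating the description in the patch):

    (parallel:BLK [(expr_list (reg:TI xmm0) (const_int 0))])

i.e. a single-element BLKmode PARALLEL whose EXPR_LIST holds the real
(TImode) register, so the value already lives in a 16-byte-aligned register
and no GPR-save-area temporary is needed.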


[committed] i386: Implement SAT_ADD for signed vector integers

2024-09-12 Thread Uros Bizjak
Enable V4QI, V2QI and V2HI mode signed saturated arithmetic insn patterns
and add a couple of testcases to test for PADDSB and PADDSW instructions.

PR target/112600

gcc/ChangeLog:

* config/i386/mmx.md (3): Rename
from *3.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr112600-3a.c: New test.
* gcc.target/i386/pr112600-3b.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 2f8d958dd5f..e88a06c441f 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -3218,7 +3218,7 @@ (define_insn "*mmx_3"
(set_attr "type" "mmxadd,sseadd,sseadd")
(set_attr "mode" "DI,TI,TI")])
 
-(define_insn "*3"
+(define_insn "3"
   [(set (match_operand:VI_16_32 0 "register_operand" "=x,Yw")
 (sat_plusminus:VI_16_32
  (match_operand:VI_16_32 1 "register_operand" "0,Yw")
diff --git a/gcc/testsuite/gcc.target/i386/pr112600-3a.c 
b/gcc/testsuite/gcc.target/i386/pr112600-3a.c
new file mode 100644
index 000..0c38659643d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr112600-3a.c
@@ -0,0 +1,25 @@
+/* PR middle-end/112600 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -msse2" } */
+
+#define MIN -128
+#define MAX 127
+
+typedef char T;
+typedef unsigned char UT;
+
+void foo (T *out, T *op_1, T *op_2, int n)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+{
+  T x = op_1[i];
+  T y = op_2[i];
+  T sum = (UT) x + (UT) y;
+
+  out[i] = (x ^ y) < 0 ? sum : (sum ^ x) >= 0 ? sum : x < 0 ? MIN : MAX;
+}
+}
+
+/* { dg-final { scan-assembler "paddsb" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr112600-3b.c 
b/gcc/testsuite/gcc.target/i386/pr112600-3b.c
new file mode 100644
index 000..746c422ceb9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr112600-3b.c
@@ -0,0 +1,25 @@
+/* PR middle-end/112600 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -msse2" } */
+
+#define MIN -32768
+#define MAX 32767
+
+typedef short T;
+typedef unsigned short UT;
+
+void foo (T *out, T *op_1, T *op_2, int n)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+{
+  T x = op_1[i];
+  T y = op_2[i];
+  T sum = (UT) x + (UT) y;
+
+  out[i] = (x ^ y) < 0 ? sum : (sum ^ x) >= 0 ? sum : x < 0 ? MIN : MAX;
+}
+}
+
+/* { dg-final { scan-assembler "paddsw" } } */


[committed]: i386: Use offsettable address constraint for double-word memory operands

2024-09-09 Thread Uros Bizjak
Double-word memory operands are accessed as their high and low parts, so the
memory location has to be offsettable.  Use "o" constraint instead of "m"
for double-word memory operands.
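
For example, a TImode operand on x86-64 is accessed as two 8-byte halves,
schematically:

    movq 0(%rax), %rdx        # low part
    movq 8(%rax), %rcx        # high part: addr+8 must itself be valid

The "o" constraint guarantees that adding such a small offset still yields a
valid address, which the more general "m" constraint does not.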

gcc/ChangeLog:

* config/i386/i386.md (*insvdi_lowpart_1): Use "o" constraint
instead of "m" for double-word mode memory operands.
(*add3_doubleword_zext): Ditto.
(*addv4_doubleword_1): Use "jO" constraint instead of "jM"
for double-word mode memory operands.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 0fae3c1eb87..8d269feee83 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3707,7 +3707,7 @@ (define_insn_and_split "*insvdi_lowpart_1"
   [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r,r,&r")
(any_or_plus:DI
  (and:DI
-   (match_operand:DI 1 "nonimmediate_operand" "r,m,r,m")
+   (match_operand:DI 1 "nonimmediate_operand" "r,o,r,o")
(match_operand:DI 3 "const_int_operand" "n,n,n,n"))
  (zero_extend:DI
(match_operand:SI 2 "nonimmediate_operand" "r,r,m,m"]
@@ -6461,7 +6461,7 @@ (define_insn_and_split "*add3_doubleword_zext"
(plus:
  (zero_extend:
(match_operand:DWIH 2 "nonimmediate_operand" "rm,r,rm,r"))
- (match_operand: 1 "nonimmediate_operand" "0,0,r,m")))
+ (match_operand: 1 "nonimmediate_operand" "0,0,r,o")))
(clobber (reg:CC FLAGS_REG))]
   "ix86_binary_operator_ok (UNKNOWN, mode, operands, TARGET_APX_NDD)"
   "#"
@@ -7703,7 +7703,7 @@ (define_insn_and_split "*addv4_doubleword_1"
(eq:CCO
  (plus:
(sign_extend:
- (match_operand: 1 "nonimmediate_operand" "%0,rjM"))
+ (match_operand: 1 "nonimmediate_operand" "%0,rjO"))
(match_operand: 3 "const_scalar_int_operand" "n,n"))
  (sign_extend:
(plus:


Re: [PATCH] x86-64: Don't use temp for argument in a TImode register

2024-09-08 Thread Uros Bizjak
On Fri, Sep 6, 2024 at 2:24 PM H.J. Lu  wrote:
>
> Don't use temp for a PARALLEL BLKmode argument of an EXPR_LIST expression
> in a TImode register.  Otherwise, the TImode variable will be put in
> the GPR save area which guarantees only 8-byte alignment.
>
> gcc/
>
> PR target/116621
> * config/i386/i386.cc (ix86_gimplify_va_arg): Don't use temp for
> a PARALLEL BLKmode container of an EXPR_LIST expression in a
> TImode register.
>
> gcc/testsuite/
>
> PR target/116621
> * gcc.target/i386/pr116621.c: New test.

LGTM.

Thanks,
Uros.

>
> Signed-off-by: H.J. Lu 
> ---
>  gcc/config/i386/i386.cc  | 22 ++--
>  gcc/testsuite/gcc.target/i386/pr116621.c | 43 
>  2 files changed, 63 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116621.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 707b75a6d5d..45320124b91 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -4908,13 +4908,31 @@ ix86_gimplify_va_arg (tree valist, tree type, 
> gimple_seq *pre_p,
>
>examine_argument (nat_mode, type, 0, &needed_intregs, &needed_sseregs);
>
> -  need_temp = (!REG_P (container)
> +  bool container_in_reg = false;
> +  if (REG_P (container))
> +   container_in_reg = true;
> +  else if (GET_CODE (container) == PARALLEL
> +  && GET_MODE (container) == BLKmode
> +  && XVECLEN (container, 0) == 1)
> +   {
> + /* Check if it is a PARALLEL BLKmode container of an EXPR_LIST
> +expression in a TImode register.  In this case, temp isn't
> +needed.  Otherwise, the TImode variable will be put in the
> +GPR save area which guarantees only 8-byte alignment.   */
> + rtx x = XVECEXP (container, 0, 0);
> + if (GET_CODE (x) == EXPR_LIST
> + && REG_P (XEXP (x, 0))
> + && XEXP (x, 1) == const0_rtx)
> +   container_in_reg = true;
> +   }
> +
> +  need_temp = (!container_in_reg
>&& ((needed_intregs && TYPE_ALIGN (type) > 64)
>|| TYPE_ALIGN (type) > 128));
>
>/* In case we are passing structure, verify that it is consecutive 
> block
>   on the register save area.  If not we need to do moves.  */
> -  if (!need_temp && !REG_P (container))
> +  if (!need_temp && !container_in_reg)
> {
>   /* Verify that all registers are strictly consecutive  */
>   if (SSE_REGNO_P (REGNO (XEXP (XVECEXP (container, 0, 0), 0
> diff --git a/gcc/testsuite/gcc.target/i386/pr116621.c 
> b/gcc/testsuite/gcc.target/i386/pr116621.c
> new file mode 100644
> index 000..704266458a8
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116621.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2" } */
> +
> +#include 
> +#include 
> +
> +union S8302
> +{
> +  union
> +  {
> +double b;
> +int c;
> +  } a;
> +  long double d;
> +  unsigned short int f[5];
> +};
> +
> +union S8302 s8302;
> +extern void check8302va (int i, ...);
> +
> +int
> +main (void)
> +{
> +  memset (&s8302, '\0', sizeof (s8302));
> +  s8302.a.b = -221438.25;
> +  check8302va (1, s8302);
> +  return 0;
> +}
> +
> +__attribute__((noinline, noclone))
> +void
> +check8302va (int z, ...)
> +{
> +  union S8302 arg, *p;
> +  va_list ap;
> +
> +  __builtin_va_start (ap, z);
> +  p = &s8302;
> +  arg = __builtin_va_arg (ap, union S8302);
> +  if (p->a.b != arg.a.b)
> +__builtin_abort ();
> +  __builtin_va_end (ap);
> +}
> --
> 2.46.0
>


Re: [x86_64 PATCH] Support read-modify-write memory operands in STV.

2024-08-31 Thread Uros Bizjak
On Sat, Aug 31, 2024 at 3:28 PM Roger Sayle  wrote:
>
>
> Hi Uros,
>
> As requested this patch is split out from my previous submission.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659450.html
> This patch enables STV when the first operand of a TImode binary
> logic operand (AND, IOR or XOR) is a memory operand, which is commonly
> the case with read-modify-write instructions.
>
> A different motivating example from the one given above is:
>
> __int128 m, p, q;
> void foo() {
> m ^= (p & q);
> }
>
> Currently with -O2 -mavx, RMW instructions are rejected by STV,
> resulting in scalar code:
>
> foo:    movq    p(%rip), %rax
>         movq    p+8(%rip), %rdx
>         andq    q(%rip), %rax
>         andq    q+8(%rip), %rdx
>         xorq    %rax, m(%rip)
>         xorq    %rdx, m+8(%rip)
>         ret
>
> With this patch they become scalar-to-vector candidates:
>
> foo:    vmovdqa p(%rip), %xmm0
>         vpand   q(%rip), %xmm0, %xmm0
>         vpxor   m(%rip), %xmm0, %xmm0
>         vmovdqa %xmm0, m(%rip)
>         ret
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-08-31  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-features.cc
> (timode_scalar_to_vector_candidate_p):
> Support the first operand of AND, IOR and XOR being MEM_P, i.e. a
> read-modify-write insn.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/movti-2.c: Change dg-options to -Os.
> * gcc.target/i386/movti-4.c: Expected output of original movti-2.c.

OK.

Thanks,
Uros.

>
>
> Thanks in advance,
> Roger
> --
>


Re: [PATCH] [x86] Check avx upper register for parallel.

2024-08-29 Thread Uros Bizjak
On Fri, Aug 30, 2024 at 6:49 AM liuhongt  wrote:
>
> > Can the above loop be a part of ix86_check_avx_upper_register, so this
> > function would scan the full RTX for avx upper register?
> Changed, also adjust ix86_check_avx_upper_stores and ix86_avx_u128_mode_needed
> to either inline the old ix86_check_avx_upper_register or replace 
> FOR_EACH_SUBRTX
> with new ix86_check_avx_upper_register.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk and backport?
>
> For function arguments/return, when it's BLK mode, it's put in a
> parallel with an expr_list, and the expr_list contains the real mode
> and registers.
> Current ix86_check_avx_upper_register only checked for SSE_REG_P, and
> failed to handle that. The patch extend the handle to each subrtx.
>
> gcc/ChangeLog:
>
> PR target/116512
> * config/i386/i386.cc (ix86_check_avx_upper_register): Iterate
> subrtx to scan for avx upper register.
> (ix86_check_avx_upper_stores): Inline old
> ix86_check_avx_upper_register.
> (ix86_avx_u128_mode_needed): Ditto, and replace
> FOR_EACH_SUBRTX with call to new
> ix86_check_avx_upper_register.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116512.c: New test.

OK for all branches.

Perhaps we could put the repeated condition in a macro, but this could
be an eventual follow-up patch.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386.cc  | 36 +++-
>  gcc/testsuite/gcc.target/i386/pr116512.c | 26 +
>  2 files changed, 49 insertions(+), 13 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116512.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 224a78cc832..c40cee5b885 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -14881,9 +14881,19 @@ ix86_dirflag_mode_needed (rtx_insn *insn)
>  static bool
>  ix86_check_avx_upper_register (const_rtx exp)
>  {
> -  return (SSE_REG_P (exp)
> - && !EXT_REX_SSE_REG_P (exp)
> - && GET_MODE_BITSIZE (GET_MODE (exp)) > 128);
> +  /* construct_container may return a parallel with expr_list
> + which contains the real reg and mode  */
> +  subrtx_iterator::array_type array;
> +  FOR_EACH_SUBRTX (iter, array, exp, NONCONST)
> +{
> +  const_rtx x = *iter;
> +  if (SSE_REG_P (x)
> + && !EXT_REX_SSE_REG_P (x)
> + && GET_MODE_BITSIZE (GET_MODE (x)) > 128)
> +   return true;
> +}
> +
> +  return false;
>  }
>
>  /* Check if a 256bit or 512bit AVX register is referenced in stores.   */
> @@ -14891,7 +14901,9 @@ ix86_check_avx_upper_register (const_rtx exp)
>  static void
>  ix86_check_avx_upper_stores (rtx dest, const_rtx, void *data)
>  {
> -  if (ix86_check_avx_upper_register (dest))
> +  if (SSE_REG_P (dest)
> +  && !EXT_REX_SSE_REG_P (dest)
> +  && GET_MODE_BITSIZE (GET_MODE (dest)) > 128)
>  {
>bool *used = (bool *) data;
>*used = true;
> @@ -14950,14 +14962,14 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
>return AVX_U128_CLEAN;
>  }
>
> -  subrtx_iterator::array_type array;
> -
>rtx set = single_set (insn);
>if (set)
>  {
>rtx dest = SET_DEST (set);
>rtx src = SET_SRC (set);
> -  if (ix86_check_avx_upper_register (dest))
> +  if (SSE_REG_P (dest)
> + && !EXT_REX_SSE_REG_P (dest)
> + && GET_MODE_BITSIZE (GET_MODE (dest)) > 128)
> {
>   /* This is an YMM/ZMM load.  Return AVX_U128_DIRTY if the
>  source isn't zero.  */
> @@ -14968,9 +14980,8 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
> }
>else
> {
> - FOR_EACH_SUBRTX (iter, array, src, NONCONST)
> -   if (ix86_check_avx_upper_register (*iter))
> - return AVX_U128_DIRTY;
> + if (ix86_check_avx_upper_register (src))
> +   return AVX_U128_DIRTY;
> }
>
>/* This isn't YMM/ZMM load/store.  */
> @@ -14981,9 +14992,8 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
>   Hardware changes state only when a 256bit register is written to,
>   but we need to prevent the compiler from moving optimal insertion
>   point above eventual read from 256bit or 512 bit register.  */
> -  FOR_EACH_SUBRTX (iter, array, PATTERN (insn), NONCONST)
> -if (ix86_check_avx_upper_register (*iter))
> -  return AVX_U128_DIRTY;
> +  if (ix86_check_avx_upper_register (PATTERN (insn)))
> +return AVX_U128_DIRTY;
>
>return AVX_U128_ANY;
>  }
> diff --git a/gcc/testsuite/gcc.target/i386/pr116512.c 
> b/gcc/testsuite/gcc.target/i386/pr116512.c
> new file mode 100644
> index 000..c2bc6c91b64
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116512.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=x86-64-v4 -O2" } */
> +/* { dg-final { scan-assembler-not "vzeroupper" { target { ! ia32 } } } } */
> +
> +#include <immintrin.h>
> +
> +struct B {
> +  union {
> +__m512 f;
> +__m512i s;
> +  };
> +};
> +
> +struct B foo(int n) {
> +  struct B res;
> +  res.s = _mm512_set1_epi32(n);
> +
> +  return res;
> +}
> +
> +__m512i bar(int n) {
> +  struct B res;
> +  res.s = _mm512_set1_epi32(n);
> +
> +  return res.s;
> +}

Re: [PATCH] [x86] Check avx upper register for parallel.

2024-08-29 Thread Uros Bizjak
On Thu, Aug 29, 2024 at 9:33 AM liuhongt  wrote:
>
> For function arguments/return, when it's BLK mode, it's put in a
> parallel with an expr_list, and the expr_list contains the real mode
> and registers.
> Current ix86_check_avx_upper_register only checked for SSE_REG_P, and
> failed to handle that. The patch extends the handling to each subrtx.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/116512
> * config/i386/i386.cc (ix86_avx_u128_mode_entry): Iterate
> each subrtx for potential rtx parallel to check avx upper
> register.
> (ix86_avx_u128_mode_exit): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116512.c: New test.
> ---
>  gcc/config/i386/i386.cc  | 28 
>  gcc/testsuite/gcc.target/i386/pr116512.c | 26 ++
>  2 files changed, 50 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116512.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 224a78cc832..94d1a14056e 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -15148,8 +15148,18 @@ ix86_avx_u128_mode_entry (void)
>  {
>rtx incoming = DECL_INCOMING_RTL (arg);
>
> -  if (incoming && ix86_check_avx_upper_register (incoming))
> -   return AVX_U128_DIRTY;
> +  if (incoming)
> +   {
> + /* construct_container may return a parallel with expr_list
> +which contains the real reg and mode  */
> + subrtx_var_iterator::array_type array;
> + FOR_EACH_SUBRTX_VAR (iter, array, incoming, ALL)
> +   {
> + rtx x = *iter;
> + if (ix86_check_avx_upper_register (x))
> +   return AVX_U128_DIRTY;
> +   }
> +   }
>  }

Can the above loop be a part of ix86_check_avx_upper_register, so this
function would scan the full RTX for avx upper register?

Uros.

>return AVX_U128_CLEAN;
> @@ -15184,8 +15194,18 @@ ix86_avx_u128_mode_exit (void)
>
>/* Exit mode is set to AVX_U128_DIRTY if there are 256bit
>   or 512 bit modes used in the function return register. */
> -  if (reg && ix86_check_avx_upper_register (reg))
> -return AVX_U128_DIRTY;
> +  if (reg)
> +{
> +  /* construct_container may return a parallel with expr_list
> +which contains the real reg and mode  */
> +  subrtx_var_iterator::array_type array;
> +  FOR_EACH_SUBRTX_VAR (iter, array, reg, ALL)
> +   {
> + rtx x = *iter;
> + if (ix86_check_avx_upper_register (x))
> +   return AVX_U128_DIRTY;
> +   }
> +}
>
>/* Exit mode is set to AVX_U128_DIRTY if there are 256bit or 512bit
>   modes used in function arguments, otherwise return AVX_U128_CLEAN.
> diff --git a/gcc/testsuite/gcc.target/i386/pr116512.c 
> b/gcc/testsuite/gcc.target/i386/pr116512.c
> new file mode 100644
> index 000..c2bc6c91b64
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116512.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=x86-64-v4 -O2" } */
> +/* { dg-final { scan-assembler-not "vzeroupper" { target { ! ia32 } } } } */
> +
> +#include <immintrin.h>
> +
> +struct B {
> +  union {
> +__m512 f;
> +__m512i s;
> +  };
> +};
> +
> +struct B foo(int n) {
> +  struct B res;
> +  res.s = _mm512_set1_epi32(n);
> +
> +  return res;
> +}
> +
> +__m512i bar(int n) {
> +  struct B res;
> +  res.s = _mm512_set1_epi32(n);
> +
> +  return res.s;
> +}
> --
> 2.31.1
>


Re: [x86_64 PATCH] Update STV's gains for TImode arithmetic right shifts on AVX2.

2024-08-25 Thread Uros Bizjak
On Sat, Aug 24, 2024 at 5:11 PM Roger Sayle 
wrote:

>
> This patch tweaks timode_scalar_chain::compute_convert_gain to better
> reflect the expansion of V1TImode arithmetic right shifts by the i386
> backend.  The comment "see ix86_expand_v1ti_ashiftrt" appears after
> "case ASHIFTRT" in compute_convert_gain, and the changes below attempt
> to better match the logic used there.
>
> The original motivating example is:
>
> __int128 m1;
> void foo()
> {
>   m1 = (m1 << 8) >> 8;
> }
>
> which with -O2 -mavx2 we fail to convert to vector form due to the
> inappropriate cost of the arithmetic right shift.
>
>   Instruction gain -16 for 7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
>   Total gain: -3
>   Chain #1 conversion is not profitable
>
> This is reporting that the ASHIFTRT is four instructions worse using
> vectors than in scalar form, which is incorrect as the AVX2 expansion
> of this shift only requires three instructions (and the scalar form
> requires two).
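
(For scale: with the usual COSTS_N_INSNS (1) == 4 scaling, two scalar
insns versus three vector insns gives an expected gain of 8 - 12 = -4,
matching the corrected "Instruction gain -4" in the dump below.)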
>
> With more accurate costs in timode_scalar_chain::compute_convert_gain
> we now see (with -O2 -mavx2):
>
>   Instruction gain -4 for 7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
>   Total gain: 9
>   Converting chain #1...
>
> which results in:
>
> foo:    vmovdqa m1(%rip), %xmm0
>         vpslldq $1, %xmm0, %xmm0
>         vpsrad  $8, %xmm0, %xmm1
>         vpsrldq $1, %xmm0, %xmm0
>         vpblendd $7, %xmm0, %xmm1, %xmm0
>         vmovdqa %xmm0, m1(%rip)
>         ret
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  No new testcase (yet) as the code for both the
> vector and scalar forms of the above function are still suboptimal
> so code generation is in flux, but this improvement should be a step
> in the right direction.  Ok for mainline?
>
>
> 2024-08-24  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-features.cc (compute_convert_gain)
> <case ASHIFTRT>: Update to match ix86_expand_v1ti_ashiftrt.
>
> TARGET_AVX2 always implies TARGET_SSE4_1, so there is no need to OR them
> together.
>

OK with above change.

Thanks,
Uros.

>
>


Re: [PATCH] testsuite: i386: Fix g++.target/i386/pr116275-2.C on Solaris/x86

2024-08-20 Thread Uros Bizjak
On Tue, Aug 20, 2024 at 3:06 PM Rainer Orth  
wrote:
>
> The new g++.target/i386/pr116275-2.C test FAILs on 32-bit Solaris/x86:
>
> FAIL: g++.target/i386/pr116275-2.C   scan-assembler vpslld
>
> This happens because Solaris defaults to -mstackrealign, disabling -mstv.
>
> Fixed by disabling the former and enabling the latter.
>
> Tested on i386-pc-solaris2.11 and x86_64-pc-linux-gnu.
>
> Ok for trunk?

OK.

Thanks,
Uros.

> Rainer
>
> --
> -
> Rainer Orth, Center for Biotechnology, Bielefeld University
>
>
> 2024-08-20  Rainer Orth  
>
> gcc/testsuite:
> * g++.target/i386/pr116275-2.C (dg-options): Add -mstv
> -mno-stackrealign.
>


Re: [PATCH] Align predicates for operands[1] between mov<mode> and *mov<mode>_internal.

2024-08-20 Thread Uros Bizjak
On Tue, Aug 20, 2024 at 12:25 PM liuhongt  wrote:
>
> From [1]
> > > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > > source, especially since ix86_expand_vector_mode does have code to
> > > cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> > > predicates anyway, so the predicate will have little effect.
> > >
> > > A workaround would be to check legitimate_constant_p instead of the
> > > predicate, but I'm not sure that that should be necessary.
> > >
> > > Has this already been discussed?  If not, we should loop in the x86
> > > maintainers (but I didn't do that here in case it would be a repeat).
> >
> > I also noticed it. Not sure why movv16qi requires a
> > nonimmediate_operand, while ix86_expand_vector_mode could deal with
> > constant op. Looking forward to Hongtao's comments.
> The code has been there since 2005, before I was involved.
> It looks to me that at the beginning both mov<mode> and
> *mov<mode>_internal only supported nonimmediate_operand for
> operands[1].
> And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to
> nonimmediate_or_sse_const_operand for *mov<mode>_internal, but not for
> mov<mode>. I think we can align the predicate between mov<mode>
> and *mov<mode>_internal.

It looks to me that ix86_expand_vector_move correctly handles
standard_sse_constant_p operands.
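
Roughly, the relevant handling there is (paraphrased sketch; see
ix86_expand_vector_move in i386-expand.cc for the exact code, which
handles more cases):

  if (CONSTANT_P (op1)
      && !standard_sse_constant_p (op1, mode))
    /* Constants other than 0 and all-ones go to the constant pool.  */
    op1 = validize_mem (force_const_mem (mode, op1));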

> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (mov<mode>): Align predicates for
> operands[1] between mov<mode> and *mov<mode>_internal.

OK, but please also change the mov<mode> expander in mmx.md

> ---
>  gcc/config/i386/sse.md | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index d1010bc5682..7ecfbd55809 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1387,7 +1387,7 @@ (define_mode_attr DOUBLEMASKMODE
>
>  (define_expand "mov<mode>"
>[(set (match_operand:VMOVE 0 "nonimmediate_operand")
> -   (match_operand:VMOVE 1 "nonimmediate_operand"))]
> +   (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"))]
>"TARGET_SSE"
>  {
>   ix86_expand_vector_move (<MODE>mode, operands);

Please also change expander in mmx.md to use nonimm_or_0 predicate
(for some reason -1 is not handled here).
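
For illustration, the suggested change might look like this (untested
sketch; the iterator name and insn condition are assumed from the
existing mmx.md expander):

(define_expand "mov<mode>"
  [(set (match_operand:MMXMODE 0 "nonimmediate_operand")
        (match_operand:MMXMODE 1 "nonimm_or_0_operand"))]
  "TARGET_MMX || TARGET_MMX_WITH_SSE"
{
  ix86_expand_vector_move (<MODE>mode, operands);
  DONE;
})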

Thanks,
Uros.


Re: [PATCH v2] [x86] Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need secondary reload.

2024-08-15 Thread Uros Bizjak
On Thu, Aug 15, 2024 at 9:27 AM liuhongt  wrote:
>
> It results in 2 failures for x86_64-pc-linux-gnu{-march=cascadelake};
>
> gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o
> gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1
>
> For pr113560.c, now GCC generates mulx instead of mulq with
> -march=cascadelake, which should be optimal, so adjust testcase for
> that.
> For gcc.target/i386/extendditi3-1.c, RA happens to choose another
> register instead of rax and result in
>
> movq    %rdi, %rbp
> movq    %rdi, %rax
> sarq    $63, %rbp
> movq    %rbp, %rdx
>
> The patch adds a new define_peephole2 for that.
>
> gcc/ChangeLog:
>
> PR target/116274
> * config/i386/i386-expand.cc (ix86_expand_vector_move):
> Restrict special case TImode to 128-bit vector conversions via
> V2DI under ix86_pre_reload_split ().
> * config/i386/i386.cc (inline_secondary_memory_needed):
> Movement between GENERAL_REGS and SSE_REGS for TImode doesn't
> need secondary reload.
> * config/i386/i386.md (*extendsidi2_rex64): Add a
> define_peephole2 after it.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116274.c: New test.
> * gcc.target/i386/pr113560.c: Scan either mulq or mulx.

OK, with updated comment, as proposed below.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-expand.cc   |  2 +-
>  gcc/config/i386/i386.cc  | 18 --
>  gcc/config/i386/i386.md  | 19 +++
>  gcc/testsuite/gcc.target/i386/pr113560.c |  2 +-
>  gcc/testsuite/gcc.target/i386/pr116274.c | 12 
>  5 files changed, 45 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index bdbc1423267..ed546eeed6b 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -751,7 +751,7 @@ ix86_expand_vector_move (machine_mode mode, rtx 
> operands[])
>&& SUBREG_P (op1)
>&& GET_MODE (SUBREG_REG (op1)) == TImode
>&& TARGET_64BIT && TARGET_SSE
> -  && can_create_pseudo_p ())
> +  && ix86_pre_reload_split ())
>  {
>rtx tmp = gen_reg_rtx (V2DImode);
>rtx lo = gen_reg_rtx (DImode);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index f044826269c..4821892d1e0 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -20292,6 +20292,18 @@ inline_secondary_memory_needed (machine_mode mode, 
> reg_class_t class1,
>if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)))
> return true;
>
> +  /* If the target says that inter-unit moves are more expensive
> +than moving through memory, then don't generate them.  */
> +  if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
> + || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
> +   return true;
> +
> +  /* Under SSE4.1, *movti_internal supports movement between
> +SSE_REGS and GENERAL_REGS with pinsrq and pextrq.  */

With SSE4.1, *mov{ti,di}_internal supports moves between
SSE_REGS and GENERAL_REGS using pinsr{q,d} or pextr{q,d}.

> +  if (TARGET_SSE4_1
> + && (TARGET_64BIT ? mode == TImode : mode == DImode))
> +   return false;
> +
>int msize = GET_MODE_SIZE (mode);
>
>/* Between SSE and general, we have moves no larger than word size.  */
> @@ -20304,12 +20316,6 @@ inline_secondary_memory_needed (machine_mode mode, 
> reg_class_t class1,
>
>if (msize < minsize)
> return true;
> -
> -  /* If the target says that inter-unit moves are more expensive
> -than moving through memory, then don't generate them.  */
> -  if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
> - || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
> -   return true;
>  }
>
>return false;
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index db7789c17d2..1962a7ba5c9 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -5041,6 +5041,25 @@ (define_split
>DONE;
>  })
>
> +(define_peephole2
> +  [(set (match_operand:DI 0 "general_reg_operand")
> +   (match_operand:DI 1 "general_reg_operand"))
> +   (parallel [(set (match_dup 0)
> +  (ashiftrt:DI (match_dup 0)
> +   (const_int 63)))
> +  (clobber (reg:CC FLAGS_REG))])
> +   (set (match_operand:DI 2 "general_reg_operand") (match_dup 1))
> +   (set (match_operand:DI 3 "general_reg_operand") (match_dup 0))]
> +  "(optimize_function_for_size_p (cfun) || TARGET_USE_CLTD)
> +   && REGNO (operands[2]) == AX_REG
> +   && REGNO (operands[3]) == DX_REG
> +   && peep2_reg_dead_p (4, operands[0])
> +   && !reg_mentioned_p (operands[0], operands[1])
> +   && !reg_mentioned_p (operan

Re: [x86_64 PATCH] Support wide immediate constants in STV.

2024-08-15 Thread Uros Bizjak
On Thu, Aug 15, 2024 at 11:34 AM Roger Sayle  wrote:
>
>
> As requested this patch is split out from my earlier submission.
> This patch provides more accurate costs/gains for (wide) immediate
> constants in STV, suitably adjusting the costs/gains when the highpart
> and lowpart words are the same.  One minor complication is that the
> middle-end assumes (when generating memset) that SSE constants will
> be shared/amortized across multiple consecutive writes.  Hence to
> avoid testsuite regressions, we add a heuristic that considers an immediate
> constant to be very cheap, if that same immediate value occurs in the
> previous instruction or in the following instruction.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-08-15  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-features.cc (timode_immed_const_gain): New
> function to determine the gain/cost on a CONST_WIDE_INT.
> (local_duplicate_constant_p): Helper function to see if the
> same immediate constant appears in the previous or next insn.
> (timode_scalar_chain::compute_convert_gain): Fix whitespace.
> : Provide more accurate estimates using
> timode_immed_const_gain and local_duplicate_constant_p.
> : Handle CONSTANT_SCALAR_INT_P (src).
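
The heuristic described in the ChangeLog above might be sketched like
this (hypothetical code based on the description; the committed
implementation may differ):

  static bool
  local_duplicate_constant_p (rtx_insn *insn, rtx cst)
  {
    /* Consider CST very cheap if a neighbouring insn sets the same
       immediate value, so its cost can be amortized.  */
    rtx_insn *prev = prev_nonnote_nondebug_insn (insn);
    if (prev && NONJUMP_INSN_P (prev))
      {
        rtx set = single_set (prev);
        if (set && rtx_equal_p (SET_SRC (set), cst))
          return true;
      }
    rtx_insn *next = next_nonnote_nondebug_insn (insn);
    if (next && NONJUMP_INSN_P (next))
      {
        rtx set = single_set (next);
        if (set && rtx_equal_p (SET_SRC (set), cst))
          return true;
      }
    return false;
  }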

LGTM.

Thanks,
Uros.


Re: [x86 PATCH] Improve split of *extendv2di2_highpart_stv_noavx512vl.

2024-08-15 Thread Uros Bizjak
On Thu, Aug 15, 2024 at 11:14 AM Roger Sayle  wrote:
>
>
> This patch follows up on the previous patch to fix PR target/116275 by
> improving the code STV (ultimately) generates for highpart sign extensions
> like (x<<8)>>8.  The arithmetic right shift is able to take advantage of
> the available common subexpressions from the preceding left shift.
>
> Hence previously with -O2 -m32 -mavx -mno-avx512vl we'd generate:
>
> vpsllq  $8, %xmm0, %xmm0
> vpsrad  $8, %xmm0, %xmm1
> vpsrlq  $8, %xmm0, %xmm0
> vpblendw $51, %xmm0, %xmm1, %xmm0
>
> But with improved splitting, we now generate three instructions:
>
> vpslld  $8, %xmm1, %xmm0
> vpsrad  $8, %xmm0, %xmm0
> vpblendw $51, %xmm1, %xmm0, %xmm0
>
> This patch also implements Uros' suggestion that the pre-reload
> splitter could introduced a new pseudo to hold the intermediate
> to potentially help reload with register allocation, which applies
> when not performing the above optimization, i.e. on TARGET_XOP.
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-08-15  Roger Sayle  
> Uros Bizjak  
>
> gcc/ChangeLog
> * config/i386/i386.md (*extendv2di2_highpart_stv_noavx512vl): Split
> to an improved implementation on !TARGET_XOP.  On TARGET_XOP, use
> a new pseudo for the intermediate to simplify register allocation.
>
> gcc/testsuite/ChangeLog
> * g++.target/i386/pr116275-2.C: New test case.

LGTM.

Thanks,
Uros.


Re: [PATCH] [x86] Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need secondary reload.

2024-08-13 Thread Uros Bizjak
On Wed, Aug 14, 2024 at 3:28 AM liuhongt  wrote:
>
> It results in 2 failures for x86_64-pc-linux-gnu{-march=cascadelake};
>
> gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o
> gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1
>
> For pr113560.c, now GCC generates mulx instead of mulq with
> -march=cascadelake, which should be optimal, so adjust testcase for
> that.
> For gcc.target/i386/extendditi3-1.c, RA happens to choose another
> register instead of rax and result in
>
> movq    %rdi, %rbp
> movq    %rdi, %rax
> sarq    $63, %rbp
> movq    %rbp, %rdx
>
> The patch adds a new define_peephole2 for that.
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/116274
> * config/i386/i386-expand.cc (ix86_expand_vector_move):
> Restrict special case TImode to 128-bit vector conversions via
> V2DI under ix86_pre_reload_split ().
> * config/i386/i386.cc (inline_secondary_memory_needed):
> Movement between GENERAL_REGS and SSE_REGS for TImode doesn't
> need secondary reload.
> * config/i386/i386.md (*extendsidi2_rex64): Add a
> define_peephole2 after it.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116274.c: New test.
> * gcc.target/i386/pr113560.c: Scan either mulq or mulx.
> ---
>  gcc/config/i386/i386-expand.cc   |  2 +-
>  gcc/config/i386/i386.cc  |  5 +
>  gcc/config/i386/i386.md  | 19 +++
>  gcc/testsuite/gcc.target/i386/pr113560.c |  2 +-
>  gcc/testsuite/gcc.target/i386/pr116274.c | 12 
>  5 files changed, 38 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index bdbc1423267..ed546eeed6b 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -751,7 +751,7 @@ ix86_expand_vector_move (machine_mode mode, rtx 
> operands[])
>&& SUBREG_P (op1)
>&& GET_MODE (SUBREG_REG (op1)) == TImode
>&& TARGET_64BIT && TARGET_SSE
> -  && can_create_pseudo_p ())
> +  && ix86_pre_reload_split ())
>  {
>rtx tmp = gen_reg_rtx (V2DImode);
>rtx lo = gen_reg_rtx (DImode);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index f044826269c..31fe8a199c9 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -20292,6 +20292,11 @@ inline_secondary_memory_needed (machine_mode mode, 
> reg_class_t class1,
>if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)))
> return true;
>
> +  /* Under SSE4.1, *movti_internal supports movement between
> +SSE_REGS and GENERAL_REGS with pinsrq and pextrq.  */
> +  if (mode == TImode && TARGET_SSE4_1)
> +   return false;

Oh, please also account for TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC here.

BTW: Should we also consider DImode for x86_32 with TARGET_SSE4_1?
*movdi_internal can also move DImode values via pinsrd/pextrd.

Uros.

> +
>int msize = GET_MODE_SIZE (mode);
>
>/* Between SSE and general, we have moves no larger than word size.  */
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index db7789c17d2..1962a7ba5c9 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -5041,6 +5041,25 @@ (define_split
>DONE;
>  })
>
> +(define_peephole2
> +  [(set (match_operand:DI 0 "general_reg_operand")
> +   (match_operand:DI 1 "general_reg_operand"))
> +   (parallel [(set (match_dup 0)
> +  (ashiftrt:DI (match_dup 0)
> +   (const_int 63)))
> +  (clobber (reg:CC FLAGS_REG))])
> +   (set (match_operand:DI 2 "general_reg_operand") (match_dup 1))
> +   (set (match_operand:DI 3 "general_reg_operand") (match_dup 0))]
> +  "(optimize_function_for_size_p (cfun) || TARGET_USE_CLTD)
> +   && REGNO (operands[2]) == AX_REG
> +   && REGNO (operands[3]) == DX_REG
> +   && peep2_reg_dead_p (4, operands[0])
> +   && !reg_mentioned_p (operands[0], operands[1])
> +   && !reg_mentioned_p (operands[2], operands[0])"
> +  [(set (match_dup 2) (match_dup 1))
> +   (parallel [(set (match_dup 3) (ashiftrt:DI (match_dup 2) (const_int 63)))
> + (clobber (reg:CC FLAGS_REG))])])
> +
>  (define_insn "extend<mode>di2"
>[(set (match_operand:DI 0 "register_operand" "=r")
> (sign_extend:DI
> diff --git a/gcc/testsuite/gcc.target/i386/pr113560.c 
> b/gcc/testsuite/gcc.target/i386/pr113560.c
> index ac2e01a4589..9431a2d1d90 100644
> --- a/gcc/testsuite/gcc.target/i386/pr113560.c
> +++ b/gcc/testsuite/gcc.target/i386/pr113560.c
> @@ -11,7 +11,7 @@ __int128 bar(__int128 x, __int128 y)
>return (x & 1000) * (y & 1000);
>  }
>
> -/* { dg-final { scan-assembler-times "\tmulq" 1 } } */
> +/* { dg-final { scan-assembler-times "\tmul\[qx\]" 1 } }

Re: [PATCH] [x86] Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need secondary reload.

2024-08-13 Thread Uros Bizjak
On Wed, Aug 14, 2024 at 3:28 AM liuhongt  wrote:
>
> It results in 2 failures for x86_64-pc-linux-gnu{-march=cascadelake};
>
> gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o
> gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1
>
> For pr113560.c, now GCC generates mulx instead of mulq with
> -march=cascadelake, which should be optimal, so adjust testcase for
> that.
> For gcc.target/i386/extendditi3-1.c, RA happens to choose another
> register instead of rax and result in
>
> movq    %rdi, %rbp
> movq    %rdi, %rax
> sarq    $63, %rbp
> movq    %rbp, %rdx
>
> The patch adds a new define_peephole2 for that.
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/116274
> * config/i386/i386-expand.cc (ix86_expand_vector_move):
> Restrict special case TImode to 128-bit vector conversions via
> V2DI under ix86_pre_reload_split ().
> * config/i386/i386.cc (inline_secondary_memory_needed):
> Movement between GENERAL_REGS and SSE_REGS for TImode doesn't
> need secondary reload.
> * config/i386/i386.md (*extendsidi2_rex64): Add a
> define_peephole2 after it.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116274.c: New test.
> * gcc.target/i386/pr113560.c: Scan either mulq or mulx.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-expand.cc   |  2 +-
>  gcc/config/i386/i386.cc  |  5 +
>  gcc/config/i386/i386.md  | 19 +++
>  gcc/testsuite/gcc.target/i386/pr113560.c |  2 +-
>  gcc/testsuite/gcc.target/i386/pr116274.c | 12 
>  5 files changed, 38 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index bdbc1423267..ed546eeed6b 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -751,7 +751,7 @@ ix86_expand_vector_move (machine_mode mode, rtx 
> operands[])
>&& SUBREG_P (op1)
>&& GET_MODE (SUBREG_REG (op1)) == TImode
>&& TARGET_64BIT && TARGET_SSE
> -  && can_create_pseudo_p ())
> +  && ix86_pre_reload_split ())
>  {
>rtx tmp = gen_reg_rtx (V2DImode);
>rtx lo = gen_reg_rtx (DImode);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index f044826269c..31fe8a199c9 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -20292,6 +20292,11 @@ inline_secondary_memory_needed (machine_mode mode, 
> reg_class_t class1,
>if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)))
> return true;
>
> +  /* Under SSE4.1, *movti_internal supports movement between
> +SSE_REGS and GENERAL_REGS with pinsrq and pextrq.  */
> +  if (mode == TImode && TARGET_SSE4_1)
> +   return false;
> +
>int msize = GET_MODE_SIZE (mode);
>
>/* Between SSE and general, we have moves no larger than word size.  */
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index db7789c17d2..1962a7ba5c9 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -5041,6 +5041,25 @@ (define_split
>DONE;
>  })
>
> +(define_peephole2
> +  [(set (match_operand:DI 0 "general_reg_operand")
> +   (match_operand:DI 1 "general_reg_operand"))
> +   (parallel [(set (match_dup 0)
> +  (ashiftrt:DI (match_dup 0)
> +   (const_int 63)))
> +  (clobber (reg:CC FLAGS_REG))])
> +   (set (match_operand:DI 2 "general_reg_operand") (match_dup 1))
> +   (set (match_operand:DI 3 "general_reg_operand") (match_dup 0))]
> +  "(optimize_function_for_size_p (cfun) || TARGET_USE_CLTD)
> +   && REGNO (operands[2]) == AX_REG
> +   && REGNO (operands[3]) == DX_REG
> +   && peep2_reg_dead_p (4, operands[0])
> +   && !reg_mentioned_p (operands[0], operands[1])
> +   && !reg_mentioned_p (operands[2], operands[0])"
> +  [(set (match_dup 2) (match_dup 1))
> +   (parallel [(set (match_dup 3) (ashiftrt:DI (match_dup 2) (const_int 63)))
> + (clobber (reg:CC FLAGS_REG))])])
> +
>  (define_insn "extend<mode>di2"
>[(set (match_operand:DI 0 "register_operand" "=r")
> (sign_extend:DI
> diff --git a/gcc/testsuite/gcc.target/i386/pr113560.c 
> b/gcc/testsuite/gcc.target/i386/pr113560.c
> index ac2e01a4589..9431a2d1d90 100644
> --- a/gcc/testsuite/gcc.target/i386/pr113560.c
> +++ b/gcc/testsuite/gcc.target/i386/pr113560.c
> @@ -11,7 +11,7 @@ __int128 bar(__int128 x, __int128 y)
>return (x & 1000) * (y & 1000);
>  }
>
> -/* { dg-final { scan-assembler-times "\tmulq" 1 } } */
> +/* { dg-final { scan-assembler-times "\tmul\[qx\]" 1 } } */
>  /* { dg-final { scan-assembler-times "\timulq" 1 } } */
>  /* { dg-final { scan-assembler-not "addq" } } */
>  /* { dg-final { scan-assembler-not "xorl" } } */
> diff --git a/gcc/tests

Re: [x86 PATCH] PR target/116275: Handle STV of *extenddi2_doubleword_highpart

2024-08-11 Thread Uros Bizjak
On Sun, Aug 11, 2024 at 12:16 PM Roger Sayle  wrote:
>
>
> This patch resolves PR target/116275, a recent ICE-on-valid regression on
> -m32 caused by my recent change to enable STV of DImode arithmetic right
> shift on non-AVX512VL targets.  The oversight is that the i386 backend
> contains an *extenddi2_doubleword_highpart instruction (whose pattern
> is an arithmetic right shift of a left shift) that optimizes the case where
> sign-extension need only update the highpart word of a DImode value when
> generating 32-bit code (!TARGET_64BIT).  STV accepts this pattern as a
> candidate, as there are patterns to handle this form of extension on SSE
> using AVX512VL instructions (and previously ASHIFTRT was only allowed on
> AVX512VL).  Now that ASHIFTRT is a candidate on non-AVX512VL targets, we
> either need to check that the first operand is a register, or as done
> below provide the define_insn_and_split that provides a non-AVX512VL
> implementation of *extendv2di_highpart_stv.
>
> The new testcase only ICEed with -m32, so this test could be limited to
> target ia32, but there's no harm also running this test on -m64 to
> provide a little extra test coverage.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-08-11  Roger Sayle  
>
> gcc/ChangeLog
> PR target/116275
> * config/i386/i386.md (*extendv2di2_highpart_stv_noavx512vl): New
> define_insn_and_split to handle the STV conversion of the DImode
> pattern *extenddi2_doubleword_highpart.
>
> gcc/testsuite/ChangeLog
> PR target/116275
> * g++.target/i386/pr116275.C: New test case.

+  [(set (match_dup 0)
+ (ashift:V2DI (match_dup 1) (match_dup 2)))
+   (set (match_dup 0)
+ (ashiftrt:V2DI (match_dup 0) (match_dup 2)))])

Since this pattern is split before reload, you can perhaps introduce a
new V2DI temporary register and use it to output from the first RTX.
This will ease the job of RA a tiny bit.
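
Concretely, something like the following untested sketch:

  [(set (match_dup 3)
        (ashift:V2DI (match_dup 1) (match_dup 2)))
   (set (match_dup 0)
        (ashiftrt:V2DI (match_dup 3) (match_dup 2)))]
  "operands[3] = gen_reg_rtx (V2DImode);"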

OK with or without the above suggestion.

Thanks,
Uros.


Re: [PATCH] i386: Fix up __builtin_ia32_b{extr{, i}_u{32, 64}, zhi_{s, d}i} folding [PR116287]

2024-08-09 Thread Uros Bizjak
On Fri, Aug 9, 2024 at 9:29 AM Jakub Jelinek  wrote:
>
> Hi!
>
> The GENERIC folding of these builtins have cases where it folds to a
> constant regardless of the value of the first operand.  If so, we need
> to use omit_one_operand to avoid throwing away side-effects in the first
> operand if any.  The cases which verify the first argument is INTEGER_CST
> don't need that, INTEGER_CST doesn't have side-effects.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2024-08-09  Jakub Jelinek  
>
> PR target/116287
> * config/i386/i386.cc (ix86_fold_builtin) :
> When folding into zero without checking whether first argument is
> constant, use omit_one_operand.
> (ix86_fold_builtin) : Likewise.
>
> * gcc.target/i386/bmi-pr116287.c: New test.
> * gcc.target/i386/bmi2-pr116287.c: New test.
> * gcc.target/i386/tbm-pr116287.c: New test.

Rubberstamp OK.

Thanks,
Uros.

>
> --- gcc/config/i386/i386.cc.jj  2024-08-01 14:33:28.172801480 +0200
> +++ gcc/config/i386/i386.cc 2024-08-08 12:55:12.780696418 +0200
> @@ -18549,9 +18549,11 @@ ix86_fold_builtin (tree fndecl, int n_ar
>   unsigned int prec = TYPE_PRECISION (TREE_TYPE (args[0]));
>   unsigned int start = tree_to_uhwi (args[1]);
>   unsigned int len = (start & 0xff00) >> 8;
> + tree lhs_type = TREE_TYPE (TREE_TYPE (fndecl));
>   start &= 0xff;
>   if (start >= prec || len == 0)
> -   res = 0;
> +   return omit_one_operand (lhs_type, build_zero_cst (lhs_type),
> +args[0]);
>   else if (!tree_fits_uhwi_p (args[0]))
> break;
>   else
> @@ -18560,7 +18562,7 @@ ix86_fold_builtin (tree fndecl, int n_ar
> len = prec;
>   if (len < HOST_BITS_PER_WIDE_INT)
> res &= (HOST_WIDE_INT_1U << len) - 1;
> - return build_int_cstu (TREE_TYPE (TREE_TYPE (fndecl)), res);
> + return build_int_cstu (lhs_type, res);
> }
>   break;
>
> @@ -18570,15 +18572,17 @@ ix86_fold_builtin (tree fndecl, int n_ar
>   if (tree_fits_uhwi_p (args[1]))
> {
>   unsigned int idx = tree_to_uhwi (args[1]) & 0xff;
> + tree lhs_type = TREE_TYPE (TREE_TYPE (fndecl));
>   if (idx >= TYPE_PRECISION (TREE_TYPE (args[0])))
> return args[0];
>   if (idx == 0)
> -   return build_int_cst (TREE_TYPE (TREE_TYPE (fndecl)), 0);
> +   return omit_one_operand (lhs_type, build_zero_cst (lhs_type),
> +args[0]);
>   if (!tree_fits_uhwi_p (args[0]))
> break;
>   unsigned HOST_WIDE_INT res = tree_to_uhwi (args[0]);
>   res &= ~(HOST_WIDE_INT_M1U << idx);
> - return build_int_cstu (TREE_TYPE (TREE_TYPE (fndecl)), res);
> + return build_int_cstu (lhs_type, res);
> }
>   break;
>
> --- gcc/testsuite/gcc.target/i386/bmi-pr116287.c.jj 2024-08-08 
> 13:14:25.566827913 +0200
> +++ gcc/testsuite/gcc.target/i386/bmi-pr116287.c2024-08-08 
> 13:21:32.718312216 +0200
> @@ -0,0 +1,28 @@
> +/* PR target/116287 */
> +/* { dg-do run { target bmi } } */
> +/* { dg-options "-O2 -mbmi" } */
> +
> +#include <x86intrin.h>
> +
> +#include "bmi-check.h"
> +
> +static void
> +bmi_test ()
> +{
> +  unsigned int a = 0;
> +  if (__builtin_ia32_bextr_u32 (a++, 0) != 0)
> +abort ();
> +  if (__builtin_ia32_bextr_u32 (a++, 0x120) != 0)
> +abort ();
> +  if (a != 2)
> +abort ();
> +#ifdef __x86_64__
> +  unsigned long long b = 0;
> +  if (__builtin_ia32_bextr_u64 (b++, 0) != 0)
> +abort ();
> +  if (__builtin_ia32_bextr_u64 (b++, 0x140) != 0)
> +abort ();
> +  if (b != 2)
> +abort ();
> +#endif
> +}
> --- gcc/testsuite/gcc.target/i386/bmi2-pr116287.c.jj2024-08-08 
> 13:22:55.263246127 +0200
> +++ gcc/testsuite/gcc.target/i386/bmi2-pr116287.c   2024-08-08 
> 13:30:44.851181267 +0200
> @@ -0,0 +1,24 @@
> +/* PR target/116287 */
> +/* { dg-do run { target bmi2 } } */
> +/* { dg-options "-O2 -mbmi2" } */
> +
> +#include <x86intrin.h>
> +
> +#include "bmi2-check.h"
> +
> +static void
> +bmi2_test ()
> +{
> +  unsigned int a = 0;
> +  if (__builtin_ia32_bzhi_si (a++, 0) != 0)
> +abort ();
> +  if (a != 1)
> +abort ();
> +#ifdef __x86_64__
> +  unsigned long long b = 0;
> +  if (__builtin_ia32_bzhi_di (b++, 0) != 0)
> +abort ();
> +  if (b != 1)
> +abort ();
> +#endif
> +}
> --- gcc/testsuite/gcc.target/i386/tbm-pr116287.c.jj 2024-08-08 
> 13:14:48.453532722 +0200
> +++ gcc/testsuite/gcc.target/i386/tbm-pr116287.c2024-08-08 
> 13:22:36.940482770 +0200
> @@ -0,0 +1,29 @@
> +/* PR target/116287 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mtbm -fdump-tree-optimized" } */
> +/* { dg-final { scan-tree-dump-not "link_error \\\

Re: [x86 PATCH] Tweak ix86_mode_can_transfer_bits to restore bootstrap on RHEL.

2024-08-08 Thread Uros Bizjak
On Thu, Aug 8, 2024 at 10:28 AM Roger Sayle  wrote:
>
>
> This minor patch, very similar to one posted and approved previously at
> https://gcc.gnu.org/pipermail/gcc-patches/2024-July/657229.html is
> required to restore builds on systems using gcc 4.8 as a host compiler.
> Using the enumeration constants E_SFmode and E_DFmode avoids issues with
> SFmode and DFmode being "non-literal types in constant expressions".
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, with no new failures.  Ok for mainline?
>
>
> 2024-08-08  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.cc (ix86_mode_can_transfer_bits): Use E_?Fmode
> enumeration constants in switch statement.

OK, also as an obvious patch.

Thanks,
Uros.

>
>
> Thanks in advance,
> Roger
> --
>


Re: [x86_64 PATCH] Support memory destinations and wide immediate constants in STV.

2024-08-06 Thread Uros Bizjak
On Mon, Aug 5, 2024 at 5:50 PM Roger Sayle  wrote:
>
>
> Hi Uros,
> Very many thanks for the quick review and approval.  Here's another.
>
> This patch implements two improvements/refinements to the i386 backend's
> Scalar-To-Vector (STV) pass.  The first is to support memory destinations
> in binary logic operations, and the second is to provide more accurate
> costs/gains for (wide) immediate constants in binary logic operations.

Please do not mix together changes made for different reasons, as
advised in "Contributing to GCC" [1], section "Submitting Patches".

[1] https://gcc.gnu.org/contribute.html

Uros.

>
> A motivating example is gcc.target/i386/movti-2.c:
>
> __int128 m;
> void foo()
> {
> m &= ((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL;
> }
>
> for which STV1 currently generates a warning/error:
> > r100 has non convertible use in insn 6
>
> (insn 5 2 6 2 (set (reg:TI 100)
> (const_wide_int 0x123456789abcdef0123456789abcdef)) "movti-2.c":7:7
> 87 {
> *movti_internal}
>  (nil))
> (insn 6 5 0 2 (parallel [
> (set (mem/c:TI (symbol_ref:DI ("m") [flags 0x2]   0x7f36d1c
> 27c60 m>) [1 m+0 S16 A128])
> (and:TI (mem/c:TI (symbol_ref:DI ("m") [flags 0x2]
>  7f36d1c27c60 m>) [1 m+0 S16 A128])
> (reg:TI 100)))
> (clobber (reg:CC 17 flags))
> ]) "movti-2.c":7:7 645 {*andti3_doubleword}
>  (expr_list:REG_DEAD (reg:TI 100)
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (nil
>
> and therefore generates the following scalar code with -O2 -mavx
>
> foo:    movabsq $81985529216486895, %rax
>         andq    %rax, m(%rip)
>         andq    %rax, m+8(%rip)
>         ret
>
> with this patch we now support read-modify-write instructions (as STV
> candidates), splitting them into explicit read-modify instructions
> followed by an explicit write instruction.  Hence, we now produce
> (when not optimizing for size):
>
> foo:    movabsq $81985529216486895, %rax
>         vmovq   %rax, %xmm0
>         vpunpcklqdq %xmm0, %xmm0, %xmm0
>         vpand   m(%rip), %xmm0, %xmm0
>         vmovdqa %xmm0, m(%rip)
>         ret
>
> This code also handles the const_wide_int in example above, correcting
> the costs/gains when the hi/lo words are the same.  One minor complication
> is that the middle-end assumes (when generating memset) that SSE constants
> will be shared/amortized across multiple consecutive writes.  Hence to
> avoid testsuite regressions, we add a heuristic that considers an immediate
> constant to be very cheap, if that same immediate value occurs in the
> previous instruction or in the instruction after.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-08-05  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-features.cc (timode_immed_const_gain): New
> function to determine the gain/cost on a CONST_WIDE_INT.
> (local_duplicate_constant_p): Helper function to see if the
> same immediate constant appears in the previous or next insn.
> (timode_scalar_chain::compute_convert_gain): Fix whitespace.
> : Provide more accurate estimates using
> timode_immed_const_gain and local_duplicate_constant_p.
> : Handle MEM_P (dst) and CONSTANT_SCALAR_INT_P (src).
> (timode_scalar_to_vector_candidate_p): Support the first operand
> of AND, IOR and XOR being MEM_P (i.e. a read-modify-write insn).
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/movti-2.c: Change dg-options to -Os.
> * gcc.target/i386/movti-4.c: Expected output of original movti-2.c.
>
>
> Thanks again,
> Roger
> --
>


Re: [x86_64 PATCH] Refactor V2DI arithmetic right shift expansion for STV.

2024-08-05 Thread Uros Bizjak
On Mon, Aug 5, 2024 at 12:22 PM Roger Sayle  wrote:
>
>
> This patch refactors ashrv2di RTL expansion into a function so that it may
> be reused by a pre-reload splitter, such that DImode right shifts may be
> considered candidates during the Scalar-To-Vector (STV) pass.  Currently
> DImode arithmetic right shifts are not considered potential candidates
> during STV, so for the following testcase:
>
> long long m;
> typedef long long v2di __attribute__((vector_size (16)));
> void foo(v2di x) { m = x[0]>>63; }
>
> We currently see the following warning/error during STV2
> >  r101 use in insn 7 isn't convertible
>
> And end up generating scalar code with an interunit move:
>
> foo:    movq    %xmm0, %rax
>         sarq    $63, %rax
>         movq    %rax, m(%rip)
>         ret
>
> With this patch, we can reuse the RTL expansion logic and produce:
>
> foo:    psrad   $31, %xmm0
>         pshufd  $245, %xmm0, %xmm0
>         movq    %xmm0, m(%rip)
>         ret
>
> Or with the addition of -mavx2, the equivalent:
>
> foo:    vpxor   %xmm1, %xmm1, %xmm1
>         vpcmpgtq %xmm0, %xmm1, %xmm0
>         vmovq   %xmm0, m(%rip)
>         ret
>
>
> The only design decision of note is the choice to continue lowering V2DI
> into vector sequences during RTL expansion, to enable combine to optimize
> things if possible.  Using just define_insn_and_split potentially misses
> optimizations, such as reusing the zero vector produced by vpxor above.
> It may be necessary to tweak STV's compute gain at some point, but this
> patch controls what's possible (rather than what's beneficial).
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
> 2024-08-05  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_expand_v2di_ashiftrt): New
> function refactored from define_expand ashrv2di3.
> * config/i386/i386-features.cc
> (general_scalar_to_vector_candidate_p)
> <case ASHIFTRT>: Handle like other shifts and rotates.
> * config/i386/i386-protos.h (ix86_expand_v2di_ashiftrt): Prototype.
> * config/i386/sse.md (ashrv2di3): Call ix86_expand_v2di_ashiftrt.
> (*ashrv2di3): New define_insn_and_split to enable creation by stv2
> pass, and splitting during split1 reusing ix86_expand_v2di_ashiftrt.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/sse2-stv-2.c: New test case.

LGTM.

Thanks,
Uros.

>
>
> Thanks in advance,
> Roger
> --
>


Re: [PATCH] Fix mismatch between constraint and predicate for ashl<mode>3_doubleword.

2024-08-01 Thread Uros Bizjak
On Tue, Jul 30, 2024 at 5:05 AM liuhongt  wrote:
>
> (insn 98 94 387 2 (parallel [
> (set (reg:TI 337 [ _32 ])
> (ashift:TI (reg:TI 329)
> (reg:QI 521)))
> (clobber (reg:CC 17 flags))
> ]) "test.c":11:13 953 {ashlti3_doubleword}
>
> is reloaded into
>
> (insn 98 452 387 2 (parallel [
> (set (reg:TI 0 ax [orig:337 _32 ] [337])
> (ashift:TI (const_int 1671291085 [0x639de0cd])
> (reg:QI 2 cx [521])))
> (clobber (reg:CC 17 flags))
>
> since constraint n in the pattern accepts that.
> (Not sure why reload doesn't check predicate)

This is how reload works. It doesn't look at predicates, only at
constraints. To avoid checking errors in later passes, the predicate
should allow a superset of operands compared to the operands of the
constraint. Basically, reload is "refining" the operands to fit
constraints.

OTOH, predicates are used by pre-reload passes, e.g. combine to
combine various instructions. This is the reason sometimes insn has
nonimmediate_operand predicate and "r" constraint - to allow insn
combination while expecting that reload will fix the operand to fit
the constraint. Post-reload passes check both predicates and
constraints; this is where you get failures due to mismatched reload
operand.
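
As an illustration of the idiom (made-up pattern, not from i386.md):

(define_insn "*addsi_example"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "0")
                 (match_operand:SI 2 "nonimmediate_operand" "r")))]
  ""
  "...")

Combine may feed a MEM to operand 2 because the predicate allows it;
reload then loads the value into a register to satisfy the "r"
constraint.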

> (define_insn "ashl<mode>3_doubleword"
>   [(set (match_operand:DWI 0 "register_operand" "=&r,&r")
> (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
> (match_operand:QI 2 "nonmemory_operand" "c,c")))
>
> The patch fixes the mismatch between constraint and predicate.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/116096
> * config/i386/constraints.md (Wc): New constraint for integer
> 1 or -1.
> * config/i386/i386.md (ashl<mode>3_doubleword): Refine
> constraint with Wc.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116096.c: New test.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/constraints.md   |  6 ++
>  gcc/config/i386/i386.md  |  2 +-
>  gcc/testsuite/gcc.target/i386/pr116096.c | 26 
>  3 files changed, 33 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116096.c
>
> diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md
> index 7508d7a58bd..154cbccd09e 100644
> --- a/gcc/config/i386/constraints.md
> +++ b/gcc/config/i386/constraints.md
> @@ -254,6 +254,12 @@ (define_constraint "Wb"
>(and (match_code "const_int")
> (match_test "IN_RANGE (ival, 0, 7)")))
>
> +(define_constraint "Wc"
> +  "Integer constant -1 or 1."
> +  (and (match_code "const_int")
> +   (ior (match_test "op == constm1_rtx")
> +   (match_test "op == const1_rtx"
> +
>  (define_constraint "Ww"
>"Integer constant in the range 0 @dots{} 15, for 16-bit shifts."
>(and (match_code "const_int")
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 6207036a2a0..79d5de5b46a 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -14774,7 +14774,7 @@ (define_insn_and_split "*ashl<mode>3_doubleword_mask_1"
>
>  (define_insn "ashl<mode>3_doubleword"
>[(set (match_operand:DWI 0 "register_operand" "=&r,&r")
> -   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
> +   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0Wc,r")
> (match_operand:QI 2 "nonmemory_operand" "c,c")))
> (clobber (reg:CC FLAGS_REG))]
>""
> diff --git a/gcc/testsuite/gcc.target/i386/pr116096.c 
> b/gcc/testsuite/gcc.target/i386/pr116096.c
> new file mode 100644
> index 000..5ef39805f58
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116096.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O2 -flive-range-shrinkage -fno-peephole2 -mstackrealign 
> -Wno-psabi" } */
> +
> +typedef char U __attribute__((vector_size (32)));
> +typedef unsigned V __attribute__((vector_size (32)));
> +typedef __int128 W __attribute__((vector_size (32)));
> +U g;
> +
> +W baz ();
> +
> +static inline U
> +bar (V x, W y)
> +{
> +  y = y | y << (W) x;
> +  return (U)y;
> +}
> +
> +void
> +foo (W w)
> +{
> +  g = g <<
> +bar ((V){baz ()[1], 3, 3, 5, 7},
> +(W){w[0], ~(int) 2623676210}) >>
> +bar ((V){baz ()[1]},
> +(W){-w[0], ~(int) 2623676210});
> +}
> --
> 2.31.1
>


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Uros Bizjak
On Wed, Jul 31, 2024 at 11:33 AM Richard Biener  wrote:

> > > > > > OK. Richard, can you please mention the above in the comment why
> > > > > > XFmode is rejected in the hook?
> > > > > >
> > > > > > Later, we can perhaps benchmark XFmode move vs. generic memory copy 
> > > > > > to
> > > > > > get some hard data.
> > > > >
> > > > > My (limited) understanding was that the hook would be used only for 
> > > > > cases
> > > > > where we'd like to e.g. value number some SF/DF/XF etc. mode loads 
> > > > > and some
> > > > > subsequent loads from the same address with different mode but same 
> > > > > size
> > > > > the same and replace say int or long long later load with 
> > > > > VIEW_CONVERT_EXPR
> > > > > of the result of the SF/SF mode load.  That is what was incorrect, 
> > > > > because
> > > > > the load didn't preserve all the bits.  The patch would still keep 
> > > > > doing
> > > > > normal SF/DF/XF etc. mode copies if that is all that happens in the 
> > > > > program,
> > > > > load some floating point value and store it elsewhere or as part of 
> > > > > larger
> > > > > aggregate copy.
> > > >
> > > > So, the hook should allow everything besides SF/DFmode, simply:
> > > >
> > > >
> > > > switch (GET_MODE_INNER (mode))
> > > >   {
> > > >   case SFmode:
> > > >   case DFmode:
> > > > /* These suffer from normalization upon load when not using 
> > > > SSE.  */
> > > > return !(ix86_fpmath & FPMATH_387);
> > > >   default:
> > > > return true;
> > > >   }
> > >
> > > OK, I think I'll go with this then.  I'm now unsure whether the
> > > wrapper around the hook should reject modes with padding or if
> > > the supposed users (value-numbering and SRA) should deal with that
> > > issue separately.  I do wonder whether
> > >
> > > ADJUST_FLOAT_FORMAT (XF, (TARGET_128BIT_LONG_DOUBLE
> > >   ? &ieee_extended_intel_128_format
> > >   : TARGET_96_ROUND_53_LONG_DOUBLE
> > >   ? &ieee_extended_intel_96_round_53_format
> > >   : &ieee_extended_intel_96_format));
> > > ADJUST_BYTESIZE  (XF, TARGET_128BIT_LONG_DOUBLE ? 16 : 12);
> > > ADJUST_ALIGNMENT (XF, TARGET_128BIT_LONG_DOUBLE ? 16 : 4);
> > >
> > > unambiguously specifies where the padding is - m68k has
> > >
> > > FRACTIONAL_FLOAT_MODE (XF, 80, 12, ieee_extended_motorola_format);
> > >
> > > It's also not clear we can model a x87 10 byte memory copy in RTL since
> > > a mem:XF still touches 12 or 16 bytes - IIRC a store leaves
> > > possible padding as unspecified and not "masked out" even if
> > > the actual fstp will only store 10 bytes.
> >
> > The hardware will never touch bytes outside 10 bytes range, the
> > padding is some artificial compiler thingy, so IMO it should be
> > handled before the hook is called. Please find attached the source I
> > have used to confirm that a) the copied bits will never be mangled and
> > b) there is no access outside the 10 bytes range. (BTW: these
> > particular values are to test the effect of leading bit 63, the
> > non-hidden normalized bit).
>
> Thanks - I do wonder why GET_MODE_SIZE (XFmode) is not 10 then,
> mode_base_align[XFmode] seems to be correctly set to ensure
> 12 bytes / 16 bytes "effective" size.

FTR, "long double" AKA __float80 is defined as fundamental type in psABI as:

sizeof 12, alignment 4 for i386 [1] and
sizeof 16, alignment 16 for x86_64 [2].

These values are thus set by ABI despite the fact that hardware
handles only 10 bytes.
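
These values are easy to verify (the program below prints "12 4" with
gcc -m32 and "16 16" with gcc -m64 on x86):

  #include <stdio.h>
  int main (void)
  {
    printf ("%zu %zu\n", sizeof (long double), _Alignof (long double));
    return 0;
  }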

[1] Table 2.1, page 8 of https://www.uclibc.org/docs/psABI-i386.pdf
[2] Figure 3.1, page 12 of
https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf

Uros.


Re: [PATCH 2/3] [x86] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Uros Bizjak
On Wed, Jul 31, 2024 at 3:40 PM Richard Biener  wrote:
>
> The following implements the hook, excluding x87 modes for scalar
> and complex float modes.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK this way?
>
> Thanks,
> Richard.
>
> * i386.cc (TARGET_MODE_CAN_TRANSFER_BITS): Define.
> (ix86_mode_can_transfer_bits): New function.

OK.

Thanks for your efforts and your patience to resolve this issue!

Uros.

> ---
>  gcc/config/i386/i386.cc | 22 ++
>  1 file changed, 22 insertions(+)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 12d15feb5e9..9869c44ee15 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -26113,6 +26113,25 @@ ix86_have_ccmp ()
>return (bool) TARGET_APX_CCMP;
>  }
>
> +/* Implement TARGET_MODE_CAN_TRANSFER_BITS.  */
> +static bool
> +ix86_mode_can_transfer_bits (machine_mode mode)
> +{
> +  if (GET_MODE_CLASS (mode) == MODE_FLOAT
> +  || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
> +switch (GET_MODE_INNER (mode))
> +  {
> +  case SFmode:
> +  case DFmode:
> +   /* These suffer from normalization upon load when not using SSE.  */
> +   return !(ix86_fpmath & FPMATH_387);
> +  default:
> +   return true;
> +  }
> +
> +  return true;
> +}
> +
>  /* Target-specific selftests.  */
>
>  #if CHECKING_P
> @@ -26959,6 +26978,9 @@ ix86_libgcc_floating_mode_supported_p
>  #undef TARGET_HAVE_CCMP
>  #define TARGET_HAVE_CCMP ix86_have_ccmp
>
> +#undef TARGET_MODE_CAN_TRANSFER_BITS
> +#define TARGET_MODE_CAN_TRANSFER_BITS ix86_mode_can_transfer_bits
> +
>  static bool
>  ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
>  {
> --
> 2.43.0
>


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Uros Bizjak
On Wed, Jul 31, 2024 at 11:33 AM Richard Biener  wrote:
>
> On Wed, 31 Jul 2024, Uros Bizjak wrote:
>
> > On Wed, Jul 31, 2024 at 10:48 AM Richard Biener  wrote:
> > >
> > > On Wed, 31 Jul 2024, Uros Bizjak wrote:
> > >
> > > > On Wed, Jul 31, 2024 at 10:24 AM Jakub Jelinek  wrote:
> > > > >
> > > > > On Wed, Jul 31, 2024 at 10:11:44AM +0200, Uros Bizjak wrote:
> > > > > > OK. Richard, can you please mention the above in the comment why
> > > > > > XFmode is rejected in the hook?
> > > > > >
> > > > > > Later, we can perhaps benchmark XFmode move vs. generic memory copy 
> > > > > > to
> > > > > > get some hard data.
> > > > >
> > > > > My (limited) understanding was that the hook would be used only for 
> > > > > cases
> > > > > where we'd like to e.g. value number some SF/DF/XF etc. mode loads 
> > > > > and some
> > > > > subsequent loads from the same address with different mode but same 
> > > > > size
> > > > > the same and replace say int or long long later load with 
> > > > > VIEW_CONVERT_EXPR
> > > > > of the result of the SF/SF mode load.  That is what was incorrect, 
> > > > > because
> > > > > the load didn't preserve all the bits.  The patch would still keep 
> > > > > doing
> > > > > normal SF/DF/XF etc. mode copies if that is all that happens in the 
> > > > > program,
> > > > > load some floating point value and store it elsewhere or as part of 
> > > > > larger
> > > > > aggregate copy.
> > > >
> > > > So, the hook should allow everything besides SF/DFmode, simply:
> > > >
> > > >
> > > > switch (GET_MODE_INNER (mode))
> > > >   {
> > > >   case SFmode:
> > > >   case DFmode:
> > > > /* These suffer from normalization upon load when not using 
> > > > SSE.  */
> > > > return !(ix86_fpmath & FPMATH_387);
> > > >   default:
> > > > return true;
> > > >   }
> > >
> > > OK, I think I'll go with this then.  I'm now unsure whether the
> > > wrapper around the hook should reject modes with padding or if
> > > the supposed users (value-numbering and SRA) should deal with that
> > > issue separately.  I do wonder whether
> > >
> > > ADJUST_FLOAT_FORMAT (XF, (TARGET_128BIT_LONG_DOUBLE
> > >   ? &ieee_extended_intel_128_format
> > >   : TARGET_96_ROUND_53_LONG_DOUBLE
> > >   ? &ieee_extended_intel_96_round_53_format
> > >   : &ieee_extended_intel_96_format));
> > > ADJUST_BYTESIZE  (XF, TARGET_128BIT_LONG_DOUBLE ? 16 : 12);
> > > ADJUST_ALIGNMENT (XF, TARGET_128BIT_LONG_DOUBLE ? 16 : 4);
> > >
> > > unambiguously specifies where the padding is - m68k has
> > >
> > > FRACTIONAL_FLOAT_MODE (XF, 80, 12, ieee_extended_motorola_format);
> > >
> > > It's also not clear we can model a x87 10 byte memory copy in RTL since
> > > a mem:XF still touches 12 or 16 bytes - IIRC a store leaves
> > > possible padding as unspecified and not "masked out" even if
> > > the actual fstp will only store 10 bytes.
> >
> > The hardware will never touch bytes outside 10 bytes range, the
> > padding is some artificial compiler thingy, so IMO it should be
> > handled before the hook is called. Please find attached the source I
> > have used to confirm that a) the copied bits will never be mangled and
> > b) there is no access outside the 10 bytes range. (BTW: these
> > particular values are to test the effect of leading bit 63, the
> > non-hidden normalized bit).
>
> Thanks - I do wonder why GET_MODE_SIZE (XFmode) is not 10 then,
> mode_base_align[XFmode] seems to be correctly set to ensure
> 12 bytes / 16 bytes "effective" size.

Uh, this decision predates my involvement in GCC development by a long shot ;)

Uros.


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Uros Bizjak
On Wed, Jul 31, 2024 at 10:48 AM Richard Biener  wrote:
>
> On Wed, 31 Jul 2024, Uros Bizjak wrote:
>
> > On Wed, Jul 31, 2024 at 10:24 AM Jakub Jelinek  wrote:
> > >
> > > On Wed, Jul 31, 2024 at 10:11:44AM +0200, Uros Bizjak wrote:
> > > > OK. Richard, can you please mention the above in the comment why
> > > > XFmode is rejected in the hook?
> > > >
> > > > Later, we can perhaps benchmark XFmode move vs. generic memory copy to
> > > > get some hard data.
> > >
> > > My (limited) understanding was that the hook would be used only for cases
> > > where we'd like to e.g. value number some SF/DF/XF etc. mode loads and 
> > > some
> > > subsequent loads from the same address with different mode but same size
> > > the same and replace say int or long long later load with 
> > > VIEW_CONVERT_EXPR
> > > of the result of the SF/SF mode load.  That is what was incorrect, because
> > > the load didn't preserve all the bits.  The patch would still keep doing
> > > normal SF/DF/XF etc. mode copies if that is all that happens in the 
> > > program,
> > > load some floating point value and store it elsewhere or as part of larger
> > > aggregate copy.
> >
> > So, the hook should allow everything besides SF/DFmode, simply:
> >
> >
> > switch (GET_MODE_INNER (mode))
> >   {
> >   case SFmode:
> >   case DFmode:
> > /* These suffer from normalization upon load when not using SSE.  */
> > return !(ix86_fpmath & FPMATH_387);
> >   default:
> > return true;
> >   }
>
> OK, I think I'll go with this then.  I'm now unsure whether the
> wrapper around the hook should reject modes with padding or if
> the supposed users (value-numbering and SRA) should deal with that
> issue separately.  I do wonder whether
>
> ADJUST_FLOAT_FORMAT (XF, (TARGET_128BIT_LONG_DOUBLE
>   ? &ieee_extended_intel_128_format
>   : TARGET_96_ROUND_53_LONG_DOUBLE
>   ? &ieee_extended_intel_96_round_53_format
>   : &ieee_extended_intel_96_format));
> ADJUST_BYTESIZE  (XF, TARGET_128BIT_LONG_DOUBLE ? 16 : 12);
> ADJUST_ALIGNMENT (XF, TARGET_128BIT_LONG_DOUBLE ? 16 : 4);
>
> unambiguously specifies where the padding is - m68k has
>
> FRACTIONAL_FLOAT_MODE (XF, 80, 12, ieee_extended_motorola_format);
>
> It's also not clear we can model an x87 10-byte memory copy in RTL since
> a mem:XF still touches 12 or 16 bytes - IIRC a store leaves
> possible padding as unspecified and not "masked out" even if
> the actual fstp will only store 10 bytes.

The hardware never touches bytes outside the 10-byte range; the
padding is an artificial compiler construct, so IMO it should be
handled before the hook is called. Please find attached the source I
used to confirm that a) the copied bits are never mangled and
b) there is no access outside the 10-byte range. (BTW: these
particular values test the effect of the leading bit 63, the
non-hidden normalized bit.)

Thanks,
Uros.
int main ()
{
  volatile union cvt
  {
short s[6];
int i[3];
long double d;
  } x, y;

  x.s[5] = 0x5a5a; // guard
  x.s[4] = 0x;
  x.s[3] = 0x4000;
  x.s[2] = 0x1;
  x.s[1] = 0x0;
  x.s[0] = 0x;

  __builtin_printf("%08x %08x %08x\n", x.i[2], x.i[1], x.i[0]);

  y.s[5] = 0xa5a5;  // guard

  /* Force the 80-bit value through an x87 register: the "=t" output and
     "0" input constraints tie both operands to %st(0), so the compiler
     emits an fldt/fstp pair around the empty asm.  */
  asm ("" : "=t" (y.d) : "0" (x.d));

  __builtin_printf("%08x %08x %08x\n", y.i[2], y.i[1], y.i[0]);

  if (y.s[0] != x.s[0]
  || y.s[1] != x.s[1]
  || y.s[2] != x.s[2]
  || y.s[3] != x.s[3]
  || y.s[4] != x.s[4])
__builtin_abort();

  return 0;
}


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Uros Bizjak
On Wed, Jul 31, 2024 at 10:24 AM Jakub Jelinek  wrote:
>
> On Wed, Jul 31, 2024 at 10:11:44AM +0200, Uros Bizjak wrote:
> > OK. Richard, can you please mention the above in the comment why
> > XFmode is rejected in the hook?
> >
> > Later, we can perhaps benchmark XFmode move vs. generic memory copy to
> > get some hard data.
>
> My (limited) understanding was that the hook would be used only for cases
> where we'd like to e.g. value number some SF/DF/XF etc. mode loads and some
> subsequent loads from the same address with different mode but same size
> the same and replace say int or long long later load with VIEW_CONVERT_EXPR
> of the result of the SF/DF/XF mode load.  That is what was incorrect, because
> the load didn't preserve all the bits.  The patch would still keep doing
> normal SF/DF/XF etc. mode copies if that is all that happens in the program,
> load some floating point value and store it elsewhere or as part of larger
> aggregate copy.

So, the hook should allow everything besides SF/DFmode, simply:


switch (GET_MODE_INNER (mode))
  {
  case SFmode:
  case DFmode:
/* These suffer from normalization upon load when not using SSE.  */
return !(ix86_fpmath & FPMATH_387);
  default:
return true;
  }

Uros.
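
For reference, here is the whole hook body with this simplification folded
in, assembled from the snippets in this thread (a sketch, not necessarily
the committed version; the cases are spelled E_SFmode/E_DFmode as a switch
over machine_mode requires):

static bool
ix86_mode_can_transfer_bits (machine_mode mode)
{
  if (GET_MODE_CLASS (mode) == MODE_FLOAT
      || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
    switch (GET_MODE_INNER (mode))
      {
      case E_SFmode:
      case E_DFmode:
        /* These suffer from normalization upon load when not using SSE.  */
        return !(ix86_fpmath & FPMATH_387);
      default:
        return true;
      }

  return true;
}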


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Uros Bizjak
On Wed, Jul 31, 2024 at 10:02 AM Hongtao Liu  wrote:

> > > > > > On Tue, 30 Jul 2024, Richard Biener wrote:
> > > > > >
> > > > > > > > Oh, and please add a small comment why we don't use XFmode here.
> > > > > > >
> > > > > > > Will do.
> > > > > > >
> > > > > > > /* Do not enable XFmode, there is padding in it and it suffers
> > > > > > >from normalization upon load like SFmode and DFmode when
> > > > > > >not using SSE.  */
> > > > > >
> > > > > > Is it really true? I have no evidence of FLDT performing normalization
> > > > > > (as mentioned in PR 114659, if it did, there would be no way to spill/reload
> > > > > > x87 registers).
> > > > >
> > > > > What mangling fld performs depends on the contents of the FP control
> > > > > word which is awkward.  IIRC there's at least a bugreport that it
> > > > > turns sNaN into a qNaN, it seems I was wrong about denormals
> > > > > (when DM is not masked).  And yes, IIRC x87 instability is also
> > > > > related to spills (IIRC we spill in the actual mode of the reg, not in
> > > > > XFmode), but -fexcess-precision=standard should hopefully avoid that.
> > > > > It's also not clear whether all implementations conformed to the
> > > > > specs wrt extended-precision format loads.
> > > >
> > > > FYI, FLDT does not mangle long-double values and does not generate
> > > > exceptions. Please see [1], but ignore shadowed text and instead read
> > > > the "Floating-Point Exceptions" section. So, as far as hardware is
> > > > concerned, it *can* be used to transfer 10-byte values, but I don't
> > > > want to judge from the compiler PoV if this is the way to go. We can
> > > > enable it, perhaps temporarily to experiment a bit - it is easy to
> > > > disable if it causes problems.
> > > >
> > > > Let's CC Intel folks for their opinion, if it is worth using an aging
> > > > x87 to transfer 80-bit data.
> > > I prefer not, in another hook ix86_can_change_mode_class, we have
> > >
> > > 20372  /* x87 registers can't do subreg at all, as all values are reformatted
> > > 20373 to extended precision.  */
> > > 20374  if (MAYBE_FLOAT_CLASS_P (regclass))
> > > 20375    return false;
> >
> > No, the above applies to SFmode subreg of XFmode value, which is a
> > no-go. My question refers to the plain XFmode (80-bit) moves, where
> > x87 is used simply to:
> >
> > fldt mem1
> > ...
> > fstp mem2
> >
> > where x87 is used to perform a move from one 80-bit location to the other.
> >
> > > I guess it eventually needs reload for XFmode.
> >
> > There are no reloads, as we would like to perform bit-exact 80-bit
> > move, e.g. array of 10 chars.
> Oh, It's memory copy.
> I suspect that the hardware doesn't enable memory renaming for x87 instructions.
> So I prefer not.

OK. Richard, can you please mention the above in the comment why
XFmode is rejected in the hook?

Later, we can perhaps benchmark XFmode move vs. generic memory copy to
get some hard data.

Thanks,
Uros.


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Uros Bizjak
On Wed, Jul 31, 2024 at 9:11 AM Hongtao Liu  wrote:
>
> On Wed, Jul 31, 2024 at 1:06 AM Uros Bizjak  wrote:
> >
> > On Tue, Jul 30, 2024 at 3:00 PM Richard Biener  wrote:
> > >
> > > On Tue, 30 Jul 2024, Alexander Monakov wrote:
> > >
> > > >
> > > > On Tue, 30 Jul 2024, Richard Biener wrote:
> > > >
> > > > > > Oh, and please add a small comment why we don't use XFmode here.
> > > > >
> > > > > Will do.
> > > > >
> > > > > /* Do not enable XFmode, there is padding in it and it suffers
> > > > >from normalization upon load like SFmode and DFmode when
> > > > >not using SSE.  */
> > > >
> > > > Is it really true? I have no evidence of FLDT performing normalization
> > > > (as mentioned in PR 114659, if it did, there would be no way to spill/reload
> > > > x87 registers).
> > >
> > > What mangling fld performs depends on the contents of the FP control
> > > word which is awkward.  IIRC there's at least a bugreport that it
> > > turns sNaN into a qNaN, it seems I was wrong about denormals
> > > (when DM is not masked).  And yes, IIRC x87 instability is also
> > > related to spills (IIRC we spill in the actual mode of the reg, not in
> > > XFmode), but -fexcess-precision=standard should hopefully avoid that.
> > > It's also not clear whether all implementations conformed to the
> > > specs wrt extended-precision format loads.
> >
> > FYI, FLDT does not mangle long-double values and does not generate
> > exceptions. Please see [1], but ignore shadowed text and instead read
> > the "Floating-Point Exceptions" section. So, as far as hardware is
> > concerned, it *can* be used to transfer 10-byte values, but I don't
> > want to judge from the compiler PoV if this is the way to go. We can
> > enable it, perhaps temporarily to experiment a bit - it is easy to
> > disable if it causes problems.
> >
> > Let's CC Intel folks for their opinion, if it is worth using an aging
> > x87 to transfer 80-bit data.
> I prefer not, in another hook ix86_can_change_mode_class, we have
>
> 20372  /* x87 registers can't do subreg at all, as all values are reformatted
> 20373 to extended precision.  */
> 20374  if (MAYBE_FLOAT_CLASS_P (regclass))
> 20375    return false;

No, the above applies to SFmode subreg of XFmode value, which is a
no-go. My question refers to the plain XFmode (80-bit) moves, where
x87 is used simply to:

fldt mem1
...
fstp mem2

where x87 is used to perform a move from one 80-bit location to the other.

> I guess it eventually needs reload for XFmode.

There are no reloads, as we would like to perform bit-exact 80-bit
move, e.g. array of 10 chars.

Uros.
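
As an illustration of the kind of move being discussed, a minimal sketch
(hypothetical example, not from this thread): a bit-exact copy of only the
10 value bytes, which an XFmode fldt/fstp pair could in principle implement:

#include <string.h>

/* Copy exactly the 10 value bytes of an x87 extended-precision number;
   any padding bytes of the 12/16-byte XFmode object are left untouched.  */
void
copy80 (long double *dst, const long double *src)
{
  memcpy (dst, src, 10);
}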


[committed] i386/testsuite: Add testcase for fixed PR [PR51492]

2024-07-30 Thread Uros Bizjak
PR target/51492

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr51492.c: New test.

Tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/testsuite/gcc.target/i386/pr51492.c 
b/gcc/testsuite/gcc.target/i386/pr51492.c
new file mode 100644
index 000..0892e0c79a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr51492.c
@@ -0,0 +1,19 @@
+/* PR target/51492 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -msse2" } */
+
+#define SIZE 65536
+#define WSIZE 64
+unsigned short head[SIZE] __attribute__((aligned(64)));
+
+void
+f(void)
+{
+  for (unsigned n = 0; n < SIZE; ++n) {
+unsigned short m = head[n];
+head[n] = (unsigned short)(m >= WSIZE ? m-WSIZE : 0);
+  }
+}
+
+/* { dg-final { scan-assembler "psubusw" } } */
+/* { dg-final { scan-assembler-not "paddw" } } */


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Uros Bizjak
On Tue, Jul 30, 2024 at 3:00 PM Richard Biener  wrote:
>
> On Tue, 30 Jul 2024, Alexander Monakov wrote:
>
> >
> > On Tue, 30 Jul 2024, Richard Biener wrote:
> >
> > > > Oh, and please add a small comment why we don't use XFmode here.
> > >
> > > Will do.
> > >
> > > /* Do not enable XFmode, there is padding in it and it suffers
> > >from normalization upon load like SFmode and DFmode when
> > >not using SSE.  */
> >
> > Is it really true? I have no evidence of FLDT performing normalization
> > (as mentioned in PR 114659, if it did, there would be no way to spill/reload
> > x87 registers).
>
> What mangling fld performs depends on the contents of the FP control
> word which is awkward.  IIRC there's at least a bugreport that it
> turns sNaN into a qNaN, it seems I was wrong about denormals
> (when DM is not masked).  And yes, IIRC x87 instability is also
> related to spills (IIRC we spill in the actual mode of the reg, not in
> XFmode), but -fexcess-precision=standard should hopefully avoid that.
> It's also not clear whether all implementations conformed to the
> specs wrt extended-precision format loads.

FYI, FLDT does not mangle long-double values and does not generate
exceptions. Please see [1], but ignore shadowed text and instead read
the "Floating-Point Exceptions" section. So, as far as hardware is
concerned, it *can* be used to transfer 10-byte values, but I don't
want to judge from the compiler PoV if this is the way to go. We can
enable it, perhaps temporarily to experiment a bit - it is easy to
disable if it causes problems.

Let's CC Intel folks for their opinion, if it is worth using an aging
x87 to transfer 80-bit data.

[1] https://www.felixcloutier.com/x86/fld

Uros.


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Uros Bizjak
On Tue, Jul 30, 2024 at 1:07 PM Uros Bizjak  wrote:
>
> On Tue, Jul 30, 2024 at 12:18 PM Richard Biener  wrote:
> >
> > The following implements the hook, excluding x87 modes for scalar
> > and complex float modes.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > OK?
> >
> > Thanks,
> > Richard.
> >
> > * i386.cc (TARGET_MODE_CAN_TRANSFER_BITS): Define.
> > (ix86_mode_can_transfer_bits): New function.
> > ---
> >  gcc/config/i386/i386.cc | 21 +
> >  1 file changed, 21 insertions(+)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 12d15feb5e9..5184366916b 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -26113,6 +26113,24 @@ ix86_have_ccmp ()
> >return (bool) TARGET_APX_CCMP;
> >  }
> >
> > +/* Implement TARGET_MODE_CAN_TRANSFER_BITS.  */
> > +static bool
> > +ix86_mode_can_transfer_bits (machine_mode mode)
> > +{
> > +  if (GET_MODE_CLASS (mode) == MODE_FLOAT
> > +  || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
> > +switch (GET_MODE_INNER (mode))
> > +  {
> > +  case SFmode:
> > +  case DFmode:
> > +   return TARGET_SSE_MATH && !TARGET_MIX_SSE_I387;
>
> This can be simplified to:
>
> return !(ix86_fpmath & FPMATH_387);
>
> (Which implies that we should introduce TARGET_I387_MATH to parallel
> TARGET_SSE_MATH some day...)
>
> > +  default:
> > +   return false;
>
> We don't want to enable HFmode for transfers?

Oh, and please add a small comment why we don't use XFmode here.

Uros.


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Uros Bizjak
On Tue, Jul 30, 2024 at 12:18 PM Richard Biener  wrote:
>
> The following implements the hook, excluding x87 modes for scalar
> and complex float modes.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK?
>
> Thanks,
> Richard.
>
> * i386.cc (TARGET_MODE_CAN_TRANSFER_BITS): Define.
> (ix86_mode_can_transfer_bits): New function.
> ---
>  gcc/config/i386/i386.cc | 21 +
>  1 file changed, 21 insertions(+)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 12d15feb5e9..5184366916b 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -26113,6 +26113,24 @@ ix86_have_ccmp ()
>return (bool) TARGET_APX_CCMP;
>  }
>
> +/* Implement TARGET_MODE_CAN_TRANSFER_BITS.  */
> +static bool
> +ix86_mode_can_transfer_bits (machine_mode mode)
> +{
> +  if (GET_MODE_CLASS (mode) == MODE_FLOAT
> +  || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
> +switch (GET_MODE_INNER (mode))
> +  {
> +  case SFmode:
> +  case DFmode:
> +   return TARGET_SSE_MATH && !TARGET_MIX_SSE_I387;

This can be simplified to:

return !(ix86_fpmath & FPMATH_387);

(Which implies that we should introduce TARGET_I387_MATH to parallel
TARGET_SSE_MATH some day...)

> +  default:
> +   return false;

We don't want to enable HFmode for transfers?

Uros.

> +  }
> +
> +  return true;
> +}
> +
>  /* Target-specific selftests.  */
>
>  #if CHECKING_P
> @@ -26959,6 +26977,9 @@ ix86_libgcc_floating_mode_supported_p
>  #undef TARGET_HAVE_CCMP
>  #define TARGET_HAVE_CCMP ix86_have_ccmp
>
> +#undef TARGET_MODE_CAN_TRANSFER_BITS
> +#define TARGET_MODE_CAN_TRANSFER_BITS ix86_mode_can_transfer_bits
> +
>  static bool
>  ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
>  {
> --
> 2.43.0
>


Re: [PATCH v2] i386: Change prefetchi output template

2024-07-22 Thread Uros Bizjak
On Tue, Jul 23, 2024 at 4:59 AM Haochen Jiang  wrote:
>
> Hi all,
>
> I tested with %a and it works. Therefore I suppose it is a better solution.
>
> Bootstrapped and regtested on x86-64-pc-linux-gnu. Ok for trunk and backport
> to GCC 13 and 14?

OK, also for backports.

Thanks,
Uros.

>
> Thx,
> Haochen
>
> ---
>
> Changes in v2: Use %a in pattern
>
> ---
>
> For prefetchi instructions, a RIP-relative address is explicitly required
> for the operand, and the assembler enforces that rule strictly.  This makes
> an instruction like:
>
> prefetchit0 bar
>
> illegal for the assembler, even though it should be a common usage of
> prefetchi.
>
> Change the template to %a to explicitly add (%rip) after the function label,
> making it legal for the assembler so that the linker can resolve the real
> address.
>
> gcc/ChangeLog:
>
> * config/i386/i386.md (prefetchi): Change to %a.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/prefetchi-1.c: Check (%rip).
> ---
>  gcc/config/i386/i386.md | 2 +-
>  gcc/testsuite/gcc.target/i386/prefetchi-1.c | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 90d3aa450f0..6207036a2a0 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -28004,7 +28004,7 @@
>"TARGET_PREFETCHI && TARGET_64BIT"
>  {
>static const char * const patterns[2] = {
> -"prefetchit1\t%0", "prefetchit0\t%0"
> +"prefetchit1\t%a0", "prefetchit0\t%a0"
>};
>
>int locality = INTVAL (operands[1]);
> diff --git a/gcc/testsuite/gcc.target/i386/prefetchi-1.c 
> b/gcc/testsuite/gcc.target/i386/prefetchi-1.c
> index 80f25e70e8e..03dfdc55e86 100644
> --- a/gcc/testsuite/gcc.target/i386/prefetchi-1.c
> +++ b/gcc/testsuite/gcc.target/i386/prefetchi-1.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-mprefetchi -O2" } */
> -/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit0\[ \\t\]+" 2 } } */
> -/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit1\[ \\t\]+" 2 } } */
> +/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit0\[ 
> \\t\]+bar\\(%rip\\)" 2 } } */
> +/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit1\[ 
> \\t\]+bar\\(%rip\\)" 2 } } */
>
>  #include 
>
> --
> 2.31.1
>
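
As a hedged illustration of the fix (assuming the _m_prefetchit0 intrinsic
from <x86gprintrin.h>; compile with -mprefetchi on a 64-bit target):

#include <x86gprintrin.h>

extern int bar (int);

void
foo (void)
{
  /* With the "prefetchit0\t%a0" template this emits
     "prefetchit0 bar(%rip)"; the previous "%0" emitted
     "prefetchit0 bar", which the assembler rejects.  */
  _m_prefetchit0 ((void *) bar);
}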


Re: [PATCH] Relax ix86_hardreg_mov_ok after split1.

2024-07-22 Thread Uros Bizjak
On Tue, Jul 23, 2024 at 3:08 AM liuhongt  wrote:
>
> ix86_hardreg_mov_ok was added by r11-5066-gbe39636d9f68c4
>
> >The solution proposed here is to have the x86 backend/recog prevent
> >early RTL passes composing instructions (that set likely_spilled hard
> >registers) that they (combine) can't simplify, until after reload.
> >We allow sets from pseudo registers, immediate constants and memory
> >accesses, but anything more complicated is performed via a temporary
> >pseudo.  Not only does this simplify things for the register allocator,
> >but any remaining register-to-register moves are easily cleaned up
> >by the late optimization passes after reload, such as peephole2 and
> >cprop_hardreg.
>
> The restriction is mainly for rtl optimization passes before pass_combine.
>
> But split1 splits
>
> ```
> (insn 17 13 18 2 (set (reg/i:V4SI 20 xmm0)
> (vec_merge:V4SI (const_vector:V4SI [
> (const_int -1 [0x]) repeated x4
> ])
> (const_vector:V4SI [
> (const_int 0 [0]) repeated x4
> ])
> (unspec:QI [
> (reg:V4SF 106)
> (reg:V4SF 102)
> (const_int 0 [0])
> ] UNSPEC_PCMP))) "/app/example.cpp":20:1 2929 {*avx_cmpv4sf3_1}
>  (expr_list:REG_DEAD (reg:V4SF 102)
> (expr_list:REG_DEAD (reg:V4SF 106)
> (nil
> ```
>
> into:
> ```
> (insn 23 13 24 2 (set (reg:V4SF 107)
> (unspec:V4SF [
> (reg:V4SF 106)
> (reg:V4SF 102)
> (const_int 0 [0])
> ] UNSPEC_PCMP)) "/app/example.cpp":20:1 -1
>  (nil))
> (insn 24 23 18 2 (set (reg/i:V4SI 20 xmm0)
> (subreg:V4SI (reg:V4SF 107) 0)) "/app/example.cpp":20:1 -1
>  (nil))
> ```
>
> There are many splitters generating MOV insns with SUBREG that would have
> the same problem.
> Instead of changing those splitters one by one, the patch relaxes
> ix86_hardreg_mov_ok to allow a mov of a subreg to a hard register after
> split1.  ix86_pre_reload_split () is used to replace
> !reload_completed && !lra_in_progress.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (ix86_hardreg_mov_ok): Relax mov subreg
> to hard register after split1.
>
> gcc/testsuite/ChangeLog:
>
> * g++.target/i386/pr115982.C: New test.

LGTM, but please watch out for fallout.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386.cc  |  5 ++---
>  gcc/testsuite/g++.target/i386/pr115982.C | 11 +++
>  2 files changed, 13 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/i386/pr115982.C
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 9c2ebe74fc9..77c441893b4 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -20212,7 +20212,7 @@ ix86_class_likely_spilled_p (reg_class_t rclass)
>  }
>
>  /* Return true if a set of DST by the expression SRC should be allowed.
> -   This prevents complex sets of likely_spilled hard regs before reload.  */
> +   This prevents complex sets of likely_spilled hard regs before split1.  */
>
>  bool
>  ix86_hardreg_mov_ok (rtx dst, rtx src)
> @@ -20224,8 +20224,7 @@ ix86_hardreg_mov_ok (rtx dst, rtx src)
>? standard_sse_constant_p (src, GET_MODE (dst))
>: x86_64_immediate_operand (src, GET_MODE (dst)))
>&& ix86_class_likely_spilled_p (REGNO_REG_CLASS (REGNO (dst)))
> -  && !reload_completed
> -  && !lra_in_progress)
> +  && ix86_pre_reload_split ())
>  return false;
>return true;
>  }
> diff --git a/gcc/testsuite/g++.target/i386/pr115982.C 
> b/gcc/testsuite/g++.target/i386/pr115982.C
> new file mode 100644
> index 000..4b91618405d
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/i386/pr115982.C
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512vl -O2" } */
> +
> +typedef float VF __attribute__((__vector_size__(16)));
> +typedef int VI __attribute__((__vector_size__(16)));
> +
> +VI
> +foo (VF x)
> +{
> +  return !x;
> +}
> --
> 2.31.1
>
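
For context, ix86_pre_reload_split () is roughly the following (quoted from
memory of i386.cc, so treat the exact body as indicative); the extra
PROP_rtl_split_insns check is precisely what limits the hardreg-mov
restriction to passes before split1:

bool
ix86_pre_reload_split (void)
{
  return (can_create_pseudo_p ()
          && !(cfun->curr_properties & PROP_rtl_split_insns));
}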


[committed] libatomic: Handle AVX+CX16 ZHAOXIN like intel for 16b atomic [PR104688]

2024-07-18 Thread Uros Bizjak
From: mayshao 

PR target/104688

libatomic/ChangeLog:

* config/x86/init.c (__libat_feat1_init): Don't clear
bit_AVX on ZHAOXIN CPUs.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/libatomic/config/x86/init.c b/libatomic/config/x86/init.c
index 26168d46832..c6ce997a5af 100644
--- a/libatomic/config/x86/init.c
+++ b/libatomic/config/x86/init.c
@@ -41,11 +41,15 @@ __libat_feat1_init (void)
{
  /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned
 address is atomic, and AMD is going to do something similar soon.
-We don't have a guarantee from vendors of other CPUs with AVX,
-like Zhaoxin and VIA.  */
+Zhaoxin also guarantees this.  We don't have a guarantee
+from vendors of other CPUs with AVX, like VIA.  */
+ unsigned int family = (eax >> 8) & 0x0f;
  unsigned int ecx2;
  __cpuid (0, eax, ebx, ecx2, edx);
- if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
+ if (ecx2 != signature_INTEL_ecx
+ && ecx2 != signature_AMD_ecx
+ && !(ecx2 == signature_CENTAUR_ecx && family > 6)
+ && ecx2 != signature_SHANGHAI_ecx)
FEAT1_REGISTER &= ~bit_AVX;
}
 #endif
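
Read positively, the vendor check above amounts to the helper below, using
the signature constants from <cpuid.h> (vendor_has_atomic_16b_avx is an
invented name for illustration):

#include <cpuid.h>
#include <stdbool.h>

/* True when CPUID leaf 0's ECX word identifies a vendor that documents
   16-byte atomicity for aligned AVX loads and stores.  Zhaoxin CPUs
   report either the Centaur vendor string with family > 6 or their own
   Shanghai vendor string.  */
static bool
vendor_has_atomic_16b_avx (unsigned int ecx0, unsigned int family)
{
  return ecx0 == signature_INTEL_ecx
         || ecx0 == signature_AMD_ecx
         || (ecx0 == signature_CENTAUR_ecx && family > 6)
         || ecx0 == signature_SHANGHAI_ecx;
}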


[committed] libatomic: Improve cpuid usage in __libat_feat1_init

2024-07-18 Thread Uros Bizjak
Check the result of __get_cpuid and process FEAT1_REGISTER only when
__get_cpuid returns success.  Use __cpuid instead of nested __get_cpuid.

libatomic/ChangeLog:

* config/x86/init.c (__libat_feat1_init): Check the result of
__get_cpuid and process FEAT1_REGISTER only when __get_cpuid
returns success.  Use __cpuid instead of nested __get_cpuid.

Bootstrapped and regression tested libatomic on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/libatomic/config/x86/init.c b/libatomic/config/x86/init.c
index a75be3f175c..26168d46832 100644
--- a/libatomic/config/x86/init.c
+++ b/libatomic/config/x86/init.c
@@ -33,21 +33,23 @@ __libat_feat1_init (void)
 {
   unsigned int eax, ebx, ecx, edx;
   FEAT1_REGISTER = 0;
-  __get_cpuid (1, &eax, &ebx, &ecx, &edx);
-#ifdef __x86_64__
-  if ((FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
-  == (bit_AVX | bit_CMPXCHG16B))
+  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx))
 {
-  /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned address
-is atomic, and AMD is going to do something similar soon.
-We don't have a guarantee from vendors of other CPUs with AVX,
-like Zhaoxin and VIA.  */
-  unsigned int ecx2 = 0;
-  __get_cpuid (0, &eax, &ebx, &ecx2, &edx);
-  if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
-   FEAT1_REGISTER &= ~bit_AVX;
-}
+#ifdef __x86_64__
+  if ((FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
+ == (bit_AVX | bit_CMPXCHG16B))
+   {
+ /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned
+address is atomic, and AMD is going to do something similar soon.
+We don't have a guarantee from vendors of other CPUs with AVX,
+like Zhaoxin and VIA.  */
+ unsigned int ecx2;
+ __cpuid (0, eax, ebx, ecx2, edx);
+ if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
+   FEAT1_REGISTER &= ~bit_AVX;
+   }
 #endif
+}
   /* See the load in load_feat1.  */
   __atomic_store_n (&__libat_feat1, FEAT1_REGISTER, __ATOMIC_RELAXED);
   return FEAT1_REGISTER;
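
A standalone sketch of the <cpuid.h> calling convention relied on here
(hypothetical demo, not part of the patch): __get_cpuid validates the leaf
and returns nonzero on success, while __cpuid reads a leaf unconditionally,
which is safe once a higher leaf has been validated:

#include <cpuid.h>
#include <stdio.h>

int
main (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* Leaf 1: feature bits.  Guarded by __get_cpuid.  */
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return 1;

  /* Leaf 0 must exist if leaf 1 does, so plain __cpuid is enough.  */
  unsigned int ecx0;
  __cpuid (0, eax, ebx, ecx0, edx);

  printf ("vendor ecx word: %#x  AVX: %d  CMPXCHG16B: %d\n",
          ecx0, !!(ecx & bit_AVX), !!(ecx & bit_CMPXCHG16B));
  return 0;
}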


Re: [PATCH v2] [libatomic]: Handle AVX+CX16 ZHAOXIN like intel for 16b atomic [PR104688]

2024-07-18 Thread Uros Bizjak
On Thu, Jul 18, 2024 at 2:07 PM Jakub Jelinek  wrote:
>
> On Thu, Jul 18, 2024 at 01:57:11PM +0200, Uros Bizjak wrote:
> > Attached patch illustrates the proposed improvement with nested cpuid
> > calls. Bootstrapped and teased with libatomic testsuite.
> >
> > Jakub, WDYT?
>
> I'd probably keep the FEAT1_REGISTER = 0; before the if (__get_cpuid (1, ...)
> to avoid the else, I think that could result in smaller code, but otherwise

OK, I'll keep the initialization this way.

> LGTM, especially the use of just __cpuid there.  And note your patch doesn't
> incorporate the Zhaoxin changes.

This will be a separate patch.

Thanks,
Uros.

>
> > diff --git a/libatomic/config/x86/init.c b/libatomic/config/x86/init.c
> > index a75be3f175c..94d45683567 100644
> > --- a/libatomic/config/x86/init.c
> > +++ b/libatomic/config/x86/init.c
> > @@ -32,22 +32,25 @@ unsigned int
> >  __libat_feat1_init (void)
> >  {
> >unsigned int eax, ebx, ecx, edx;
> > -  FEAT1_REGISTER = 0;
> > -  __get_cpuid (1, &eax, &ebx, &ecx, &edx);
> > -#ifdef __x86_64__
> > -  if ((FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
> > -  == (bit_AVX | bit_CMPXCHG16B))
> > +  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx))
> >  {
> > -  /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned address
> > -  is atomic, and AMD is going to do something similar soon.
> > -  We don't have a guarantee from vendors of other CPUs with AVX,
> > -  like Zhaoxin and VIA.  */
> > -  unsigned int ecx2 = 0;
> > -  __get_cpuid (0, &eax, &ebx, &ecx2, &edx);
> > -  if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
> > - FEAT1_REGISTER &= ~bit_AVX;
> > -}
> > +#ifdef __x86_64__
> > +  if ((FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
> > +   == (bit_AVX | bit_CMPXCHG16B))
> > + {
> > +   /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned
> > +  address is atomic, and AMD is going to do something similar soon.
> > +  We don't have a guarantee from vendors of other CPUs with AVX,
> > +  like Zhaoxin and VIA.  */
> > +   unsigned int ecx2;
> > +   __cpuid (0, eax, ebx, ecx2, edx);
> > +   if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
> > + FEAT1_REGISTER &= ~bit_AVX;
> > + }
> >  #endif
> > +}
> > +  else
> > +FEAT1_REGISTER = 0;
> >/* See the load in load_feat1.  */
> >__atomic_store_n (&__libat_feat1, FEAT1_REGISTER, __ATOMIC_RELAXED);
> >return FEAT1_REGISTER;
>
>
> Jakub
>


Re: [PATCH v2] [libatomic]: Handle AVX+CX16 ZHAOXIN like intel for 16b atomic [PR104688]

2024-07-18 Thread Uros Bizjak
On Thu, Jul 18, 2024 at 10:31 AM Uros Bizjak  wrote:
>
> On Thu, Jul 18, 2024 at 10:21 AM Jakub Jelinek  wrote:
> >
> > On Thu, Jul 18, 2024 at 10:12:46AM +0200, Uros Bizjak wrote:
> > > On Thu, Jul 18, 2024 at 9:50 AM Jakub Jelinek  wrote:
> > > >
> > > > On Thu, Jul 18, 2024 at 09:34:14AM +0200, Uros Bizjak wrote:
> > > > > > > +  unsigned int ecx2 = 0, family = 0;
> > > > >
> > > > > No need to initialize these two variables.
> > > >
> > > > The function ignores __get_cpuid result, so at least the
> > > > FEAT1_REGISTER = 0; is needed before the first __get_cpuid.
> > > > Do you mean the ecx2 = 0 initialization is useless because
> > > > __get_cpuid (0, ...) on x86_64 will always succeed (especially
> > > > when __get_cpuid (1, ...) had to succeed otherwise FEAT1_REGISTER
> > > > would be zero)?
> > > > I guess that is true, but won't that cause -Wmaybe-uninitialized warnings?
> > >
> > > Yes, if the __get_cpuid (1, ...) works OK, then we are sure that
> > > __get_cpuid (0, ...) will also work.
> > >
> > > > I agree initializing family to 0 is not needed, but I don't understand
> > > > why it isn't just
> > > >   unsigned family = (eax >> 8) & 0x0f;
> > > > Though, guess even that might fail with -Wmaybe-uninitialized too, as
> > > > eax isn't unconditionally initialized.
> > >
> > > Perhaps we should check the result of __get_cpuid (1, ...) and use eax
> > > only if the function returns 1? IMO, this would solve the
> > > uninitialized issue, and we could use __cpuid in the second case (we
> > > would know that leaf 0 is supported, because leaf 1 support was
> > > checked with __get_cpuid (1, ...)).
> >
> > We know the code is ok if FEAT1_REGISTER = 0; is done before __get_cpuid (1, ...).
> > Everything else is implied from it, all we need to ensure is that
> > -Wmaybe-uninitialized is happy about it.
> > Whatever doesn't report the warning and ideally doesn't increase the size of
> > the function.
> > I think the reason it is written the way it is before the AVX hacks in it
> > is that we need to handle even the case when __get_cpuid (1, ...) returns 0,
> > and we want in that case FEAT1_REGISTER = 0.
> > So it could be
>
> Yes, I think this is better, see below.
>
> >   FEAT1_REGISTER = 0;
> > #ifdef __x86_64__
> >   if (__get_cpuid (1, &eax, &ebx, &ecx, &edx)
> >   && (FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
> >   == (bit_AVX | bit_CMPXCHG16B))
> > {
>
> Here we can simply use

Attached patch illustrates the proposed improvement with nested cpuid
calls. Bootstrapped and teased with libatomic testsuite.

Jakub, WDYT?

Uros.
diff --git a/libatomic/config/x86/init.c b/libatomic/config/x86/init.c
index a75be3f175c..94d45683567 100644
--- a/libatomic/config/x86/init.c
+++ b/libatomic/config/x86/init.c
@@ -32,22 +32,25 @@ unsigned int
 __libat_feat1_init (void)
 {
   unsigned int eax, ebx, ecx, edx;
-  FEAT1_REGISTER = 0;
-  __get_cpuid (1, &eax, &ebx, &ecx, &edx);
-#ifdef __x86_64__
-  if ((FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
-  == (bit_AVX | bit_CMPXCHG16B))
+  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx))
 {
-  /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned address
-is atomic, and AMD is going to do something similar soon.
-We don't have a guarantee from vendors of other CPUs with AVX,
-like Zhaoxin and VIA.  */
-  unsigned int ecx2 = 0;
-  __get_cpuid (0, &eax, &ebx, &ecx2, &edx);
-  if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
-   FEAT1_REGISTER &= ~bit_AVX;
-}
+#ifdef __x86_64__
+  if ((FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
+ == (bit_AVX | bit_CMPXCHG16B))
+   {
+ /* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned
+address is atomic, and AMD is going to do something similar soon.
+We don't have a guarantee from vendors of other CPUs with AVX,
+like Zhaoxin and VIA.  */
+ unsigned int ecx2;
+ __cpuid (0, eax, ebx, ecx2, edx);
+ if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
+   FEAT1_REGISTER &= ~bit_AVX;
+   }
 #endif
+}
+  else
+FEAT1_REGISTER = 0;
   /* See the load in load_feat1.  */
   __atomic_store_n (&__libat_feat1, FEAT1_REGISTER, __ATOMIC_RELAXED);
   return FEAT1_REGISTER;


Re: [PATCH v2] [libatomic]: Handle AVX+CX16 ZHAOXIN like intel for 16b atomic [PR104688]

2024-07-18 Thread Uros Bizjak
On Thu, Jul 18, 2024 at 10:21 AM Jakub Jelinek  wrote:
>
> On Thu, Jul 18, 2024 at 10:12:46AM +0200, Uros Bizjak wrote:
> > On Thu, Jul 18, 2024 at 9:50 AM Jakub Jelinek  wrote:
> > >
> > > On Thu, Jul 18, 2024 at 09:34:14AM +0200, Uros Bizjak wrote:
> > > > > > +  unsigned int ecx2 = 0, family = 0;
> > > >
> > > > No need to initialize these two variables.
> > >
> > > The function ignores __get_cpuid result, so at least the
> > > FEAT1_REGISTER = 0; is needed before the first __get_cpuid.
> > > Do you mean the ecx2 = 0 initialization is useless because
> > > __get_cpuid (0, ...) on x86_64 will always succeed (especially
> > > when __get_cpuid (1, ...) had to succeed otherwise FEAT1_REGISTER
> > > would be zero)?
> > > I guess that is true, but won't that cause -Wmaybe-uninitialized warnings?
> >
> > Yes, if the __get_cpuid (1, ...) works OK, then we are sure that
> > __get_cpuid (0, ...) will also work.
> >
> > > I agree initializing family to 0 is not needed, but I don't understand
> > > why it isn't just
> > >   unsigned family = (eax >> 8) & 0x0f;
> > > Though, guess even that might fail with -Wmaybe-uninitialized too, as
> > > eax isn't unconditionally initialized.
> >
> > Perhaps we should check the result of __get_cpuid (1, ...) and use eax
> > only if the function returns 1? IMO, this would solve the
> > uninitialized issue, and we could use __cpuid in the second case (we
> > would know that leaf 0 is supported, because leaf 1 support was
> > checked with __get_cpuid (1, ...)).
>
> We know the code is ok if FEAT1_REGISTER = 0; is done before __get_cpuid (1, ...).
> Everything else is implied from it, all we need to ensure is that
> -Wmaybe-uninitialized is happy about it.
> Whatever doesn't report the warning and ideally doesn't increase the size of
> the function.
> I think the reason it is written the way it is before the AVX hacks in it
> is that we need to handle even the case when __get_cpuid (1, ...) returns 0,
> and we want in that case FEAT1_REGISTER = 0.
> So it could be

Yes, I think this is better, see below.

>   FEAT1_REGISTER = 0;
> #ifdef __x86_64__
>   if (__get_cpuid (1, &eax, &ebx, &ecx, &edx)
>   && (FEAT1_REGISTER & (bit_AVX | bit_CMPXCHG16B))
>   == (bit_AVX | bit_CMPXCHG16B))
> {

Here we can simply use

unsigned int family = (eax >> 8) & 0x0f;
unsigned int ecx2;

__cpuid (0, eax, ebx, ecx2, edx);

if (ecx2 ...)

> ...
> }
> #else
>   __get_cpuid (1, &eax, &ebx, &ecx, &edx);
> #endif
> etc.
>
> Jakub

Uros.


Re: [PATCH v2] [libatomic]: Handle AVX+CX16 ZHAOXIN like intel for 16b atomic [PR104688]

2024-07-18 Thread Uros Bizjak
On Thu, Jul 18, 2024 at 9:50 AM Jakub Jelinek  wrote:
>
> On Thu, Jul 18, 2024 at 09:34:14AM +0200, Uros Bizjak wrote:
> > > > +  unsigned int ecx2 = 0, family = 0;
> >
> > No need to initialize these two variables.
>
> The function ignores __get_cpuid result, so at least the
> FEAT1_REGISTER = 0; is needed before the first __get_cpuid.
> Do you mean the ecx2 = 0 initialization is useless because
> __get_cpuid (0, ...) on x86_64 will always succeed (especially
> when __get_cpuid (1, ...) had to succeed otherwise FEAT1_REGISTER
> would be zero)?
> I guess that is true, but won't that cause -Wmaybe-uninitialized warnings?

Yes, if the __get_cpuid (1, ...) works OK, then we are sure that
__get_cpuid (0, ...) will also work.

> I agree initializing family to 0 is not needed, but I don't understand
> why it isn't just
>   unsigned family = (eax >> 8) & 0x0f;
> Though, guess even that might fail with -Wmaybe-uninitialized too, as
> eax isn't unconditionally initialized.

Perhaps we should check the result of __get_cpuid (1, ...) and use eax
only if the function returns 1? IMO, this would solve the
uninitialized issue, and we could use __cpuid in the second case (we
would know that leaf 0 is supported, because leaf 1 support was
checked with __get_cpuid (1, ...)).

Uros.


Re: [PATCH v2] [libatomic]: Handle AVX+CX16 ZHAOXIN like intel for 16b atomic [PR104688]

2024-07-18 Thread Uros Bizjak
On Thu, Jul 18, 2024 at 9:29 AM Jakub Jelinek  wrote:
>
> On Thu, Jul 18, 2024 at 03:23:05PM +0800, MayShao-oc wrote:
> > From: mayshao 
> >
> > Hi Jakub:
> >
> > Thanks for your review,We should just amend this to handle Zhaoxin.
> >
> > Bootstrapped /regtested X86_64.
> >
> > Ok for trunk?
> > BR
> > Mayshao
> >
> > libatomic/ChangeLog:
> >
> >   PR target/104688
> >   * config/x86/init.c (__libat_feat1_init): Don't clear
> >   bit_AVX on ZHAOXIN CPUs.
> > ---
> >  libatomic/config/x86/init.c | 13 -
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/libatomic/config/x86/init.c b/libatomic/config/x86/init.c
> > index a75be3f175c..0d6864909bb 100644
> > --- a/libatomic/config/x86/init.c
> > +++ b/libatomic/config/x86/init.c
> > @@ -39,12 +39,15 @@ __libat_feat1_init (void)
> >== (bit_AVX | bit_CMPXCHG16B))
> >  {
> > >/* Intel SDM guarantees that 16-byte VMOVDQA on 16-byte aligned address
> > -  is atomic, and AMD is going to do something similar soon.
> > -  We don't have a guarantee from vendors of other CPUs with AVX,
> > -  like Zhaoxin and VIA.  */
> > -  unsigned int ecx2 = 0;
> > > +  is atomic, and AMD is going to do something similar soon. Zhaoxin also
>
> Two spaces before Zhaoxin (and also should go on another line).
>
> > +  guarantees this. We don't have a guarantee from vendors of other CPUs
>
> Two spaces before We (and again, the line will be too long).
>
> > +  with AVX,like VIA.  */
>
> Space before like
>
> > +  unsigned int ecx2 = 0, family = 0;

No need to initialize these two variables. Please also add one line of
vertical space after variable declarations.

OK with the above change and with Jakub's proposed formatting changes.

Thanks,
Uros.

> > +  family = (eax >> 8) & 0x0f;
> >__get_cpuid (0, &eax, &ebx, &ecx2, &edx);
> > -  if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx)
> > +  if (ecx2 != signature_INTEL_ecx && ecx2 != signature_AMD_ecx
>
> If the whole condition can't fit on one line, then each subcondition should
> be on a separate line, so linebreak + indentation should be added also
> before && ecx2 != signature_AMD_ecx
>
> > +  && !(ecx2 == signature_CENTAUR_ecx && family > 0x6)
> > +  && ecx2 != signature_SHANGHAI_ecx)
> >   FEAT1_REGISTER &= ~bit_AVX;
> >  }
> >  #endif
> > --
> > 2.27.0
>
> Otherwise LGTM, but please give Uros a day or two to chime in.
>
> Jakub
>


Re: [PATCH v2] i386: Fix testcases generating invalid asm

2024-07-17 Thread Uros Bizjak
On Thu, Jul 18, 2024 at 8:52 AM Haochen Jiang  wrote:
>
> Hi all,
>
> I revised the patch according to the comment.
>
> Ok for trunk?
>
> Thx,
> Haochen
>
> ---
>
> Changes in v2: Add suffix for mov to make the test more robust.
>
> ---
>
> For compile tests, we should generate valid asm except for special purposes.
> Fix the compile tests that generate invalid asm.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/apx-egprs-names.c: Use ax for short and
> al for char instead of eax.
> * gcc.target/i386/avx512bw-kandnq-1.c: Do not run the test
> under -m32 since kmovq with a register operand is invalid
> there.  Use long long so that a 64-bit register instead of
> a 32-bit register is used for kmovq.
> * gcc.target/i386/avx512bw-kandq-1.c: Ditto.
> * gcc.target/i386/avx512bw-knotq-1.c: Ditto.
> * gcc.target/i386/avx512bw-korq-1.c: Ditto.
> * gcc.target/i386/avx512bw-kshiftlq-1.c: Ditto.
> * gcc.target/i386/avx512bw-kshiftrq-1.c: Ditto.
> * gcc.target/i386/avx512bw-kxnorq-1.c: Ditto.
> * gcc.target/i386/avx512bw-kxorq-1.c: Ditto.

LGTM.

Thanks,
Uros.

> ---
>  gcc/testsuite/gcc.target/i386/apx-egprs-names.c | 8 
>  gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c   | 6 +++---
>  gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c| 6 +++---
>  gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c| 4 ++--
>  gcc/testsuite/gcc.target/i386/avx512bw-korq-1.c | 6 +++---
>  gcc/testsuite/gcc.target/i386/avx512bw-kshiftlq-1.c | 4 ++--
>  gcc/testsuite/gcc.target/i386/avx512bw-kshiftrq-1.c | 4 ++--
>  gcc/testsuite/gcc.target/i386/avx512bw-kxnorq-1.c   | 6 +++---
>  gcc/testsuite/gcc.target/i386/avx512bw-kxorq-1.c| 6 +++---
>  9 files changed, 25 insertions(+), 25 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/apx-egprs-names.c 
> b/gcc/testsuite/gcc.target/i386/apx-egprs-names.c
> index f0517e47c33..917ef505495 100644
> --- a/gcc/testsuite/gcc.target/i386/apx-egprs-names.c
> +++ b/gcc/testsuite/gcc.target/i386/apx-egprs-names.c
> @@ -10,8 +10,8 @@ void foo ()
>register int b __asm ("r30");
>register short c __asm ("r29");
>register char d __asm ("r28");
> -  __asm__ __volatile__ ("mov %0, %%rax" : : "r" (a) : "rax");
> -  __asm__ __volatile__ ("mov %0, %%eax" : : "r" (b) : "eax");
> -  __asm__ __volatile__ ("mov %0, %%eax" : : "r" (c) : "eax");
> -  __asm__ __volatile__ ("mov %0, %%eax" : : "r" (d) : "eax");
> +  __asm__ __volatile__ ("movq %0, %%rax" : : "r" (a) : "rax");
> +  __asm__ __volatile__ ("movl %0, %%eax" : : "r" (b) : "eax");
> +  __asm__ __volatile__ ("movw %0, %%ax" : : "r" (c) : "ax");
> +  __asm__ __volatile__ ("movb %0, %%al" : : "r" (d) : "al");
>  }
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c 
> b/gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c
> index e8b7a5f9aa2..f9f03c90782 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile } */
> +/* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-mavx512bw -O2" } */
>  /* { dg-final { scan-assembler-times "kandnq\[ 
> \\t\]+\[^\{\n\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
>
> @@ -10,8 +10,8 @@ avx512bw_test ()
>__mmask64 k1, k2, k3;
>volatile __m512i x = _mm512_setzero_si512 ();
>
> -  __asm__( "kmovq %1, %0" : "=k" (k1) : "r" (1) );
> -  __asm__( "kmovq %1, %0" : "=k" (k2) : "r" (2) );
> +  __asm__( "kmovq %1, %0" : "=k" (k1) : "r" (1ULL) );
> +  __asm__( "kmovq %1, %0" : "=k" (k2) : "r" (2ULL) );
>
>k3 = _kandn_mask64 (k1, k2);
>x = _mm512_mask_add_epi8 (x, k3, x, x);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c 
> b/gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c
> index a1aaed67c66..6ad836087ad 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile } */
> +/* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-mavx512bw -O2" } */
>  /* { dg-final { scan-assembler-times "kandq\[ 
> \\t\]+\[^\{\n\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
>
> @@ -10,8 +10,8 @@ avx512bw_test ()
>__mmask64 k1, k2, k3;
>volatile __m512i x = _mm512_setzero_epi32();
>
> -  __asm__( "kmovq %1, %0" : "=k" (k1) : "r" (1) );
> -  __asm__( "kmovq %1, %0" : "=k" (k2) : "r" (2) );
> +  __asm__( "kmovq %1, %0" : "=k" (k1) : "r" (1ULL) );
> +  __asm__( "kmovq %1, %0" : "=k" (k2) : "r" (2ULL) );
>
>k3 = _kand_mask64 (k1, k2);
>x = _mm512_mask_add_epi8 (x, k3, x, x);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c 
> b/gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c
> index deb65795760..341bbc03847 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile } */
> +/* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-mavx512bw -O2"

Re: [PATCH v2] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-17 Thread Uros Bizjak
On Thu, Jul 18, 2024 at 3:35 AM liuhongt  wrote:
>
> > Also, in case the insn is deleted, do:
> >
> > emit_note (NOTE_INSN_DELETED);
> >
> > DONE;
> >
> > instead of leaving (const_int 0) in the stream.
> >
> > So, the above insn preparation statements should read:
> >
> > --cut here--
> > if (constm1_operand (operands[2], mode))
> >   emit_move_insn (operands[0], operands[1]);
> > else
> >   emit_note (NOTE_INSN_DELETED);
> >
> > DONE;
> > --cut here--
> Changed.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/115843
> * config/i386/predicates.md (const0_or_m1_operand): New
> predicate.
> * config/i386/sse.md (*_store_mask_1): New
> pre_reload define_insn_and_split.
> (V): Add V32BF,V16BF,V8BF.
> (V4SF_V8HF): Rename to ..
> (V24F_128): .. this.
> (*vec_concat): Adjust with V24F_128.
> (*vec_concat_0): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr115843.c: New test.

LGTM.

Thanks,
Uros.

> ---
>  gcc/config/i386/predicates.md|  5 
>  gcc/config/i386/sse.md   | 33 
>  gcc/testsuite/gcc.target/i386/pr115843.c | 38 
>  3 files changed, 70 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr115843.c
>
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index 5d0bb1e0f54..680594871de 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -825,6 +825,11 @@ (define_predicate "constm1_operand"
>(and (match_code "const_int")
> (match_test "op == constm1_rtx")))
>
> +;; Match 0 or -1.
> +(define_predicate "const0_or_m1_operand"
> +  (ior (match_operand 0 "const0_operand")
> +   (match_operand 0 "constm1_operand")))
> +
>  ;; Match exactly eight.
>  (define_predicate "const8_operand"
>(and (match_code "const_int")
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index e44822f705b..f54e966bdbb 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -294,6 +294,7 @@ (define_mode_iterator V
> (V16SI "TARGET_AVX512F && TARGET_EVEX512") (V8SI "TARGET_AVX") V4SI
> (V8DI "TARGET_AVX512F && TARGET_EVEX512")  (V4DI "TARGET_AVX") V2DI
> (V32HF "TARGET_AVX512F && TARGET_EVEX512") (V16HF "TARGET_AVX") V8HF
> +   (V32BF "TARGET_AVX512F && TARGET_EVEX512") (V16BF "TARGET_AVX") V8BF
> (V16SF "TARGET_AVX512F && TARGET_EVEX512") (V8SF "TARGET_AVX") V4SF
> (V8DF "TARGET_AVX512F && TARGET_EVEX512")  (V4DF "TARGET_AVX") (V2DF 
> "TARGET_SSE2")])
>
> @@ -430,8 +431,8 @@ (define_mode_iterator VFB_512
> (V16SF "TARGET_EVEX512")
> (V8DF "TARGET_EVEX512")])
>
> -(define_mode_iterator V4SF_V8HF
> -  [V4SF V8HF])
> +(define_mode_iterator V24F_128
> +  [V4SF V8HF V8BF])
>
>  (define_mode_iterator VI48_AVX512VL
>[(V16SI "TARGET_EVEX512") (V8SI "TARGET_AVX512VL") (V4SI "TARGET_AVX512VL")
> @@ -11543,8 +11544,8 @@ (define_insn "*vec_concatv2sf_sse"
> (set_attr "mode" "V4SF,SF,DI,DI")])
>
>  (define_insn "*vec_concat"
> -  [(set (match_operand:V4SF_V8HF 0 "register_operand"   "=x,v,x,v")
> -   (vec_concat:V4SF_V8HF
> +  [(set (match_operand:V24F_128 0 "register_operand"   "=x,v,x,v")
> +   (vec_concat:V24F_128
>   (match_operand: 1 "register_operand" " 0,v,0,v")
>   (match_operand: 2 "nonimmediate_operand" " 
> x,v,m,m")))]
>"TARGET_SSE"
> @@ -11559,8 +11560,8 @@ (define_insn "*vec_concat"
> (set_attr "mode" "V4SF,V4SF,V2SF,V2SF")])
>
>  (define_insn "*vec_concat_0"
> -  [(set (match_operand:V4SF_V8HF 0 "register_operand"   "=v")
> -   (vec_concat:V4SF_V8HF
> +  [(set (match_operand:V24F_128 0 "register_operand"   "=v")
> +   (vec_concat:V24F_128
>   (match_operand: 1 "nonimmediate_operand" "vm")
>   (match_operand: 2 "const0_operand")))]
>"TARGET_SSE2"
> @@ -28574,6 +28575,26 @@ (define_insn "_store_mask"
> (set_attr "memory" "store")
> (set_attr "mode" "")])
>
> +(define_insn_and_split "*_store_mask_1"
> +  [(set (match_operand:V 0 "memory_operand")
> +   (unspec:V
> + [(match_operand:V 1 "register_operand")
> +  (match_dup 0)
> +  (match_operand: 2 "const0_or_m1_operand")]
> + UNSPEC_MASKMOV))]
> +  "TARGET_AVX512F && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(const_int 0)]
> +{
> +  if (constm1_operand (operands[2], mode))
> +emit_move_insn (operands[0], operands[1]);
> +  else
> +emit_note (NOTE_INSN_DELETED);
> +
> +  DONE;
> +})
> +
>  (define_expand "cbranch4"
>[(set (reg:CC FLAGS_REG)
> (compare:CC (match_operand:VI_AVX_AVX512F 1 "register_operand")
> diff --git a/gcc/testsuite/gcc.target/i386/pr115843.c 
> b/gcc/testsuite/gcc.target/i386/pr115843.c
> new file mode 100644
> index 000..00d8605757a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr115843.c
> 

Re: [PATCH] i386: Fix testcases generating invalid asm

2024-07-17 Thread Uros Bizjak
On Thu, Jul 18, 2024 at 3:46 AM Haochen Jiang  wrote:
>
> Hi all,
>
> For compile tests, we should generate valid asm except for special purposes.
> Fix the compile tests that generate invalid asm.
>
> Regtested on x86-64-pc-linux-gnu. Ok for trunk?
>
> Thx,
> Haochen
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/apx-egprs-names.c: Use ax for short and
> al for char instead of eax.
> * gcc.target/i386/avx512bw-kandnq-1.c: Do not run the test
> under -m32 since kmovq with a register operand is invalid
> there.  Use long long so that a 64-bit register instead of
> a 32-bit register is used for kmovq.
> * gcc.target/i386/avx512bw-kandq-1.c: Ditto.
> * gcc.target/i386/avx512bw-knotq-1.c: Ditto.
> * gcc.target/i386/avx512bw-korq-1.c: Ditto.
> * gcc.target/i386/avx512bw-kshiftlq-1.c: Ditto.
> * gcc.target/i386/avx512bw-kshiftrq-1.c: Ditto.
> * gcc.target/i386/avx512bw-kxnorq-1.c: Ditto.
> * gcc.target/i386/avx512bw-kxorq-1.c: Ditto.
> ---
>  gcc/testsuite/gcc.target/i386/apx-egprs-names.c | 4 ++--
>  gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c   | 6 +++---
>  gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c| 6 +++---
>  gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c| 4 ++--
>  gcc/testsuite/gcc.target/i386/avx512bw-korq-1.c | 6 +++---
>  gcc/testsuite/gcc.target/i386/avx512bw-kshiftlq-1.c | 4 ++--
>  gcc/testsuite/gcc.target/i386/avx512bw-kshiftrq-1.c | 4 ++--
>  gcc/testsuite/gcc.target/i386/avx512bw-kxnorq-1.c   | 6 +++---
>  gcc/testsuite/gcc.target/i386/avx512bw-kxorq-1.c| 6 +++---
>  9 files changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/apx-egprs-names.c 
> b/gcc/testsuite/gcc.target/i386/apx-egprs-names.c
> index f0517e47c33..5b342aa385b 100644
> --- a/gcc/testsuite/gcc.target/i386/apx-egprs-names.c
> +++ b/gcc/testsuite/gcc.target/i386/apx-egprs-names.c
> @@ -12,6 +12,6 @@ void foo ()
>register char d __asm ("r28");
>__asm__ __volatile__ ("mov %0, %%rax" : : "r" (a) : "rax");
>__asm__ __volatile__ ("mov %0, %%eax" : : "r" (b) : "eax");
> -  __asm__ __volatile__ ("mov %0, %%eax" : : "r" (c) : "eax");
> -  __asm__ __volatile__ ("mov %0, %%eax" : : "r" (d) : "eax");
> +  __asm__ __volatile__ ("mov %0, %%ax" : : "r" (c) : "ax");
> +  __asm__ __volatile__ ("mov %0, %%al" : : "r" (d) : "al");

You can use the insn suffix (movq, movl, movw and movb) to make the
asm even more robust.

Uros.

>  }
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c 
> b/gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c
> index e8b7a5f9aa2..f9f03c90782 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bw-kandnq-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile } */
> +/* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-mavx512bw -O2" } */
>  /* { dg-final { scan-assembler-times "kandnq\[ 
> \\t\]+\[^\{\n\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
>
> @@ -10,8 +10,8 @@ avx512bw_test ()
>__mmask64 k1, k2, k3;
>volatile __m512i x = _mm512_setzero_si512 ();
>
> -  __asm__( "kmovq %1, %0" : "=k" (k1) : "r" (1) );
> -  __asm__( "kmovq %1, %0" : "=k" (k2) : "r" (2) );
> +  __asm__( "kmovq %1, %0" : "=k" (k1) : "r" (1ULL) );
> +  __asm__( "kmovq %1, %0" : "=k" (k2) : "r" (2ULL) );
>
>k3 = _kandn_mask64 (k1, k2);
>x = _mm512_mask_add_epi8 (x, k3, x, x);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c 
> b/gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c
> index a1aaed67c66..6ad836087ad 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bw-kandq-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile } */
> +/* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-mavx512bw -O2" } */
>  /* { dg-final { scan-assembler-times "kandq\[ 
> \\t\]+\[^\{\n\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
>
> @@ -10,8 +10,8 @@ avx512bw_test ()
>__mmask64 k1, k2, k3;
>volatile __m512i x = _mm512_setzero_epi32();
>
> -  __asm__( "kmovq %1, %0" : "=k" (k1) : "r" (1) );
> -  __asm__( "kmovq %1, %0" : "=k" (k2) : "r" (2) );
> +  __asm__( "kmovq %1, %0" : "=k" (k1) : "r" (1ULL) );
> +  __asm__( "kmovq %1, %0" : "=k" (k2) : "r" (2ULL) );
>
>k3 = _kand_mask64 (k1, k2);
>x = _mm512_mask_add_epi8 (x, k3, x, x);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c 
> b/gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c
> index deb65795760..341bbc03847 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bw-knotq-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile } */
> +/* { dg-do compile { target { ! ia32 } } } */
>  /* { dg-options "-mavx512bw -O2" } */
>  /* { dg-final { scan-assembler-times "knotq\[ 
> \\t\]+\[^\{\n\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
>
> @@ -10,7 +10,7 @@ avx512bw_test ()
>__mmask64 k1, k2;
>volatile __m512i x = _mm512_setzero_si512 ();
>
> -  __a

[committed] alpha: Fix duplicate !tlsgd!62 assemble error [PR115526]

2024-07-17 Thread Uros Bizjak
Add the missing "cannot_copy" attribute to instructions that have to
stay in 1-1 correspondence with another insn: their templates emit
relocation sequence numbers such as !tlsgd!N that must appear exactly
once per pairing, so an RTL pass duplicating one of these insns
produces the duplicate !tlsgd!62 assemble error from the PR.

PR target/115526

gcc/ChangeLog:

* config/alpha/alpha.md (movdi_er_high_g): Add cannot_copy attribute.
(movdi_er_tlsgd): Ditto.
(movdi_er_tlsldm): Ditto.
(call_value_osf_): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/alpha/pr115526.c: New test.

Tested by Maciej on Alpha/Linux target and reported in the PR.

Uros.
diff --git a/gcc/config/alpha/alpha.md b/gcc/config/alpha/alpha.md
index 1e2de5a4d15..bd92392878e 100644
--- a/gcc/config/alpha/alpha.md
+++ b/gcc/config/alpha/alpha.md
@@ -3902,7 +3902,8 @@ (define_insn "movdi_er_high_g"
   else
 return "ldq %0,%2(%1)\t\t!literal!%3";
 }
-  [(set_attr "type" "ldsym")])
+  [(set_attr "type" "ldsym")
+   (set_attr "cannot_copy" "true")])
 
 (define_split
   [(set (match_operand:DI 0 "register_operand")
@@ -3926,7 +3927,8 @@ (define_insn "movdi_er_tlsgd"
 return "lda %0,%2(%1)\t\t!tlsgd";
   else
 return "lda %0,%2(%1)\t\t!tlsgd!%3";
-})
+}
+  [(set_attr "cannot_copy" "true")])
 
 (define_insn "movdi_er_tlsldm"
   [(set (match_operand:DI 0 "register_operand" "=r")
@@ -3939,7 +3941,8 @@ (define_insn "movdi_er_tlsldm"
 return "lda %0,%&(%1)\t\t!tlsldm";
   else
 return "lda %0,%&(%1)\t\t!tlsldm!%2";
-})
+}
+  [(set_attr "cannot_copy" "true")])
 
 (define_insn "*movdi_er_gotdtp"
   [(set (match_operand:DI 0 "register_operand" "=r")
@@ -5908,6 +5911,7 @@ (define_insn "call_value_osf_"
   "HAVE_AS_TLS"
   "ldq $27,%1($29)\t\t!literal!%2\;jsr $26,($27),%1\t\t!lituse_!%2\;ldah 
$29,0($26)\t\t!gpdisp!%*\;lda $29,0($29)\t\t!gpdisp!%*"
   [(set_attr "type" "jsr")
+   (set_attr "cannot_copy" "true")
(set_attr "length" "16")])
 
 ;; We must use peep2 instead of a split because we need accurate life
diff --git a/gcc/testsuite/gcc.target/alpha/pr115526.c 
b/gcc/testsuite/gcc.target/alpha/pr115526.c
new file mode 100644
index 000..2f57903fec3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/alpha/pr115526.c
@@ -0,0 +1,46 @@
+/* PR target/115526 */
+/* { dg-do assemble } */
+/* { dg-options "-O2 -Wno-attributes -fvisibility=hidden -fPIC -mcpu=ev4" } */
+
+struct _ts {
+  struct _dtoa_state *interp;
+};
+struct Bigint {
+  int k;
+} *_Py_dg_strtod_bs;
+struct _dtoa_state {
+  struct Bigint p5s;
+  struct Bigint *freelist[];
+};
+extern _Thread_local struct _ts _Py_tss_tstate;
+typedef struct Bigint Bigint;
+int pow5mult_k;
+long _Py_dg_strtod_ndigits;
+void PyMem_Free();
+void Bfree(Bigint *v) {
+  if (v)
+{
+  if (v->k)
+   PyMem_Free();
+  else {
+   struct _dtoa_state *interp = _Py_tss_tstate.interp;
+   interp->freelist[v->k] = v;
+  }
+}
+}
+static Bigint *pow5mult(Bigint *b) {
+  for (;;) {
+if (pow5mult_k & 1) {
+  Bfree(b);
+  if (b == 0)
+return 0;
+}
+if (!(pow5mult_k >>= 1))
+  break;
+  }
+  return 0;
+}
+void _Py_dg_strtod() {
+  if (_Py_dg_strtod_ndigits)
+pow5mult(_Py_dg_strtod_bs);
+}


Re: [PATCH] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-17 Thread Uros Bizjak
On Wed, Jul 17, 2024 at 8:54 AM Liu, Hongtao  wrote:
>
>
>
> > -Original Message-
> > From: Uros Bizjak 
> > Sent: Wednesday, July 17, 2024 2:52 PM
> > To: Liu, Hongtao 
> > Cc: gcc-patches@gcc.gnu.org; crazy...@gmail.com; hjl.to...@gmail.com
> > Subject: Re: [PATCH] [x86][avx512] Optimize maskstore when mask is 0 or -1
> > in UNSPEC_MASKMOV
> >
> > On Wed, Jul 17, 2024 at 3:27 AM liuhongt  wrote:
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ready to push to trunk.
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/115843
> > > * config/i386/predicates.md (const0_or_m1_operand): New
> > > predicate.
> > > * config/i386/sse.md (*_store_mask_1): New
> > > pre_reload define_insn_and_split.
> > > (V): Add V32BF,V16BF,V8BF.
> > > (V4SF_V8HF): Rename to ..
> > > (V24F_128): .. this.
> > > (*vec_concat): Adjust with V24F_128.
> > > (*vec_concat_0): Ditto.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/i386/pr115843.c: New test.
> > > ---
> > >  gcc/config/i386/predicates.md|  5 
> > >  gcc/config/i386/sse.md   | 32 
> > >  gcc/testsuite/gcc.target/i386/pr115843.c | 38 
> > >  3 files changed, 69 insertions(+), 6 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr115843.c
> > >
> > > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> > > index 5d0bb1e0f54..680594871de 100644
> > > --- a/gcc/config/i386/predicates.md
> > > +++ b/gcc/config/i386/predicates.md
> > > @@ -825,6 +825,11 @@ (define_predicate "constm1_operand"
> > >(and (match_code "const_int")
> > > (match_test "op == constm1_rtx")))
> > >
> > > +;; Match 0 or -1.
> > > +(define_predicate "const0_or_m1_operand"
> > > +  (ior (match_operand 0 "const0_operand")
> > > +   (match_operand 0 "constm1_operand")))
> > > +
> > >  ;; Match exactly eight.
> > >  (define_predicate "const8_operand"
> > >(and (match_code "const_int")
> > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > index e44822f705b..e11610f4b88 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -294,6 +294,7 @@ (define_mode_iterator V
> > > (V16SI "TARGET_AVX512F && TARGET_EVEX512") (V8SI "TARGET_AVX") V4SI
> > > (V8DI "TARGET_AVX512F && TARGET_EVEX512")  (V4DI "TARGET_AVX") V2DI
> > > (V32HF "TARGET_AVX512F && TARGET_EVEX512") (V16HF "TARGET_AVX") V8HF
> > > +   (V32BF "TARGET_AVX512F && TARGET_EVEX512") (V16BF "TARGET_AVX") V8BF
> > > (V16SF "TARGET_AVX512F && TARGET_EVEX512") (V8SF "TARGET_AVX") V4SF
> > > (V8DF "TARGET_AVX512F && TARGET_EVEX512")  (V4DF "TARGET_AVX")
> > > (V2DF "TARGET_SSE2")])
> > >
> > > @@ -430,8 +431,8 @@ (define_mode_iterator VFB_512
> > > (V16SF "TARGET_EVEX512")
> > > (V8DF "TARGET_EVEX512")])
> > >
> > > -(define_mode_iterator V4SF_V8HF
> > > -  [V4SF V8HF])
> > > +(define_mode_iterator V24F_128
> > > +  [V4SF V8HF V8BF])
> > >
> > >  (define_mode_iterator VI48_AVX512VL
> > >    [(V16SI "TARGET_EVEX512") (V8SI "TARGET_AVX512VL") (V4SI "TARGET_AVX512VL")
> > > @@ -11543,8 +11544,8 @@ (define_insn "*vec_concatv2sf_sse"
> > > (set_attr "mode" "V4SF,SF,DI,DI")])
> > >
> > >  (define_insn "*vec_concat<mode>"
> > > -  [(set (match_operand:V4SF_V8HF 0 "register_operand"   "=x,v,x,v")
> > > -   (vec_concat:V4SF_V8HF
> > > +  [(set (match_operand:V24F_128 0 "register_operand"   "=x,v,x,v")
> > > +   (vec_concat:V24F_128
> > >   (match_operand:<ssehalfvecmode> 1 "register_operand" " 0,v,0,v")
> > >   (match_operand:<ssehalfvecmode> 2 "nonimm

Re: [PATCH] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-16 Thread Uros Bizjak
On Wed, Jul 17, 2024 at 3:27 AM liuhongt  wrote:
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ready push to trunk.
>
> gcc/ChangeLog:
>
> PR target/115843
> * config/i386/predicates.md (const0_or_m1_operand): New
> predicate.
> * config/i386/sse.md (*<avx512>_store<mode>_mask_1): New
> pre_reload define_insn_and_split.
> (V): Add V32BF,V16BF,V8BF.
> (V4SF_V8HF): Rename to ..
> (V24F_128): .. this.
> (*vec_concat<mode>): Adjust with V24F_128.
> (*vec_concat<mode>_0): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr115843.c: New test.
> ---
>  gcc/config/i386/predicates.md|  5 
>  gcc/config/i386/sse.md   | 32 
>  gcc/testsuite/gcc.target/i386/pr115843.c | 38 
>  3 files changed, 69 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr115843.c
>
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index 5d0bb1e0f54..680594871de 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -825,6 +825,11 @@ (define_predicate "constm1_operand"
>(and (match_code "const_int")
> (match_test "op == constm1_rtx")))
>
> +;; Match 0 or -1.
> +(define_predicate "const0_or_m1_operand"
> +  (ior (match_operand 0 "const0_operand")
> +   (match_operand 0 "constm1_operand")))
> +
>  ;; Match exactly eight.
>  (define_predicate "const8_operand"
>(and (match_code "const_int")
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index e44822f705b..e11610f4b88 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -294,6 +294,7 @@ (define_mode_iterator V
> (V16SI "TARGET_AVX512F && TARGET_EVEX512") (V8SI "TARGET_AVX") V4SI
> (V8DI "TARGET_AVX512F && TARGET_EVEX512")  (V4DI "TARGET_AVX") V2DI
> (V32HF "TARGET_AVX512F && TARGET_EVEX512") (V16HF "TARGET_AVX") V8HF
> +   (V32BF "TARGET_AVX512F && TARGET_EVEX512") (V16BF "TARGET_AVX") V8BF
> (V16SF "TARGET_AVX512F && TARGET_EVEX512") (V8SF "TARGET_AVX") V4SF
> (V8DF "TARGET_AVX512F && TARGET_EVEX512")  (V4DF "TARGET_AVX") (V2DF 
> "TARGET_SSE2")])
>
> @@ -430,8 +431,8 @@ (define_mode_iterator VFB_512
> (V16SF "TARGET_EVEX512")
> (V8DF "TARGET_EVEX512")])
>
> -(define_mode_iterator V4SF_V8HF
> -  [V4SF V8HF])
> +(define_mode_iterator V24F_128
> +  [V4SF V8HF V8BF])
>
>  (define_mode_iterator VI48_AVX512VL
>[(V16SI "TARGET_EVEX512") (V8SI "TARGET_AVX512VL") (V4SI "TARGET_AVX512VL")
> @@ -11543,8 +11544,8 @@ (define_insn "*vec_concatv2sf_sse"
> (set_attr "mode" "V4SF,SF,DI,DI")])
>
>  (define_insn "*vec_concat<mode>"
> -  [(set (match_operand:V4SF_V8HF 0 "register_operand"   "=x,v,x,v")
> -   (vec_concat:V4SF_V8HF
> +  [(set (match_operand:V24F_128 0 "register_operand"   "=x,v,x,v")
> +   (vec_concat:V24F_128
>   (match_operand:<ssehalfvecmode> 1 "register_operand" " 0,v,0,v")
>   (match_operand:<ssehalfvecmode> 2 "nonimmediate_operand" " x,v,m,m")))]
>"TARGET_SSE"
> @@ -11559,8 +11560,8 @@ (define_insn "*vec_concat<mode>"
> (set_attr "mode" "V4SF,V4SF,V2SF,V2SF")])
>
>  (define_insn "*vec_concat<mode>_0"
> -  [(set (match_operand:V4SF_V8HF 0 "register_operand"   "=v")
> -   (vec_concat:V4SF_V8HF
> +  [(set (match_operand:V24F_128 0 "register_operand"   "=v")
> +   (vec_concat:V24F_128
>   (match_operand:<ssehalfvecmode> 1 "nonimmediate_operand" "vm")
>   (match_operand:<ssehalfvecmode> 2 "const0_operand")))]
>"TARGET_SSE2"
> @@ -28574,6 +28575,25 @@ (define_insn "<avx512>_store<mode>_mask"
> (set_attr "memory" "store")
> (set_attr "mode" "<sseinsnmode>")])
>
> +(define_insn_and_split "*<avx512>_store<mode>_mask_1"
> +  [(set (match_operand:V 0 "memory_operand")
> +   (unspec:V
> + [(match_operand:V 1 "register_operand")
> +  (match_dup 0)
> +  (match_operand:<avx512fmaskmode> 2 "const0_or_m1_operand")]
> + UNSPEC_MASKMOV))]
> +  "TARGET_AVX512F"

Please add "ix86_pre_reload_split ()" condition to insn constraint for
instructions that have to be split before reload.

Uros.

> +  "#"
> +  "&& 1"
> +  [(const_int 0)]
> +{
> +  if (constm1_operand (operands[2], <avx512fmaskmode>mode))
> +  {
> +    emit_move_insn (operands[0], operands[1]);
> +    DONE;
> +  }
> +})
> +
>  (define_expand "cbranch<mode>4"
>[(set (reg:CC FLAGS_REG)
> (compare:CC (match_operand:VI_AVX_AVX512F 1 "register_operand")
> diff --git a/gcc/testsuite/gcc.target/i386/pr115843.c 
> b/gcc/testsuite/gcc.target/i386/pr115843.c
> new file mode 100644
> index 000..00d8605757a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr115843.c
> @@ -0,0 +1,38 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx512vl --param vect-partial-vector-usage=2 -mtune=znver5 -mprefer-vector-width=512" } */
> +/* { dg-final { scan-assembler-not "kxor\[bw]" } } */
> +
> +typedef unsigned long long BITBOARD;
> +BITBOARD KingPressureMask1[64], KingSafetyMask1[64];
> +
> +void __attribute__((noinline))
> +foo()
> 

Re: [x86 PATCH] Tweak i386-expand.cc to restore bootstrap on RHEL.

2024-07-14 Thread Uros Bizjak
On Sun, Jul 14, 2024 at 3:42 PM Roger Sayle  wrote:
>
>
> This is a minor change to restore bootstrap on systems using gcc 4.8
> as a host compiler.  The fatal error is:
>
> In file included from gcc/gcc/coretypes.h:471:0,
>  from gcc/gcc/config/i386/i386-expand.cc:23:
> gcc/gcc/config/i386/i386-expand.cc: In function 'void
> ix86_expand_fp_absneg_operator(rtx_code, machine_mode, rtx_def**)':
> ./insn-modes.h:315:75: error: temporary of non-literal type
> 'scalar_float_mode' in a constant expression
>  #define HFmode (scalar_float_mode ((scalar_float_mode::from_int) E_HFmode))
>^
> gcc/gcc/config/i386/i386-expand.cc:2179:8: note: in expansion of macro
> 'HFmode'
>case HFmode:
> ^
>
>
> The solution is to use the E_?Fmode enumeration constants as case values
> in switch statements.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures (from this change).  Ok for mainline?
>
>
> 2024-07-14  Roger Sayle  
>
> * config/i386/i386-expand.cc (ix86_expand_fp_absneg_operator):
> Use E_?Fmode enumeration constants in switch statement.
> (ix86_expand_copysign): Likewise.
> (ix86_expand_xorsign): Likewise.

OK, also for backports.

Thanks,
Uros.

>
>
> Thanks in advance,
> Roger
> --
>


Re: [r15-1936 Regression] FAIL: gcc.target/i386/avx512vl-vpmovuswb-2.c execution test on Linux/x86_64

2024-07-10 Thread Uros Bizjak
On Wed, Jul 10, 2024 at 3:42 PM haochen.jiang
 wrote:
>
> On Linux/x86_64,
>
> 80e446e829d818dc19daa6e671b9626e93ee4949 is the first bad commit
> commit 80e446e829d818dc19daa6e671b9626e93ee4949
> Author: Pan Li 
> Date:   Fri Jul 5 20:36:35 2024 +0800
>
> Match: Support form 2 for the .SAT_TRUNC
>
> caused
>
> FAIL: gcc.target/i386/avx512f-vpmovusqb-2.c execution test
> FAIL: gcc.target/i386/avx512vl-vpmovusdb-2.c execution test
> FAIL: gcc.target/i386/avx512vl-vpmovusdw-2.c execution test
> FAIL: gcc.target/i386/avx512vl-vpmovusqb-2.c execution test
> FAIL: gcc.target/i386/avx512vl-vpmovusqd-2.c execution test
> FAIL: gcc.target/i386/avx512vl-vpmovusqw-2.c execution test
> FAIL: gcc.target/i386/avx512vl-vpmovuswb-2.c execution test

This is fixed by [1].

The consequence of a last-minute "impossible-to-fail-so-no-need-to-test" change.

Lesson learned.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-July/656898.html

Uros.


[committed] i386: Swap compare operands in ustrunc patterns

2024-07-10 Thread Uros Bizjak
A last minute change led to a wrong operand order in the compare insn:
the expander tests GEU on the carry flag, so the saturation constant has
to be the first compare operand for the carry to signal overflow.

gcc/ChangeLog:

* config/i386/i386.md (ustruncdi<mode>2): Swap compare operands.
(ustruncsi<mode>2): Ditto.
(ustrunchiqi2): Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index e2f30695d70..de9f4ba0496 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -9990,7 +9990,7 @@ (define_expand "ustruncdi<mode>2"
   rtx sat = force_reg (DImode, GEN_INT (GET_MODE_MASK (<MODE>mode)));
   rtx dst;
 
-  emit_insn (gen_cmpdi_1 (op1, sat));
+  emit_insn (gen_cmpdi_1 (sat, op1));
 
   if (TARGET_CMOVE)
 {
@@ -10026,7 +10026,7 @@ (define_expand "ustruncsi<mode>2"
   rtx sat = force_reg (SImode, GEN_INT (GET_MODE_MASK (<MODE>mode)));
   rtx dst;
 
-  emit_insn (gen_cmpsi_1 (op1, sat));
+  emit_insn (gen_cmpsi_1 (sat, op1));
 
   if (TARGET_CMOVE)
 {
@@ -10062,7 +10062,7 @@ (define_expand "ustrunchiqi2"
   rtx sat = force_reg (HImode, GEN_INT (GET_MODE_MASK (QImode)));
   rtx dst;
 
-  emit_insn (gen_cmphi_1 (op1, sat));
+  emit_insn (gen_cmphi_1 (sat, op1));
 
   if (TARGET_CMOVE)
 {


[PATCH] middle-end: Fix stalled swapped condition code value [PR115836]

2024-07-10 Thread Uros Bizjak
emit_store_flag_1 calculates scode (the swapped condition code) at the
beginning of the function from the value of the code variable.  However,
the code variable may change before the scode usage site, resulting in
an invalid, stale scode value.

Move the calculation of the scode value just before its only usage site
to avoid the stale value.
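
The failure mode, in a minimal stand-alone C sketch (hypothetical names,
not the GCC sources):

#include <assert.h>

enum cc { LT, GT };
static enum cc swap_cc (enum cc c) { return c == LT ? GT : LT; }

static enum cc stale_swap (enum cc code, int canonicalize)
{
  enum cc scode = swap_cc (code);   /* derived too early */
  if (canonicalize)
    code = swap_cc (code);          /* 'code' changes after the fact */
  return scode;                     /* stale whenever 'code' changed */
}

int main (void)
{
  /* The swapped code for the final (canonicalized) GT should be LT,
     but the early derivation still yields GT.  */
  assert (stale_swap (LT, 1) == GT);
  return 0;
}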

PR middle-end/115836

gcc/ChangeLog:

* expmed.cc (emit_store_flag_1): Move calculation of
scode just before its only usage site.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Also tested with original and minimized preprocessed source.
Unfortunately, even with the minimized source, the compilation takes
~5 minutes, and IMO such a trivial fix does not warrant that high
resource consumption.

OK for master and release branches?

Uros.
diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 8bbbc94a98c..154964bd068 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -5632,11 +5632,9 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx op0, rtx op1,
   enum insn_code icode;
   machine_mode compare_mode;
   enum mode_class mclass;
-  enum rtx_code scode;
 
   if (unsignedp)
 code = unsigned_condition (code);
-  scode = swap_condition (code);
 
   /* If one operand is constant, make it the second one.  Only do this
  if the other operand is not constant as well.  */
@@ -5751,6 +5749,8 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx op0, rtx op1,
 
  if (GET_MODE_CLASS (mode) == MODE_FLOAT)
{
+ enum rtx_code scode = swap_condition (code);
+
  tem = emit_cstore (target, icode, scode, mode, compare_mode,
 unsignedp, op1, op0, normalizep, target_mode);
  if (tem)


Re: [PATCH] [alpha] adjust MEM alignment for block move [PR115459] (was: Re: [PATCH v2] [PR100106] Reject unaligned subregs when strict alignment is required)

2024-07-10 Thread Uros Bizjak
On Thu, Jun 13, 2024 at 9:37 AM Alexandre Oliva  wrote:
>
> Hello, Maciej,
>
> On Jun 12, 2024, "Maciej W. Rozycki"  wrote:
>
> >  This has regressed building the `alpha-linux-gnu' target, in libada, as
> > from commit d6b756447cd5 including GCC 14 and up to current GCC 15 trunk:
>
> > | Error detected around g-debpoo.adb:1896:8|
>
> > I have filed PR #115459.
>
> Thanks!
>
> This was tricky to duplicate without access to an alpha-linux-gnu
> machine.  I ended up building an uberbaum tree with --disable-shared
> --disable-threads --enable-languages=ada up to all-target-libgcc, then I
> replaced gcc/collect2 with a wrapper script that dropped crt[1in].o and
> -lc, so that link tests in libada/configure would succeed without glibc
> for the target.  libada still wouldn't build, because of the missing
> glibc headers, but I could compile g-depboo.adb with -I pointing at a
> x86_64-linux-gnu's gcc/ada/rts build tree, and with that, at -O2, I
> could trigger the problem and investigate it.  And with the following
> patch, the problem seems to be gone.
>
> Maciej, would you be so kind as to give it a spin with a native
> regstrap?  TIA,
>
> Richard, is this ok to install if regstrapping succeeds?
>
>
> Before issuing loads or stores for a block move, adjust the MEM
> alignments if analysis of the addresses enabled the inference of
> stricter alignment.  This ensures that the MEMs are sufficiently
> aligned for the corresponding insns, which avoids trouble in case of
> e.g. substitutions into SUBREGs.
>
>
> for  gcc/ChangeLog
>
> PR target/115459
> * config/alpha/alpha.cc (alpha_expand_block_move): Adjust
> MEMs to match inferred alignment.

LGTM, based on a successful bootstrap/regtest report down the reply thread.

Thanks,
Uros.

> ---
>  gcc/config/alpha/alpha.cc |   12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/gcc/config/alpha/alpha.cc b/gcc/config/alpha/alpha.cc
> index 1126cea1f7ba2..e090e74b9d073 100644
> --- a/gcc/config/alpha/alpha.cc
> +++ b/gcc/config/alpha/alpha.cc
> @@ -3820,6 +3820,12 @@ alpha_expand_block_move (rtx operands[])
>else if (a >= 16 && c % 2 == 0)
> src_align = 16;
> }
> +
> +  if (MEM_P (orig_src) && MEM_ALIGN (orig_src) < src_align)
> +   {
> + orig_src = shallow_copy_rtx (orig_src);
> + set_mem_align (orig_src, src_align);
> +   }
>  }
>
>tmp = XEXP (orig_dst, 0);
> @@ -3841,6 +3847,12 @@ alpha_expand_block_move (rtx operands[])
>else if (a >= 16 && c % 2 == 0)
> dst_align = 16;
> }
> +
> +  if (MEM_P (orig_dst) && MEM_ALIGN (orig_dst) < dst_align)
> +   {
> + orig_dst = shallow_copy_rtx (orig_dst);
> + set_mem_align (orig_dst, dst_align);
> +   }
>  }
>
>ofs = 0;
>
>
> --
> Alexandre Oliva, happy hacker    https://FSFLA.org/blogs/lxo/
>Free Software Activist   GNU Toolchain Engineer
> More tolerance and less prejudice are key for inclusion and diversity
> Excluding neuro-others for not behaving ""normal"" is *not* inclusive


[committed] i386: Implement .SAT_TRUNC for unsigned integers

2024-07-09 Thread Uros Bizjak
The following testcase:

unsigned short foo (unsigned int x)
{
  _Bool overflow = x > (unsigned int)(unsigned short)(-1);
  return ((unsigned short)x | (unsigned short)-overflow);
}

currently compiles (-O2) to:

foo:
        xorl    %eax, %eax
        cmpl    $65535, %edi
        seta    %al
        negl    %eax
        orl     %edi, %eax
        ret

We can expand through ustrunc{m}{n}2 optab to use carry flag from the
comparison and generate code using SBB:

foo:
        cmpl    $65535, %edi
        sbbl    %eax, %eax
        orl     %edi, %eax
        ret

or CMOV instruction:

foo:
        movl    $65535, %eax
        cmpl    %eax, %edi
        cmovnc  %edi, %eax
        ret
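
The same recognition applies at the other narrowing widths the expanders
cover; an illustrative variant of the idiom (my rewrite for unsigned int
to unsigned char, not taken from the patch):

unsigned char qux (unsigned int x)
{
  _Bool overflow = x > (unsigned int)(unsigned char)(-1);
  return ((unsigned char)x | (unsigned char)-overflow);
}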

gcc/ChangeLog:

* config/i386/i386.md (@cmp<mode>_1): Use SWI mode iterator.
(ustruncdi<mode>2): New expander.
(ustruncsi<mode>2): Ditto.
(ustrunchiqi2): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/sattrunc-1.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 214cb2e239a..e2f30695d70 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1533,8 +1533,8 @@ (define_insn "@ccmp<mode>"
 
 (define_expand "@cmp<mode>_1"
   [(set (reg:CC FLAGS_REG)
-   (compare:CC (match_operand:SWI48 0 "nonimmediate_operand")
-   (match_operand:SWI48 1 "<general_operand>")))])
+   (compare:CC (match_operand:SWI 0 "nonimmediate_operand")
+   (match_operand:SWI 1 "<general_operand>")))])
 
 (define_mode_iterator SWI1248_AVX512BWDQ_64
   [(QI "TARGET_AVX512DQ") HI
@@ -9981,6 +9981,114 @@ (define_expand "ussub3"
   DONE;
 })
 
+(define_expand "ustruncdi<mode>2"
+  [(set (match_operand:SWI124 0 "register_operand")
+   (us_truncate:DI (match_operand:DI 1 "nonimmediate_operand")))]
+  "TARGET_64BIT"
+{
+  rtx op1 = force_reg (DImode, operands[1]);
+  rtx sat = force_reg (DImode, GEN_INT (GET_MODE_MASK (<MODE>mode)));
+  rtx dst;
+
+  emit_insn (gen_cmpdi_1 (op1, sat));
+
+  if (TARGET_CMOVE)
+{
+  rtx cmp = gen_rtx_GEU (VOIDmode, gen_rtx_REG (CCCmode, FLAGS_REG),
+const0_rtx);
+
+  dst = force_reg (<MODE>mode, operands[0]);
+  emit_insn (gen_movsicc (gen_lowpart (SImode, dst), cmp,
+ gen_lowpart (SImode, op1),
+ gen_lowpart (SImode, sat)));
+}
+  else
+{
+  rtx msk = gen_reg_rtx (<MODE>mode);
+
+  emit_insn (gen_x86_mov<mode>cc_0_m1_neg (msk));
+  dst = expand_simple_binop (<MODE>mode, IOR,
+			     gen_lowpart (<MODE>mode, op1), msk,
+			     operands[0], 1, OPTAB_WIDEN);
+}
+
+  if (!rtx_equal_p (dst, operands[0]))
+emit_move_insn (operands[0], dst);
+  DONE;
+})
+
+(define_expand "ustruncsi<mode>2"
+  [(set (match_operand:SWI12 0 "register_operand")
+   (us_truncate:SI (match_operand:SI 1 "nonimmediate_operand")))]
+  ""
+{
+  rtx op1 = force_reg (SImode, operands[1]);
+  rtx sat = force_reg (SImode, GEN_INT (GET_MODE_MASK (<MODE>mode)));
+  rtx dst;
+
+  emit_insn (gen_cmpsi_1 (op1, sat));
+
+  if (TARGET_CMOVE)
+{
+  rtx cmp = gen_rtx_GEU (VOIDmode, gen_rtx_REG (CCCmode, FLAGS_REG),
+const0_rtx);
+
+  dst = force_reg (<MODE>mode, operands[0]);
+  emit_insn (gen_movsicc (gen_lowpart (SImode, dst), cmp,
+ gen_lowpart (SImode, op1),
+ gen_lowpart (SImode, sat)));
+}
+  else
+{
+  rtx msk = gen_reg_rtx (<MODE>mode);
+
+  emit_insn (gen_x86_mov<mode>cc_0_m1_neg (msk));
+  dst = expand_simple_binop (<MODE>mode, IOR,
+			     gen_lowpart (<MODE>mode, op1), msk,
+			     operands[0], 1, OPTAB_WIDEN);
+}
+
+  if (!rtx_equal_p (dst, operands[0]))
+emit_move_insn (operands[0], dst);
+  DONE;
+})
+
+(define_expand "ustrunchiqi2"
+  [(set (match_operand:QI 0 "register_operand")
+   (us_truncate:HI (match_operand:HI 1 "nonimmediate_operand")))]
+  ""
+{
+  rtx op1 = force_reg (HImode, operands[1]);
+  rtx sat = force_reg (HImode, GEN_INT (GET_MODE_MASK (QImode)));
+  rtx dst;
+
+  emit_insn (gen_cmphi_1 (op1, sat));
+
+  if (TARGET_CMOVE)
+{
+  rtx cmp = gen_rtx_GEU (VOIDmode, gen_rtx_REG (CCCmode, FLAGS_REG),
+const0_rtx);
+
+  dst = force_reg (QImode, operands[0]);
+  emit_insn (gen_movsicc (gen_lowpart (SImode, dst), cmp,
+ gen_lowpart (SImode, op1),
+ gen_lowpart (SImode, sat)));
+}
+  else
+{
+  rtx msk = gen_reg_rtx (QImode);
+
+  emit_insn (gen_x86_movqicc_0_m1_neg (msk));
+  dst = expand_simple_binop (QImode, IOR,
+			     gen_lowpart (QImode, op1), msk,
+			     operands[0], 1, OPTAB_WIDEN);
+}
+
+  if (!rtx_equal_p (dst, operands[0]))
+emit_move_insn (operands[0], dst);
+  DONE;
+})
+
 ;; The patterns that match these are at the end of this file.
 
 (define_expand "<plusminus_insn>xf3"
diff --git a/gcc/testsuite/gcc.target/i386/sattrunc-1.c b/gcc/testsuite/gcc

Re: [PATCH] i386: Correct AVX10 CPUID emulation

2024-07-09 Thread Uros Bizjak
On Tue, Jul 9, 2024 at 10:38 AM Haochen Jiang  wrote:
>
> Hi all,
>
> AVX10 documentation has specified the ecx value as 0 for the AVX10 version and
> vector size under 0x24 subleaf. Although for ecx=1, the bits are all
> reserved for now, we still need to specify ecx as 0 to avoid dirty
> value in ecx.
>
> Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk and backport to GCC14?
>
> Reference:
>
> Intel Advanced Vector Extensions 10 (Intel AVX10) Architecture Specification
>
> https://cdrdv2.intel.com/v1/dl/getContent/784267
>
> It describes the Intel Advanced Vector Extensions 10 Instruction Set 
> Architecture.
>
> Thx,
> Haochen
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features): Correct
> AVX10 CPUID emulation to specify ecx value.

OK.

Thanks,
Uros.

> ---
>  gcc/common/config/i386/cpuinfo.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/common/config/i386/cpuinfo.h 
> b/gcc/common/config/i386/cpuinfo.h
> index 936039725ab..2ae77d335d2 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -998,10 +998,10 @@ get_available_features (struct __processor_model 
> *cpu_model,
> }
>  }
>
> -  /* Get Advanced Features at level 0x24 (eax = 0x24).  */
> +  /* Get Advanced Features at level 0x24 (eax = 0x24, ecx = 0).  */
>if (avx10_set && max_cpuid_level >= 0x24)
>  {
> -  __cpuid (0x24, eax, ebx, ecx, edx);
> +  __cpuid_count (0x24, 0, eax, ebx, ecx, edx);
>version = ebx & 0xff;
>if (ebx & bit_AVX10_256)
> switch (version)
> --
> 2.31.1
>


[committed] i386: Promote {QI, HI}mode x86_movcc_0_m1_neg to SImode

2024-07-08 Thread Uros Bizjak
Promote the HImode x86_mov<mode>cc_0_m1_neg insn to SImode to avoid
redundant prefixes.  Also promote the QImode insn when TARGET_PROMOTE_QImode
is set.  This is similar to the promotable_binary_operator splitter, where
we promote the result to SImode.

Also correct the insn condition of the SImode splitters for the NEG and NOT
instructions.  The sizes of QImode and SImode instructions are always
the same, so there is no need for the optimize_insn_for_size bypass.
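
For reference, a kind of source that typically reaches this pattern (an
illustrative sketch, not taken from the patch): an unsigned comparison
materialized as a 0/-1 mask, i.e. a cmp followed by sbb.

unsigned short
mask_below (unsigned short a, unsigned short b)
{
  /* 0xffff when a < b, 0 otherwise; this usually expands via cmp + sbb,
     the x86_mov<mode>cc_0_m1_neg shape being promoted here.  */
  return a < b ? -1 : 0;
}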

gcc/ChangeLog:

* config/i386/i386.md (x86_mov<mode>cc_0_m1_neg splitter to SImode):
New splitter.
(NEG and NOT splitter to SImode): Remove optimize_insn_for_size_p
predicate from insn condition.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b24c4fe5875..214cb2e239a 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -26576,9 +26576,7 @@ (define_split
(clobber (reg:CC FLAGS_REG))]
   "! TARGET_PARTIAL_REG_STALL && reload_completed
&& (GET_MODE (operands[0]) == HImode
-   || (GET_MODE (operands[0]) == QImode
-  && (TARGET_PROMOTE_QImode
-  || optimize_insn_for_size_p ())))"
+   || (GET_MODE (operands[0]) == QImode && TARGET_PROMOTE_QImode))"
   [(parallel [(set (match_dup 0)
   (neg:SI (match_dup 1)))
  (clobber (reg:CC FLAGS_REG))])]
@@ -26593,15 +26591,30 @@ (define_split
(not (match_operand 1 "general_reg_operand")))]
   "! TARGET_PARTIAL_REG_STALL && reload_completed
&& (GET_MODE (operands[0]) == HImode
-   || (GET_MODE (operands[0]) == QImode
-  && (TARGET_PROMOTE_QImode
-  || optimize_insn_for_size_p ())))"
+   || (GET_MODE (operands[0]) == QImode && TARGET_PROMOTE_QImode))"
   [(set (match_dup 0)
(not:SI (match_dup 1)))]
 {
   operands[0] = gen_lowpart (SImode, operands[0]);
   operands[1] = gen_lowpart (SImode, operands[1]);
 })
+
+(define_split
+  [(set (match_operand 0 "general_reg_operand")
+   (neg (match_operator 1 "ix86_carry_flag_operator"
+ [(reg FLAGS_REG) (const_int 0)])))
+   (clobber (reg:CC FLAGS_REG))]
+  "! TARGET_PARTIAL_REG_STALL && reload_completed
+   && (GET_MODE (operands[0]) == HImode
+   || (GET_MODE (operands[0]) == QImode && TARGET_PROMOTE_QImode))"
+  [(parallel [(set (match_dup 0)
+  (neg:SI (match_dup 1)))
+ (clobber (reg:CC FLAGS_REG))])]
+{
+  operands[0] = gen_lowpart (SImode, operands[0]);
+  operands[1] = shallow_copy_rtx (operands[1]);
+  PUT_MODE (operands[1], SImode);
+})
 
 ;; RTL Peephole optimizations, run before sched2.  These primarily look to
 ;; transform a complex memory operation into two memory to register operations.


Re: [PATCH v2] i386: Refactor ssedoublemode

2024-07-05 Thread Uros Bizjak
On Fri, Jul 5, 2024 at 9:07 AM Hu, Lin1  wrote:
>
> I Modified the changelog and comments.
>
> ssedoublemode's double should mean double type, like SI -> DI.
> And we need to refactor some patterns with  instead of
> .
>
> BRs,
> Lin
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (ssedoublemode): Remove mappings to double
>   of elements and mapping vector mode to the same number of
>   double sized elements.

Better write: "Remove mappings to twice the number of same-sized
elements.  Add mappings to the same number of double-sized elements."

>   (define_split for vec_concat_minus_plus): Change mode_attr from
>   ssedoublemode to ssedoublevecmode.
>   (define_split for vec_concat_plus_minus): Ditto.
>   (avx512dq_shuf_<shuffletype>64x2_1<mask_name>): Ditto.
>   (avx512f_shuf_<shuffletype>64x2_1<mask_name>): Ditto.
>   (avx512vl_shuf_<shuffletype>32x4_1<mask_name>): Ditto.
>   (avx512f_shuf_<shuffletype>32x4_1<mask_name>): Ditto.

OK with the above ChangeLog adjustment.

Thanks,
Uros.

> ---
>  gcc/config/i386/sse.md | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index d71b0f2567e..bda66d5e121 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -808,13 +808,12 @@ (define_mode_attr ssedoublemodelower
> (V8HI "v8si")   (V16HI "v16si") (V32HI "v32si")
> (V4SI "v4di")   (V8SI "v8di")   (V16SI "v16di")])
>
> +;; Map vector mode to the same number of double sized elements.
>  (define_mode_attr ssedoublemode
> -  [(V4SF "V8SF") (V8SF "V16SF") (V16SF "V32SF")
> -   (V2DF "V4DF") (V4DF "V8DF") (V8DF "V16DF")
> +  [(V4SF "V4DF") (V8SF "V8DF") (V16SF "V16DF")
> (V16QI "V16HI") (V32QI "V32HI") (V64QI "V64HI")
> (V8HI "V8SI") (V16HI "V16SI") (V32HI "V32SI")
> -   (V4SI "V4DI") (V8SI "V16SI") (V16SI "V32SI")
> -   (V4DI "V8DI") (V8DI "V16DI")])
> +   (V4SI "V4DI") (V8SI "V8DI") (V16SI "V16DI")])
>
>  (define_mode_attr ssebytemode
>[(V8DI "V64QI") (V4DI "V32QI") (V2DI "V16QI")
> @@ -3319,7 +3318,7 @@ (define_split
>  (define_split
>[(set (match_operand:VF_128_256 0 "register_operand")
> (match_operator:VF_128_256 7 "addsub_vs_operator"
> - [(vec_concat:<ssedoublemode>
> + [(vec_concat:<ssedoublevecmode>
>  (minus:VF_128_256
>(match_operand:VF_128_256 1 "register_operand")
>(match_operand:VF_128_256 2 "vector_operand"))
> @@ -3353,7 +3352,7 @@ (define_split
>  (define_split
>[(set (match_operand:VF_128_256 0 "register_operand")
> (match_operator:VF_128_256 7 "addsub_vs_operator"
> - [(vec_concat:<ssedoublemode>
> + [(vec_concat:<ssedoublevecmode>
>  (plus:VF_128_256
>(match_operand:VF_128_256 1 "vector_operand")
>(match_operand:VF_128_256 2 "vector_operand"))
> @@ -19869,7 +19868,7 @@ (define_expand "avx512dq_shuf_<shuffletype>64x2_mask"
>  (define_insn "avx512dq_shuf_<shuffletype>64x2_1<mask_name>"
>[(set (match_operand:VI8F_256 0 "register_operand" "=x,v")
> (vec_select:VI8F_256
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:VI8F_256 1 "register_operand" "x,v")
> (match_operand:VI8F_256 2 "nonimmediate_operand" "xjm,vm"))
>   (parallel [(match_operand 3 "const_0_to_3_operand")
> @@ -19922,7 +19921,7 @@ (define_expand "avx512f_shuf_<shuffletype>64x2_mask"
>  (define_insn "avx512f_shuf_<shuffletype>64x2_1<mask_name>"
>[(set (match_operand:V8FI 0 "register_operand" "=v")
> (vec_select:V8FI
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:V8FI 1 "register_operand" "v")
> (match_operand:V8FI 2 "nonimmediate_operand" "vm"))
>   (parallel [(match_operand 3 "const_0_to_7_operand")
> @@ -20020,7 +20019,7 @@ (define_expand "avx512vl_shuf_<shuffletype>32x4_mask"
>  (define_insn "avx512vl_shuf_<shuffletype>32x4_1<mask_name>"
>[(set (match_operand:VI4F_256 0 "register_operand" "=x,v")
> (vec_select:VI4F_256
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:VI4F_256 1 "register_operand" "x,v")
> (match_operand:VI4F_256 2 "nonimmediate_operand" "xjm,vm"))
>   (parallel [(match_operand 3 "const_0_to_7_operand")
> @@ -20091,7 +20090,7 @@ (define_expand "avx512f_shuf_<shuffletype>32x4_mask"
>  (define_insn "avx512f_shuf_<shuffletype>32x4_1<mask_name>"
>[(set (match_operand:V16FI 0 "register_operand" "=v")
> (vec_select:V16FI
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:V16FI 1 "register_operand" "v")
> (match_operand:V16FI 2 "nonimmediate_operand" "vm"))
>   (parallel [(match_operand 3 "const_0_to_15_operand")
> --
> 2.31.1
>


Re: [PATCH] i386: Refactor ssedoublemode

2024-07-04 Thread Uros Bizjak
On Fri, Jul 5, 2024 at 7:48 AM Hu, Lin1  wrote:
>
> Hi, all
>
> ssedoublemode's double should mean double type, like SI -> DI.
> And we need to refactor some patterns with  instead of
> .
>
> Bootstrapped and regtested on x86-64-linux-gnu, OK for trunk?
>
> BRs,
> Lin
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (ssedoublemode): Fix the mode_attr.

Please be more descriptive, like ": Remove mappings to double of
elements". Please also add names of changed patterns to ChangeLog.

> ---
>  gcc/config/i386/sse.md | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index d71b0f2567e..d06ce94fa55 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -808,13 +808,12 @@ (define_mode_attr ssedoublemodelower
> (V8HI "v8si")   (V16HI "v16si") (V32HI "v32si")
> (V4SI "v4di")   (V8SI "v8di")   (V16SI "v16di")])
>
> +;; ssedoublemode means vector mode with same number of double-size.

Better say: Map vector mode to the same number of double sized elements.

Uros.

>  (define_mode_attr ssedoublemode
> -  [(V4SF "V8SF") (V8SF "V16SF") (V16SF "V32SF")
> -   (V2DF "V4DF") (V4DF "V8DF") (V8DF "V16DF")
> +  [(V4SF "V4DF") (V8SF "V8DF") (V16SF "V16DF")
> (V16QI "V16HI") (V32QI "V32HI") (V64QI "V64HI")
> (V8HI "V8SI") (V16HI "V16SI") (V32HI "V32SI")
> -   (V4SI "V4DI") (V8SI "V16SI") (V16SI "V32SI")
> -   (V4DI "V8DI") (V8DI "V16DI")])
> +   (V4SI "V4DI") (V8SI "V8DI") (V16SI "V16DI")])
>
>  (define_mode_attr ssebytemode
>[(V8DI "V64QI") (V4DI "V32QI") (V2DI "V16QI")
> @@ -3319,7 +3318,7 @@ (define_split
>  (define_split
>[(set (match_operand:VF_128_256 0 "register_operand")
> (match_operator:VF_128_256 7 "addsub_vs_operator"
> - [(vec_concat:<ssedoublemode>
> + [(vec_concat:<ssedoublevecmode>
>  (minus:VF_128_256
>(match_operand:VF_128_256 1 "register_operand")
>(match_operand:VF_128_256 2 "vector_operand"))
> @@ -3353,7 +3352,7 @@ (define_split
>  (define_split
>[(set (match_operand:VF_128_256 0 "register_operand")
> (match_operator:VF_128_256 7 "addsub_vs_operator"
> - [(vec_concat:<ssedoublemode>
> + [(vec_concat:<ssedoublevecmode>
>  (plus:VF_128_256
>(match_operand:VF_128_256 1 "vector_operand")
>(match_operand:VF_128_256 2 "vector_operand"))
> @@ -19869,7 +19868,7 @@ (define_expand "avx512dq_shuf_<shuffletype>64x2_mask"
>  (define_insn "avx512dq_shuf_<shuffletype>64x2_1<mask_name>"
>[(set (match_operand:VI8F_256 0 "register_operand" "=x,v")
> (vec_select:VI8F_256
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:VI8F_256 1 "register_operand" "x,v")
> (match_operand:VI8F_256 2 "nonimmediate_operand" "xjm,vm"))
>   (parallel [(match_operand 3 "const_0_to_3_operand")
> @@ -19922,7 +19921,7 @@ (define_expand "avx512f_shuf_<shuffletype>64x2_mask"
>  (define_insn "avx512f_shuf_<shuffletype>64x2_1<mask_name>"
>[(set (match_operand:V8FI 0 "register_operand" "=v")
> (vec_select:V8FI
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:V8FI 1 "register_operand" "v")
> (match_operand:V8FI 2 "nonimmediate_operand" "vm"))
>   (parallel [(match_operand 3 "const_0_to_7_operand")
> @@ -20020,7 +20019,7 @@ (define_expand "avx512vl_shuf_<shuffletype>32x4_mask"
>  (define_insn "avx512vl_shuf_<shuffletype>32x4_1<mask_name>"
>[(set (match_operand:VI4F_256 0 "register_operand" "=x,v")
> (vec_select:VI4F_256
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:VI4F_256 1 "register_operand" "x,v")
> (match_operand:VI4F_256 2 "nonimmediate_operand" "xjm,vm"))
>   (parallel [(match_operand 3 "const_0_to_7_operand")
> @@ -20091,7 +20090,7 @@ (define_expand "avx512f_shuf_<shuffletype>32x4_mask"
>  (define_insn "avx512f_shuf_<shuffletype>32x4_1<mask_name>"
>[(set (match_operand:V16FI 0 "register_operand" "=v")
> (vec_select:V16FI
> - (vec_concat:<ssedoublemode>
> + (vec_concat:<ssedoublevecmode>
> (match_operand:V16FI 1 "register_operand" "v")
> (match_operand:V16FI 2 "nonimmediate_operand" "vm"))
>   (parallel [(match_operand 3 "const_0_to_15_operand")
> --
> 2.31.1
>


Re: [x86 PATCH] Add additional variant of bswaphisi2_lowpart peephole2.

2024-07-01 Thread Uros Bizjak
On Mon, Jul 1, 2024 at 3:20 PM Roger Sayle  wrote:
>
>
> This patch adds an additional variation of the peephole2 used to convert
> bswaphisi2_lowpart into rotlhi3_1_slp, which converts xchgb %ah,%al into
> rotw if the flags register isn't live.  The motivating example is:
>
> void ext(int x);
> void foo(int x)
> {
>   ext((x&~0xffff)|((x>>8)&0xff)|((x&0xff)<<8));
> }
>
> where GCC with -O2 currently produces:
>
> foo:    movl    %edi, %eax
>         rolw    $8, %ax
>         movl    %eax, %edi
>         jmp     ext
>
> The issue is that the original xchgb (bswaphisi2_lowpart) can only be
> performed in "Q" registers that allow the %?h register to be used, so
> reload generates the above two movl.  However, it's later in peephole2
> where we see that CC_FLAGS can be clobbered, so we can use a rotate word,
> which is more forgiving with register allocations.  With the additional
> peephole2 proposed here, we now generate:
>
> foo:    rolw    $8, %di
>         jmp     ext
>
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-07-01  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.md (bswaphisi2_lowpart peephole2): New
> peephole2 variant to eliminate register shuffling.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/xchg-4.c: New test case.

OK.

Thanks,
Uros.

>
>
> Thanks again,
> Roger
> --
>


Re: [x86 PATCH]: Additional peephole2 to use lea in round-up integer division.

2024-06-30 Thread Uros Bizjak
On Sun, Jun 30, 2024 at 9:09 PM Roger Sayle  wrote:
>
>
> Hi Uros,
> > On Sat, Jun 29, 2024 at 6:21 PM Roger Sayle 
> > wrote:
> > > A common idiom for implementing an integer division that rounds
> > > upwards is to write (x + y - 1) / y.  Conveniently on x86, the two
> > > additions to form the numerator can be performed by a single lea
> > > instruction, and indeed gcc currently generates a lea when x and y both
> > registers.
> > >
> > > int foo(int x, int y) {
> > >   return (x+y-1)/y;
> > > }
> > >
> > > generates with -O2:
> > >
> > > foo:    leal    -1(%rsi,%rdi), %eax // 4 bytes
> > > cltd
> > > idivl   %esi
> > > ret
> > >
> > > Oddly, however, if x is a memory, gcc currently uses two instructions:
> > >
> > > int m;
> > > int bar(int y) {
> > >   return (m+y-1)/y;
> > > }
> > >
> > > generates:
> > >
> > > foo:    movl    m(%rip), %eax
> > >         addl    %edi, %eax  // 2 bytes
> > >         subl    $1, %eax    // 3 bytes
> > > cltd
> > > idivl   %edi
> > > ret
> > >
> > > This discrepancy is caused by the late decision (in peephole2) to
> > > split an addition with a memory operand, into a load followed by a
> > > reg-reg addition.  This patch improves this situation by adding a
> > > peephole2 to recognized consecutive additions and transform them into
> > > lea if profitable.
> > >
> > > My first attempt at fixing this was to use a define_insn_and_split:
> > >
> > > (define_insn_and_split "*lea<mode>3_reg_mem_imm"
> > >   [(set (match_operand:SWI48 0 "register_operand")
> > >(plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
> > >(match_operand:SWI48 2 "memory_operand"))
> > >(match_operand:SWI48 3 "x86_64_immediate_operand")))]
> > >   "ix86_pre_reload_split ()"
> > >   "#"
> > >   "&& 1"
> > >   [(set (match_dup 4) (match_dup 2))
> > >(set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
> > >  (match_dup 3)))]
> > >   "operands[4] = gen_reg_rtx (mode);")
> > >
> > > using combine to combine instructions.  Unfortunately, this approach
> > > interferes with (reload's) subtle balance of deciding when to
> > > use/avoid lea, which can be observed as a code size regression in
> > > CSiBE.  The peephole2 approach (proposed here) uniformly improves CSiBE
> > results.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2024-06-29  Roger Sayle  
> > >
> > > gcc/ChangeLog
> > > * config/i386/i386.md (peephole2): Transform two consecutive
> > > additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
> > >
> > > gcc/testsuite/ChageLog
> > > * gcc.target/i386/lea-3.c: New test case.
> >
> > Is the assumption that one LEA is always faster than two ADD instructions
> > universally correct for TARGET_AVOID_LEA_FOR_ADDR?
> >
> > Please note ix86_lea_outperforms predicate and its uses in
> > ix86_avoid_lea_for_add(), ix86_use_lea_for_mov(),
> > ix86_avoid_lea_for_addr() and ix86_lea_for_add_ok(). IMO,
> > !avoid_lea_for_addr() should be used here, but I didn't check it thoroughly.
> >
> > The function comment of avoid_lea_for_addr() says:
> >
> > /* Return true if we need to split lea into a sequence of
> >instructions to avoid AGU stalls during peephole2. */
> >
> > And your peephole tries to reverse the above split.
>
> I completely agree that understanding when/why i386.md converts
> an lea into a sequence of additions (and avoiding reversing this split)
> is vitally important to understanding my patch.  You're quite right that
> the logic governing this ultimately calls ix86_lea_outperforms, but as
> I'll explain below the shape of those APIs (requiring an insn) is not as
> convenient for instruction merging as for splitting.
>
> The current location in i386.md where it decides whether the
> lea in the foo example above needs to be split, is at line 6293:
>
> (define_peephole2
>   [(set (match_operand:SWI48 0 "register_operand")
> (match_operand:SWI48 1 "address_no_seg_operand"))]
>   "ix86_hardreg_mov_ok (operands[0], operands[1])
>&& peep2_regno_dead_p (0, FLAGS_REG)
>&& ix86_avoid_lea_for_addr (peep2_next_insn (0), operands)"
> ...
>
> Hence, we transform lea->add+add when ix86_avoid_lea_for_addr
> returns true, so by symmetry is not unreasonable to turn add+add->lea
> when ix86_avoid_lea_for_addr would return false.  The relevant part
> of ix86_avoid_lea_for_addr is then around line 15974 of i386.cc:
>
>   /* Check we need to optimize.  */
>   if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
> return false;
>
> which you'll recognize is precisely the condition under which my
> proposed peephole2 fires.  Technically, we also know that this is
> a 3-component lea, "p

Re: [x86 PATCH]: Additional peephole2 to use lea in round-up integer division.

2024-06-30 Thread Uros Bizjak
On Sat, Jun 29, 2024 at 6:21 PM Roger Sayle  wrote:
>
>
> A common idiom for implementing an integer division that rounds upwards is
> to write (x + y - 1) / y.  Conveniently on x86, the two additions to form
> the numerator can be performed by a single lea instruction, and indeed gcc
> currently generates a lea when x and y both registers.
>
> int foo(int x, int y) {
>   return (x+y-1)/y;
> }
>
> generates with -O2:
>
> foo:    leal    -1(%rsi,%rdi), %eax // 4 bytes
> cltd
> idivl   %esi
> ret
>
> Oddly, however, if x is a memory, gcc currently uses two instructions:
>
> int m;
> int bar(int y) {
>   return (m+y-1)/y;
> }
>
> generates:
>
> foo:    movl    m(%rip), %eax
>         addl    %edi, %eax  // 2 bytes
>         subl    $1, %eax    // 3 bytes
> cltd
> idivl   %edi
> ret
>
> This discrepancy is caused by the late decision (in peephole2) to split
> an addition with a memory operand, into a load followed by a reg-reg
> addition.  This patch improves this situation by adding a peephole2
> to recognized consecutive additions and transform them into lea if
> profitable.
>
> My first attempt at fixing this was to use a define_insn_and_split:
>
> (define_insn_and_split "*lea<mode>3_reg_mem_imm"
>   [(set (match_operand:SWI48 0 "register_operand")
>(plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
>(match_operand:SWI48 2 "memory_operand"))
>(match_operand:SWI48 3 "x86_64_immediate_operand")))]
>   "ix86_pre_reload_split ()"
>   "#"
>   "&& 1"
>   [(set (match_dup 4) (match_dup 2))
>(set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
>  (match_dup 3)))]
>   "operands[4] = gen_reg_rtx (mode);")
>
> using combine to combine instructions.  Unfortunately, this approach
> interferes with (reload's) subtle balance of deciding when to use/avoid lea,
> which can be observed as a code size regression in CSiBE.  The peephole2
> approach (proposed here) uniformly improves CSiBE results.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-06-29  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.md (peephole2): Transform two consecutive
> additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
>
> gcc/testsuite/ChageLog
> * gcc.target/i386/lea-3.c: New test case.

Is the assumption that one LEA is always faster than two ADD
instructions universally correct for TARGET_AVOID_LEA_FOR_ADDR?

Please note ix86_lea_outperforms predicate and its uses in
ix86_avoid_lea_for_add(), ix86_use_lea_for_mov(),
ix86_avoid_lea_for_addr() and ix86_lea_for_add_ok(). IMO,
!avoid_lea_for_addr() should be used here, but I didn't check it
thoroughly.

The function comment of avoid_lea_for_addr() says:

/* Return true if we need to split lea into a sequence of
   instructions to avoid AGU stalls during peephole2. */

And your peephole tries to reverse the above split.

Uros.

>
>
> Thanks in advance,
> Roger
> --
>


[PATCH] i386: Cleanup tmp variable usage in ix86_expand_move

2024-06-28 Thread Uros Bizjak
Remove extra assignment, extra temp variable and variable shadowing.

No functional changes intended.
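
As a stand-alone illustration of the shadowing class being removed
(hypothetical code, not the GCC sources):

#include <stdio.h>

void
shadow_demo (void)
{
  int tmp = 1;
  {
    int tmp = 2;            /* shadows the outer 'tmp' */
    printf ("%d\n", tmp);   /* prints 2 */
  }
  printf ("%d\n", tmp);     /* prints 1; the outer variable was
                               hidden, not modified */
}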

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_expand_move): Remove extra
assignment to tmp variable, reuse tmp variable instead of
declaring new temporary variable and remove tmp variable shadowing.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Also built crosscompiler to x86_64-pc-cygwin and x86_64-apple-darwin16.

Uros.
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index a4434c19272..a773b45bf03 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -414,9 +414,6 @@ ix86_expand_move (machine_mode mode, rtx operands[])
{
 #if TARGET_PECOFF
  tmp = legitimize_pe_coff_symbol (op1, addend != NULL_RTX);
-#else
- tmp = NULL_RTX;
-#endif
 
  if (tmp)
{
@@ -425,6 +422,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
break;
}
  else
+#endif
{
  op1 = operands[1];
  break;
@@ -482,12 +480,12 @@ ix86_expand_move (machine_mode mode, rtx operands[])
  /* dynamic-no-pic */
  if (MACHOPIC_INDIRECT)
{
- rtx temp = (op0 && REG_P (op0) && mode == Pmode)
-? op0 : gen_reg_rtx (Pmode);
- op1 = machopic_indirect_data_reference (op1, temp);
+ tmp = (op0 && REG_P (op0) && mode == Pmode)
+   ? op0 : gen_reg_rtx (Pmode);
+ op1 = machopic_indirect_data_reference (op1, tmp);
  if (MACHOPIC_PURE)
op1 = machopic_legitimize_pic_address (op1, mode,
-  temp == op1 ? 0 : temp);
+  tmp == op1 ? 0 : tmp);
}
  if (op0 != op1 && GET_CODE (op0) != MEM)
{
@@ -542,9 +540,9 @@ ix86_expand_move (machine_mode mode, rtx operands[])
  op1 = validize_mem (force_const_mem (mode, op1));
  if (!register_operand (op0, mode))
{
- rtx temp = gen_reg_rtx (mode);
- emit_insn (gen_rtx_SET (temp, op1));
- emit_move_insn (op0, temp);
+ tmp = gen_reg_rtx (mode);
+ emit_insn (gen_rtx_SET (tmp, op1));
+ emit_move_insn (op0, tmp);
  return;
}
}
@@ -565,7 +563,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
   if (SUBREG_BYTE (op0) == 0)
{
  wide_int mask = wi::mask (64, true, 128);
- rtx tmp = immed_wide_int_const (mask, TImode);
+ tmp = immed_wide_int_const (mask, TImode);
  op0 = SUBREG_REG (op0);
  tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
  if (mode == DFmode)
@@ -577,7 +575,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
   else if (SUBREG_BYTE (op0) == 8)
{
  wide_int mask = wi::mask (64, false, 128);
- rtx tmp = immed_wide_int_const (mask, TImode);
+ tmp = immed_wide_int_const (mask, TImode);
  op0 = SUBREG_REG (op0);
  tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
  if (mode == DFmode)


Re: [PATCH] i386: Fix regression after refactoring legitimize_pe_coff_symbol, ix86_GOT_alias_set and PE_COFF_LEGITIMIZE_EXTERN_DECL

2024-06-28 Thread Uros Bizjak
On Fri, Jun 28, 2024 at 1:41 PM Evgeny Karpov
 wrote:
>
> Thursday, June 27, 2024 8:13 PM
> Uros Bizjak  wrote:
>
> >
> > So, there is no problem having #endif just after else.
> >
> > Anyway, it's your call, this is not a hill I'm willing to die on. ;)
> >
> > Thanks,
> > Uros.
>
> It looks like the patch resolves 3 reported issues.
> Uros, I suggest merging the patch as it is, without minor refactoring, to 
> avoid triggering another round of testing, if you agree.

Yes, please go ahead.

Thanks,
Uros.


Re: [PATCH 3/3] [x86] Enable flate-combine.

2024-06-27 Thread Uros Bizjak
On Fri, Jun 28, 2024 at 7:29 AM liuhongt  wrote:
>
> Move pass_stv2 and pass_rpad after the pre_reload pass_late_combine, and
> define target_insn_cost to prevent the post_reload pass_late_combine from
> reverting the optimization done in pass_rpad.
>
> Adjust testcases since pass_late_combine generates better code but
> breaks scan-assembler patterns.
>
> I.e., under a 32-bit target, gcc used to generate a broadcast from the
> stack and then do the real operation.
> After late_combine, they are combined into embedded broadcast
> operations.
>
> gcc/ChangeLog:
>
> * config/i386/i386-features.cc (ix86_rpad_gate): New function.
> * config/i386/i386-options.cc (ix86_override_options_after_change):
> Don't disable flate_combine.
> * config/i386/i386-passes.def: Move pass_stv2 and pass_rpad
> after pre_reload pas_late_combine.
> * config/i386/i386-protos.h (ix86_rpad_gate): New declare.
> * config/i386/i386.cc (ix86_insn_cost): New function.
> (TARGET_INSN_COST): Define.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Adjust
> testcase.
> * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Ditto.
> * gcc.target/i386/avx512f-fmadd-sf-zmm-7.c: Ditto.
> * gcc.target/i386/avx512f-fmsub-sf-zmm-7.c: Ditto.
> * gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c: Ditto.
> * gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c: Ditto.
> * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Ditto.
> * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Ditto.
> * gcc.target/i386/pr91333.c: Ditto.
> * gcc.target/i386/vect-strided-4.c: Ditto.

LGTM.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-features.cc   | 16 +++-
>  gcc/config/i386/i386-options.cc|  4 
>  gcc/config/i386/i386-passes.def|  4 ++--
>  gcc/config/i386/i386-protos.h  |  1 +
>  gcc/config/i386/i386.cc| 18 ++
>  .../i386/avx512f-broadcast-pr87767-1.c |  4 ++--
>  .../i386/avx512f-broadcast-pr87767-5.c |  1 -
>  .../gcc.target/i386/avx512f-fmadd-sf-zmm-7.c   |  2 +-
>  .../gcc.target/i386/avx512f-fmsub-sf-zmm-7.c   |  2 +-
>  .../gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c  |  2 +-
>  .../gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c  |  2 +-
>  .../i386/avx512vl-broadcast-pr87767-1.c|  4 ++--
>  .../i386/avx512vl-broadcast-pr87767-5.c|  2 --
>  gcc/testsuite/gcc.target/i386/pr91333.c|  2 +-
>  gcc/testsuite/gcc.target/i386/vect-strided-4.c |  2 +-
>  15 files changed, 42 insertions(+), 24 deletions(-)
>
> diff --git a/gcc/config/i386/i386-features.cc 
> b/gcc/config/i386/i386-features.cc
> index 607d1991460..fc224ed06b0 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -2995,6 +2995,16 @@ make_pass_insert_endbr_and_patchable_area 
> (gcc::context *ctxt)
>return new pass_insert_endbr_and_patchable_area (ctxt);
>  }
>
> +bool
> +ix86_rpad_gate ()
> +{
> +  return (TARGET_AVX
> + && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> + && TARGET_SSE_MATH
> + && optimize
> + && optimize_function_for_speed_p (cfun));
> +}
> +
>  /* At entry of the nearest common dominator for basic blocks with
> conversions/rcp/sqrt/rsqrt/round, generate a single
> vxorps %xmmN, %xmmN, %xmmN
> @@ -3232,11 +3242,7 @@ public:
>/* opt_pass methods: */
>bool gate (function *) final override
>  {
> -  return (TARGET_AVX
> - && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> - && TARGET_SSE_MATH
> - && optimize
> - && optimize_function_for_speed_p (cfun));
> +  return ix86_rpad_gate ();
>  }
>
>unsigned int execute (function *) final override
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 9c12d498928..1ef2c71a7a2 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1944,10 +1944,6 @@ ix86_override_options_after_change (void)
> flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>  }
>
> -  /* Late combine tends to undo some of the effects of STV and RPAD,
> - by combining instructions back to their original form.  */
> -  if (!OPTION_SET_P (flag_late_combine_instructions))
> -flag_late_combine_instructions = 0;
>  }
>
>  /* Clear stack slot assignments remembered from previous functions.
> diff --git a/gcc/config/i386/i386-passes.def b/gcc/config/i386/i386-passes.def
> index 7d96766f7b9..2d29f65da88 100644
> --- a/gcc/config/i386/i386-passes.def
> +++ b/gcc/config/i386/i386-passes.def
> @@ -25,11 +25,11 @@ along with GCC; see the file COPYING3.  If not see
>   */
>
>INSERT_PASS_AFTER (pass_postreload_cse, 1, pass_insert_vzeroupper);
> -  INSERT_PASS_AFTER (pass_combine, 1, pass_stv, false /* timode_p */);
> +  INSERT_PASS_AFTER (pass_late_combine, 1, pass_stv, false /* t

Re: [PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative.

2024-06-27 Thread Uros Bizjak
On Fri, Jun 28, 2024 at 7:29 AM liuhongt  wrote:
>
late_combine will combine lshiftrt + zero_extend into *lshifrtsi3_1_zext,
which causes an extra mov between gpr and kmask; add ?k to the pattern.
>
> gcc/ChangeLog:
>
> PR target/115610
> * config/i386/i386.md (<*insnsi3_zext): Add alternative ?k,
> enable it only for lshiftrt and under avx512bw.
> * config/i386/sse.md (*klshrsi3_1_zext): New define_insn, and
> add corresponding define_split after it.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386.md | 19 +--
>  gcc/config/i386/sse.md  | 28 
>  2 files changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index fd48e764469..57a10c1af48 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -16836,10 +16836,10 @@ (define_insn "*bmi2_<insn>si3_1_zext"
> (set_attr "mode" "SI")])
>
>  (define_insn "*<insn>si3_1_zext"
> -  [(set (match_operand:DI 0 "register_operand" "=r,r,r")
> +  [(set (match_operand:DI 0 "register_operand" "=r,r,r,?k")
> (zero_extend:DI
> - (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "0,rm,rm")
> - (match_operand:QI 2 "nonmemory_operand" "cI,r,cI"))))
> + (any_shiftrt:SI (match_operand:SI 1 "nonimmediate_operand" "0,rm,rm,k")
> + (match_operand:QI 2 "nonmemory_operand" "cI,r,cI,I"))))
> (clobber (reg:CC FLAGS_REG))]
>"TARGET_64BIT
> && ix86_binary_operator_ok (<CODE>, SImode, operands, TARGET_APX_NDD)"
> @@ -16850,6 +16850,8 @@ (define_insn "*si3_1_zext"
>  case TYPE_ISHIFTX:
>return "#";
>
> +case TYPE_MSKLOG:
> +  return "#";
>  default:
>if (operands[2] == const1_rtx
>   && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun))
> @@ -16860,8 +16862,8 @@ (define_insn "*si3_1_zext"
>: "{l}\t{%2, %k0|%k0, %2}";
>  }
>  }
> -  [(set_attr "isa" "*,bmi2,apx_ndd")
> -   (set_attr "type" "ishift,ishiftx,ishift")
> +  [(set_attr "isa" "*,bmi2,apx_ndd,avx512bw")
> +   (set_attr "type" "ishift,ishiftx,ishift,msklog")
> (set (attr "length_immediate")
>   (if_then_else
> (and (match_operand 2 "const1_operand")
> @@ -16869,7 +16871,12 @@ (define_insn "*si3_1_zext"
>  (match_test "optimize_function_for_size_p (cfun)")))
> (const_string "0")
> (const_string "*")))
> -   (set_attr "mode" "SI")])
> +   (set_attr "mode" "SI")
> +   (set (attr "enabled")
> +   (if_then_else
> + (eq_attr "alternative" "3")
> + (symbol_ref "<CODE> == LSHIFTRT && TARGET_AVX512BW")
> + (const_string "*")))])
>
>  ;; Convert shift to the shiftx pattern to avoid flags dependency.
>  (define_split
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 0be2dcd8891..20665a6f097 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -2179,6 +2179,34 @@ (define_split
>  (match_dup 2)))
>(unspec [(const_int 0)] UNSPEC_MASKOP)])])
>
> +(define_insn "*klshrsi3_1_zext"
> +  [(set (match_operand:DI 0 "register_operand" "=k")
> +   (zero_extend:DI
> + (lshiftrt:SI (match_operand:SI 1 "register_operand" "k")
> +  (match_operand 2 "const_0_to_31_operand" "I"))))
> +  (unspec [(const_int 0)] UNSPEC_MASKOP)]
> +  "TARGET_AVX512BW"
> +  "kshiftrd\t{%2, %1, %0|%0, %1, %2}"
> +[(set_attr "type" "msklog")
> +   (set_attr "prefix" "vex")
> +   (set_attr "mode" "SI")])
> +
> +(define_split
> +  [(set (match_operand:DI 0 "mask_reg_operand")
> +   (zero_extend:DI
> + (lshiftrt:SI
> +   (match_operand:SI 1 "mask_reg_operand")
> +   (match_operand 2 "const_0_to_31_operand"))))
> +(clobber (reg:CC FLAGS_REG))]
> +  "TARGET_AVX512BW && reload_completed"
> +  [(parallel
> + [(set (match_dup 0)
> +  (zero_extend:DI
> +(lshiftrt:SI
> +  (match_dup 1)
> +  (match_dup 2))))
> +  (unspec [(const_int 0)] UNSPEC_MASKOP)])])
> +
>  (define_insn "ktest<mode>"
>[(set (reg:CC FLAGS_REG)
> (unspec:CC
> --
> 2.31.1
>


Re: [x86 PATCH] Handle sign_extend like zero_extend in *concatditi3_[346]

2024-06-27 Thread Uros Bizjak
On Thu, Jun 27, 2024 at 9:40 PM Roger Sayle  wrote:
>
>
> This patch generalizes some of the patterns in i386.md that recognize
> double word concatenation, so they handle sign_extend the same way that
> they handle zero_extend in appropriate contexts.
>
> As a motivating example consider the following function:
>
> __int128 foo(long long x, unsigned long long y)
> {
>   return ((__int128)x<<64) | y;
> }
>
> when compiled with -O2, x86_64 currently generates:
>
> foo:    movq    %rdi, %rdx
>         xorl    %eax, %eax
>         xorl    %edi, %edi
>         orq     %rsi, %rax
>         orq     %rdi, %rdx
> ret
>
> with this patch we now generate (the same as if x is unsigned):
>
> foo:    movq    %rsi, %rax
>         movq    %rdi, %rdx
> ret
>
> Treating both extensions the same way using any_extend is valid as
> the top (extended) bits are "unused" after the shift by 64 (or more).
> In theory, the RTL optimizers might consider canonicalizing the form
> of extension used in these cases, but zero_extend is faster on some
> machines, whereas sign extension is supported via addressing modes on
> others, so handling both in the machine description is probably best.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-06-27  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.md (*concat<mode><dwi>3_3): Change zero_extend
> to any_extend in first operand to left shift by mode precision.
> (*concat<mode><dwi>3_4): Likewise.
> (*concat<mode><dwi>3_6): Likewise.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/concatditi-1.c: New test case.

OK.

Thanks,
Uros.

>
>
> Thanks in advance,
> Roger
> --
>


Re: [PATCH] i386: Fix regression after refactoring legitimize_pe_coff_symbol, ix86_GOT_alias_set and PE_COFF_LEGITIMIZE_EXTERN_DECL

2024-06-27 Thread Uros Bizjak
On Thu, Jun 27, 2024 at 12:50 PM Evgeny Karpov
 wrote:
>
> Thursday, June 27, 2024 10:39 AM
> Uros Bizjak  wrote:
>
> > > diff --git a/gcc/config/i386/i386-expand.cc 
> > > b/gcc/config/i386/i386-expand.cc
> > > index 5dfa7d49f58..20adb42e17b 100644
> > > --- a/gcc/config/i386/i386-expand.cc
> > > +++ b/gcc/config/i386/i386-expand.cc
> > > @@ -414,6 +414,10 @@ ix86_expand_move (machine_mode mode, rtx
> > operands[])
> > >   {
> > >  #if TARGET_PECOFF
> > >tmp = legitimize_pe_coff_symbol (op1, addend != NULL_RTX);
> > > +#else
> > > +tmp = NULL_RTX;
> > > +#endif
> > > +
> > >if (tmp)
> > >  {
> > >op1 = tmp;
> > > @@ -425,7 +429,6 @@ ix86_expand_move (machine_mode mode, rtx
> > operands[])
> > >op1 = operands[1];
> > >break;
> > >  }
> > > -#endif
> > >   }
> > >
> > >if (addend)
> >
> > tmp can only be set by legitimize_pe_coff_symbol, so !TARGET_PECOFF
> > will always get to the "else" part. Do this change simply by moving
> > #endif, like the below:
> >
> > --cut here--
> > iff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index 5dfa7d49f58..407db6c215b 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -421,11 +421,11 @@ ix86_expand_move (machine_mode mode, rtx
> > operands[])
> >break;
> >}
> >  else
> > +#endif
> >{
> >  op1 = operands[1];
> >  break;
> >}
> > -#endif
> >}
> >
> >   if (addend)
> > --cut here--
> >
>
> I would prefer readability in the original version if there are no objections.

The proposed form is how the existing TARGET_MACHO code handles a similar issue.
Please see e.g. i386.cc, around line 6216 and elsewhere:

#if TARGET_MACHO
  if (TARGET_MACHO)
{
  switch_to_section (darwin_sections[picbase_thunk_section]);
  fputs ("\t.weak_definition\t", asm_out_file);
  assemble_name (asm_out_file, name);
  fputs ("\n\t.private_extern\t", asm_out_file);
  assemble_name (asm_out_file, name);
  putc ('\n', asm_out_file);
  ASM_OUTPUT_LABEL (asm_out_file, name);
  DECL_WEAK (decl) = 1;
}
  else
#endif
if (USE_HIDDEN_LINKONCE)
...

So, there is no problem having #endif just after else.

Anyway, it's your call, this is not a hill I'm willing to die on. ;)

Thanks,
Uros.


Re: [PATCH] libgccjit: Add support for machine-dependent builtins

2024-06-27 Thread Uros Bizjak
On Thu, Jun 27, 2024 at 12:49 AM David Malcolm  wrote:
>
> On Thu, 2023-11-23 at 17:17 -0500, Antoni Boucher wrote:
> > Hi.
> > I did split the patch and sent one for the bfloat16 support and
> > another
> > one for the vector support.
> >
> > Here's the updated patch for the machine-dependent builtins.
> >
>
> Thanks for the patch; sorry about the long delay in reviewing it.
>
> CCing Jan and Uros re the i386 part of that patch; for reference the
> patch being discussed is here:
>   https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638027.html
>
> > From e025f95f4790ae861e709caf23cbc0723c1a3804 Mon Sep 17 00:00:00 2001
> > From: Antoni Boucher 
> > Date: Mon, 23 Jan 2023 17:21:15 -0500
> > Subject: [PATCH] libgccjit: Add support for machine-dependent builtins
>
> [...snip...]
>
> > diff --git a/gcc/config/i386/i386-builtins.cc 
> > b/gcc/config/i386/i386-builtins.cc
> > index 42fc3751676..5cc1d6f4d2e 100644
> > --- a/gcc/config/i386/i386-builtins.cc
> > +++ b/gcc/config/i386/i386-builtins.cc
> > @@ -225,6 +225,22 @@ static GTY(()) tree ix86_builtins[(int) 
> > IX86_BUILTIN_MAX];
> >
> >  struct builtin_isa ix86_builtins_isa[(int) IX86_BUILTIN_MAX];
> >
> > +static void
> > +clear_builtin_types (void)
> > +{
> > +  for (int i = 0 ; i < IX86_BT_LAST_CPTR + 1 ; i++)
> > +ix86_builtin_type_tab[i] = NULL;
> > +
> > +  for (int i = 0 ; i < IX86_BUILTIN_MAX ; i++)
> > +  {
> > +ix86_builtins[i] = NULL;
> > +ix86_builtins_isa[i].set_and_not_built_p = true;
> > +  }
> > +
> > +  for (int i = 0 ; i < IX86_BT_LAST_ALIAS + 1 ; i++)
> > +ix86_builtin_func_type_tab[i] = NULL;
> > +}
> > +
> >  tree get_ix86_builtin (enum ix86_builtins c)
> >  {
> >return ix86_builtins[c];
> > @@ -1483,6 +1499,8 @@ ix86_init_builtins (void)
> >  {
> >tree ftype, decl;
> >
> > +  clear_builtin_types ();
> > +
> >ix86_init_builtin_types ();
> >
> >/* Builtins to get CPU type and features. */
>
> Please can one of the i386 maintainers check this?
> (CCing Jan and Uros: this is for the case where the compiler code runs
> multiple times in-process due to being linked into libgccjit.so.  We
> want to restore state within i386-builtins.cc to an initial state, and
> ensure that no GC-managed objects persist from previous in-memory
> compiles).

Can we rather introduce a TARGET_CLEANUP_BUILTINS hook and call it from
the JIT compiler at some appropriate time? IMO, this unnecessarily
burdens non-JIT compilation.
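
A minimal sketch of what I have in mind (the hook name and wiring are
assumptions; no such target macro exists today):

/* In i386-builtins.cc: cleanup callback, to be invoked only by the
   libgccjit driver between in-process compiles.  */
static void
ix86_cleanup_builtins (void)
{
  clear_builtin_types ();
}

/* In the i386.cc target hook initializers:  */
#undef TARGET_CLEANUP_BUILTINS
#define TARGET_CLEANUP_BUILTINS ix86_cleanup_builtins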

Uros.


Re: [PATCH v2] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-27 Thread Uros Bizjak
On Thu, Jun 27, 2024 at 9:01 AM Li, Pan2  wrote:
>
> I bet it only requires the backend to implement the standard name for
> the vector mode.

There are several standard names present for x86:
{ss,us}{add,sub}{v8qi,v16qi,v32qi,v64qi,v4hi,v8hi,v16hi,v32hi},
defined in sse.md:

(define_expand "3"
  [(set (match_operand:VI12_AVX2_AVX512BW 0 "register_operand")
(sat_plusminus:VI12_AVX2_AVX512BW
  (match_operand:VI12_AVX2_AVX512BW 1 "vector_operand")
  (match_operand:VI12_AVX2_AVX512BW 2 "vector_operand")))]
  "TARGET_SSE2 &&  && "
  "ix86_fixup_binary_operands_no_copy (, mode, operands);")

but all of these handle only 8 and 16 bit elements.

> How about a simpler one like below.
>
>   #define DEF_VEC_SAT_U_SUB_TRUNC_FMT_1(OUT_T, IN_T)   \
>   void __attribute__((noinline))   \
>   vec_sat_u_sub_trunc_##OUT_T##_fmt_1 (OUT_T *out, IN_T *op_1, IN_T y, \
>unsigned limit) \
>   {\
> unsigned i;\
> for (i = 0; i < limit; i++)\
>   {\
> IN_T x = op_1[i];  \
> out[i] = (OUT_T)(x >= y ? x - y : 0);  \
>   }\
>   }
>
> DEF_VEC_SAT_U_SUB_TRUNC_FMT_1(uint32_t, uint64_t);

I tried with:

DEF_VEC_SAT_U_SUB_TRUNC_FMT_1(uint8_t, uint16_t);
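
(For clarity — written out here, not in the original mail — that
instantiation expands to roughly this plain function:)

#include <stdint.h>

void __attribute__((noinline))
vec_sat_u_sub_trunc_uint8_t_fmt_1 (uint8_t *out, uint16_t *op_1,
                                   uint16_t y, unsigned limit)
{
  unsigned i;
  for (i = 0; i < limit; i++)
    {
      uint16_t x = op_1[i];
      out[i] = (uint8_t)(x >= y ? x - y : 0);
    }
}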

And the compiler was able to detect several .SAT_SUB patterns:

$ grep SAT_SUB pr51492-1.c.266t.optimized
 vect_patt_37.14_85 = .SAT_SUB (vect_x_13.12_81, vect_cst__84);
 vect_patt_37.14_86 = .SAT_SUB (vect_x_13.13_83, vect_cst__84);
 vect_patt_42.26_126 = .SAT_SUB (vect_x_62.24_122, vect_cst__125);
 vect_patt_42.26_127 = .SAT_SUB (vect_x_62.25_124, vect_cst__125);
 iftmp.0_24 = .SAT_SUB (x_3, y_14(D));

Uros.

>
> The riscv backend is able to detect a similar pattern, as below. I can
> help check the x86 side after running the test suites.
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   if (limit_11(D) != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> ;;succ:   3
> ;;5
> ;;   basic block 3, loop depth 0
> ;;pred:   2
>   vect_cst__71 = [vec_duplicate_expr] y_14(D);
>   _78 = (unsigned long) limit_11(D);
> ;;succ:   4
>
> ;;   basic block 4, loop depth 1
> ;;pred:   4
> ;;3
>   # vectp_op_1.7_68 = PHI 
>   # vectp_out.12_75 = PHI 
>   # ivtmp_79 = PHI 
>   _81 = .SELECT_VL (ivtmp_79, POLY_INT_CST [2, 2]);
>   ivtmp_67 = _81 * 8;
>   vect_x_13.9_70 = .MASK_LEN_LOAD (vectp_op_1.7_68, 64B, { -1, ... }, _81, 0);
>   vect_patt_48.10_72 = .SAT_SUB (vect_x_13.9_70, vect_cst__71); // .SAT_SUB pattern
>   vect_patt_49.11_73 = (vector([2,2]) unsigned int) vect_patt_48.10_72;
>   ivtmp_74 = _81 * 4;
>   .MASK_LEN_STORE (vectp_out.12_75, 32B, { -1, ... }, _81, 0, 
> vect_patt_49.11_73);
>   vectp_op_1.7_69 = vectp_op_1.7_68 + ivtmp_67;
>   vectp_out.12_76 = vectp_out.12_75 + ivtmp_74;
>   ivtmp_80 = ivtmp_79 - _81;
>
> riscv64-unknown-elf-gcc (GCC) 15.0.0 20240627 (experimental)
> Copyright (C) 2024 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
> Pan
>
> -Original Message-
> From: Uros Bizjak 
> Sent: Thursday, June 27, 2024 2:48 PM
> To: Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
> richard.guent...@gmail.com; jeffreya...@gmail.com; pins...@gmail.com
> Subject: Re: [PATCH v2] Vect: Support truncate after .SAT_SUB pattern in zip
>
> On Mon, Jun 24, 2024 at 3:55 PM  wrote:
> >
> > From: Pan Li 
> >
> > The zip benchmark of coremark-pro have one SAT_SUB like pattern but
> > truncated as below:
> >
> > void test (uint16_t *x, unsigned b, unsigned n)
> > {
> >   unsigned a = 0;
> >   register uint16_t *p = x;
> >
> >   do {
> > a = *--p;
> > *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
> >   } while (--n);
> > }
> >

No, the current compiler does not recognize .SAT_SUB for x86 with the
above code, although many vector sat sub instructions involving 16-bit
elements are present.

Uros.


Re: [PATCH] i386: Fix regression after refactoring legitimize_pe_coff_symbol, ix86_GOT_alias_set and PE_COFF_LEGITIMIZE_EXTERN_DECL

2024-06-27 Thread Uros Bizjak
On Thu, Jun 27, 2024 at 9:16 AM Evgeny Karpov
 wrote:
>
> Thank you for reporting the issues and discussing the root causes.
> It helped in preparing the patch.
>
> This patch fixes 3 bugs reported after merging
> the "Add DLL import/export implementation to AArch64" series.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653955.html
> The series refactors the i386 codebase to reuse it in AArch64, which
> triggers some bugs.
>
> Bug 115661 - [15 Regression] wrong code at -O{2,3} on
> x86_64-linux-gnu since r15-1599-g63512c72df09b4
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115661
>
> Bug 115635 - [15 regression] Bootstrap fails with failed
> self-test with the rust fe (diagnostic-path.cc:1153:
> test_empty_path: FAIL: ASSERT_FALSE
> ((path.interprocedural_p ( since r15-1599-g63512c72df09b4
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115635
>
> Issue 1. In some code, i386 has been relying on calling
> legitimize_pe_coff_symbol on all platforms, expecting it to return
> NULL_RTX when PECOFF is not supported.
>
> Fix: NULL_RTX handling has been added when the target does not
> support PECOFF.
>
> Issue 2. ix86_GOT_alias_set is used on all platforms and cannot be
> extracted to mingw.
>
> Fix: ix86_GOT_alias_set has been restored as it was and is used on
> all platforms for i386.
>
> Bug 115643 - [15 regression] aarch64-w64-mingw32 support today breaks
> x86_64-w64-mingw32 build cannot represent relocation type
> BFD_RELOC_64 since r15-1602-ged20feebd9ea31
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115643
>
> Issue 3. PE_COFF_EXTERN_DECL_SHOULD_BE_LEGITIMIZED has been added and used
> with a negation operator on a complex expression without braces.
>
> Fix: Braces have been added, and
> PE_COFF_EXTERN_DECL_SHOULD_BE_LEGITIMIZED has been renamed to
> PE_COFF_LEGITIMIZE_EXTERN_DECL.
>
>
> The patch has been attached as a text file because it contains special
> characters that are usually removed by the mail client.

> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 5dfa7d49f58..20adb42e17b 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -414,6 +414,10 @@ ix86_expand_move (machine_mode mode, rtx operands[])
>   {
>  #if TARGET_PECOFF
>tmp = legitimize_pe_coff_symbol (op1, addend != NULL_RTX);
> +#else
> +tmp = NULL_RTX;
> +#endif
> +
>if (tmp)
>  {
>op1 = tmp;
> @@ -425,7 +429,6 @@ ix86_expand_move (machine_mode mode, rtx operands[])
>op1 = operands[1];
>break;
>  }
> -#endif
>   }
>
>if (addend)

tmp can only be set by legitimize_pe_coff_symbol, so !TARGET_PECOFF
will always get to the "else" part. Do this change simply by moving
#endif, like the below:

--cut here--
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 5dfa7d49f58..407db6c215b 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -421,11 +421,11 @@ ix86_expand_move (machine_mode mode, rtx operands[])
   break;
   }
 else
+#endif
   {
 op1 = operands[1];
 break;
   }
-#endif
   }

  if (addend)
--cut here--

Side note, legitimize_pe_coff_symbol is always called from #if
TARGET_PECOFF, so:

rtx
legitimize_pe_coff_symbol (rtx addr, bool inreg)
{
  if (!TARGET_PECOFF)
return NULL_RTX;

should be removed or converted to gcc_assert.

> +alias_set_type
> +ix86_GOT_alias_set (void)
> +{
> +  static alias_set_type set = -1;

Please add a line of vertical space here.

> +  if (set == -1)
> +set = new_alias_set ();
> +  return set;

OK, but please allow RichardB to look at the alias_set changes.

Thanks,
Uros.


Re: [PATCH v2] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-26 Thread Uros Bizjak
On Mon, Jun 24, 2024 at 3:55 PM  wrote:
>
> From: Pan Li 
>
> The zip benchmark of coremark-pro have one SAT_SUB like pattern but
> truncated as below:
>
> void test (uint16_t *x, unsigned b, unsigned n)
> {
>   unsigned a = 0;
>   register uint16_t *p = x;
>
>   do {
> a = *--p;
> *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
>   } while (--n);
> }
>
> It will have the gimple below before the vect pass; it cannot hit any
> pattern of SAT_SUB and thus cannot vectorize to SAT_SUB.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = _18 ? iftmp.0_13 : 0;
>
> This patch improves the pattern match to recognize the above as a
> truncate after .SAT_SUB pattern.  Then we will have a pattern similar
> to the one below, as well as eliminate the first 3 dead stmts.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = (short unsigned int).SAT_SUB (a_11, b_12(D));
>
> The below tests are passed for this patch.
> 1. The rv64gcv fully regression tests.
> 2. The rv64gcv build with glibc.
> 3. The x86 bootstrap tests.
> 4. The x86 fully regression tests.

I have tried this patch with x86_64 on the testcase from PR51492, but
the compiler does not recognize the .SAT_SUB pattern here.

Is there anything else missing for successful detection?

Uros.

>
> gcc/ChangeLog:
>
> * match.pd: Add convert description for minus and capture.
> * tree-vect-patterns.cc (vect_recog_build_binary_gimple_call): Add
> new logic to handle when in_type is incompatible with out_type, as
> well as rename from.
> (vect_recog_build_binary_gimple_stmt): Rename to.
> (vect_recog_sat_add_pattern): Leverage above renamed func.
> (vect_recog_sat_sub_pattern): Ditto.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  |  4 +--
>  gcc/tree-vect-patterns.cc | 51 ---
>  2 files changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3d0689c9312..4a4b0b2e72f 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3164,9 +3164,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned saturation sub, case 2 (branch with ge):
> SAT_U_SUB = X >= Y ? X - Y : 0.  */
>  (match (unsigned_integer_sat_sub @0 @1)
> - (cond^ (ge @0 @1) (minus @0 @1) integer_zerop)
> + (cond^ (ge @0 @1) (convert? (minus (convert1? @0) (convert1? @1))) integer_zerop)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> -  && types_match (type, @0, @1
> +  && TYPE_UNSIGNED (TREE_TYPE (@0)) && types_match (@0, @1
>
>  /* Unsigned saturation sub, case 3 (branchless with gt):
> SAT_U_SUB = (X - Y) * (X > Y).  */
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index cef901808eb..3d887d36050 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -4490,26 +4490,37 @@ vect_recog_mult_pattern (vec_info *vinfo,
>  extern bool gimple_unsigned_integer_sat_add (tree, tree*, tree (*)(tree));
>  extern bool gimple_unsigned_integer_sat_sub (tree, tree*, tree (*)(tree));
>
> -static gcall *
> -vect_recog_build_binary_gimple_call (vec_info *vinfo, gimple *stmt,
> +static gimple *
> +vect_recog_build_binary_gimple_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>  internal_fn fn, tree *type_out,
> -tree op_0, tree op_1)
> +tree lhs, tree op_0, tree op_1)
>  {
>tree itype = TREE_TYPE (op_0);
> -  tree vtype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree otype = TREE_TYPE (lhs);
> +  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
>
> -  if (vtype != NULL_TREE
> -&& direct_internal_fn_supported_p (fn, vtype, OPTIMIZE_FOR_BOTH))
> +  if (v_itype != NULL_TREE && v_otype != NULL_TREE
> +&& direct_internal_fn_supported_p (fn, v_itype, OPTIMIZE_FOR_BOTH))
>  {
>gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
> +  tree in_ssa = vect_recog_temp_ssa_var (itype, NULL);
>
> -  gimple_call_set_lhs (call, vect_recog_temp_ssa_var (itype, NULL));
> +  gimple_call_set_lhs (call, in_ssa);
>gimple_call_set_nothrow (call, /* nothrow_p */ false);
> -  gimple_set_location (call, gimple_location (stmt));
> +  gimple_set_location (call, gimple_location (STMT_VINFO_STMT (stmt_info)));
> +
> +  *type_out = v_otype;
>
> -  *type_out = vtype;
> +  if (types_compatible_p (itype, otype))
> +   return call;
> +  else
> +   {
> + append_pattern_def_seq (vinfo, stmt_info, call, v_itype);
> + tree out_ssa = vect_recog_temp_ssa_var (otype, NULL);
>
> -  return call;
> + return gimple_build_assign (out_ssa, CONVERT_EXPR, in_ssa);
> +   }
>  }
>
>return NULL;
> @@ -4541,13 +4552,13 @@ vect_recog_sat

Re: [PATCH V2] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread Uros Bizjak
On Thu, Jun 27, 2024 at 5:57 AM liuhongt  wrote:
>
> > But rtx_cost invokes targetm.rtx_cost which allows to avoid that
> > recursive processing at any level.  You're dealing with MEM [addr]
> > here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
> > the best way to deal with this?  Since this is the MEM [addr] case
> > we know it's not LEA, no?
> The patch restricts the MEM rtx_cost reduction to register_operand + disp only.
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?

LGTM.

Thanks,
Uros.

>
>
> 416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
> The commit adjusted the rtx_cost of mem to reduce the cost of (add op0 disp).
> But the cost of ADDR could be cheaper than XEXP (addr, 0) when it's a lea.
> That is the case in the PR; the patch adjusts rtx_cost to only handle reg
> + disp. Other forms are basically all LEA, which doesn't have the
> additional cost of an ADD.
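
(Illustration of the distinction, mine rather than from the patch; the
comments show the expected -O2 code:)

int load_disp (int *p)          { return p[1000]; }     /* movl 4000(%rdi), %eax */
int load_index (int *p, long i) { return p[i + 1000]; } /* movl 4000(%rdi,%rsi,4), %eax */

Only the first form, a plain reg + constant displacement, pays for a
separate ADD; the second is formed by LEA-class addressing anyway.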
>
> gcc/ChangeLog:
>
> PR target/115462
> * config/i386/i386.cc (ix86_rtx_costs): Make cost of MEM (reg +
> disp) just a little bit more than MEM (reg).
>
> gcc/testsuite/ChangeLog:
> * gcc.target/i386/pr115462.c: New test.
> ---
>  gcc/config/i386/i386.cc  |  5 -
>  gcc/testsuite/gcc.target/i386/pr115462.c | 22 ++
>  2 files changed, 26 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr115462.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ccc24be6e..ef2a1e4f4f2 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22339,7 +22339,10 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> outer_code_i, int opno,
>  address_cost should be used, but it reduce cost too much.
>  So current solution is make constant disp as cheap as possible.  
> */
>   if (GET_CODE (addr) == PLUS
> - && x86_64_immediate_operand (XEXP (addr, 1), Pmode))
> + && x86_64_immediate_operand (XEXP (addr, 1), Pmode)
> +         /* Only handle (reg + disp) since other forms of addr are mostly LEA,
> +            there's no additional cost for the plus of disp.  */
> + && register_operand (XEXP (addr, 0), Pmode))
> {
>   *total += 1;
>   *total += rtx_cost (XEXP (addr, 0), Pmode, PLUS, 0, speed);
> diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c 
> b/gcc/testsuite/gcc.target/i386/pr115462.c
> new file mode 100644
> index 000..ad50a6382bc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr115462.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
> +/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 3 } } */
> +
> +int
> +foo (long indx, long indx2, long indx3, long indx4, long indx5, long indx6, 
> long n, int* q)
> +{
> +  static int p1[1];
> +  int* p2 = p1 + 1000;
> +  int* p3 = p1 + 4000;
> +  int* p4 = p1 + 8000;
> +
> +  for (long i = 0; i != n; i++)
> +{
> +      /* scan for  movl    %edi, p1.0+3996(,%rax,4),
> +         p1.0+3996 should be propagated into the loop.  */
> +  p2[indx++] = q[indx++];
> +  p3[indx2++] = q[indx2++];
> +  p4[indx3++] = q[indx3++];
> +}
> +  return p1[indx6] + p1[indx5];
> +}
> --
> 2.31.1
>


Re: [PATCH] i386: Fix some ISA bit test in option_override

2024-06-19 Thread Uros Bizjak
On Thu, Jun 20, 2024 at 3:16 AM Hongyu Wang  wrote:
>
> Hi,
>
> This patch adjusts several new feature checks in ix86_option_override_internal
> that directly use TARGET_* instead of TARGET_*_P (opts->ix86_isa_flags),
> which caused a cmdline option to override the target_attribute isa flag.
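
(An assumed minimal illustration of the scenario, in the spirit of the
apx-ccmp-2.c test below: compiled with -mno-apxf on the command line,
the per-function attribute should still take effect for this function.)

__attribute__((target("apxf"))) int
foo (int a, int b)
{
  return a > b ? a : b;
}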
>
> Bootstrapped && regtested on x86_64-pc-linux-gnu.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386-options.cc (ix86_option_override_internal):
> Use TARGET_*_P (opts->x_ix86_isa_flags*) instead of TARGET_*
> for UINTR, LAM and APX_F.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/apx-ccmp-2.c: Remove -mno-apxf in option.
> * gcc.target/i386/funcspec-56.inc: Drop uintr tests.
> * gcc.target/i386/funcspec-6.c: Add uintr tests.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-options.cc   | 14 +-
>  gcc/testsuite/gcc.target/i386/apx-ccmp-2.c|  2 +-
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |  2 --
>  gcc/testsuite/gcc.target/i386/funcspec-6.c|  2 ++
>  4 files changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index f2cecc0e254..34adedb3127 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -2113,15 +2113,18 @@ ix86_option_override_internal (bool main_args_p,
>opts->x_ix86_stringop_alg = no_stringop;
>  }
>
> -  if (TARGET_APX_F && !TARGET_64BIT)
> +  if (TARGET_APX_F_P (opts->x_ix86_isa_flags2)
> +  && !TARGET_64BIT_P (opts->x_ix86_isa_flags))
>  error ("%<-mapxf%> is not supported for 32-bit code");
> -  else if (opts->x_ix86_apx_features != apx_none && !TARGET_64BIT)
> +  else if (opts->x_ix86_apx_features != apx_none
> +  && !TARGET_64BIT_P (opts->x_ix86_isa_flags))
>  error ("%<-mapx-features=%> option is not supported for 32-bit code");
>
> -  if (TARGET_UINTR && !TARGET_64BIT)
> +  if (TARGET_UINTR_P (opts->x_ix86_isa_flags2)
> +  && !TARGET_64BIT_P (opts->x_ix86_isa_flags))
>  error ("%<-muintr%> not supported for 32-bit code");
>
> -  if (ix86_lam_type && !TARGET_LP64)
> +  if (ix86_lam_type && !TARGET_LP64_P (opts->x_ix86_isa_flags))
>  error ("%<-mlam=%> option: [u48|u57] not supported for 32-bit code");
>
>if (!opts->x_ix86_arch_string)
> @@ -2502,7 +2505,8 @@ ix86_option_override_internal (bool main_args_p,
>init_machine_status = ix86_init_machine_status;
>
>/* Override APX flag here if ISA bit is set.  */
> -  if (TARGET_APX_F && !OPTION_SET_P (ix86_apx_features))
> +  if (TARGET_APX_F_P (opts->x_ix86_isa_flags2)
> +  && !OPTION_SET_P (ix86_apx_features))
>  opts->x_ix86_apx_features = apx_all;
>
>/* Validate -mregparm= value.  */
> diff --git a/gcc/testsuite/gcc.target/i386/apx-ccmp-2.c 
> b/gcc/testsuite/gcc.target/i386/apx-ccmp-2.c
> index 4a0784394c3..192c0458728 100644
> --- a/gcc/testsuite/gcc.target/i386/apx-ccmp-2.c
> +++ b/gcc/testsuite/gcc.target/i386/apx-ccmp-2.c
> @@ -1,6 +1,6 @@
>  /* { dg-do run { target { ! ia32 } } } */
>  /* { dg-require-effective-target apxf } */
> -/* { dg-options "-O3 -mno-apxf" } */
> +/* { dg-options "-O3" } */
>
>  __attribute__((noinline, noclone, target("apxf")))
>  int foo_apx(int a, int b, int c, int d)
> diff --git a/gcc/testsuite/gcc.target/i386/funcspec-56.inc 
> b/gcc/testsuite/gcc.target/i386/funcspec-56.inc
> index 2a50f5bf67c..8825e88768a 100644
> --- a/gcc/testsuite/gcc.target/i386/funcspec-56.inc
> +++ b/gcc/testsuite/gcc.target/i386/funcspec-56.inc
> @@ -69,7 +69,6 @@ extern void test_avx512vp2intersect (void)
> __attribute__((__target__("avx512vp2i
>  extern void test_amx_tile (void)   
> __attribute__((__target__("amx-tile")));
>  extern void test_amx_int8 (void)   
> __attribute__((__target__("amx-int8")));
>  extern void test_amx_bf16 (void)   
> __attribute__((__target__("amx-bf16")));
> -extern void test_uintr (void)  
> __attribute__((__target__("uintr")));
>  extern void test_hreset (void) 
> __attribute__((__target__("hreset")));
>  extern void test_keylocker (void)  
> __attribute__((__target__("kl")));
>  extern void test_widekl (void) 
> __attribute__((__target__("widekl")));
> @@ -158,7 +157,6 @@ extern void test_no_avx512vp2intersect (void)   
> __attribute__((__target__("no-avx5
>  extern void test_no_amx_tile (void)
> __attribute__((__target__("no-amx-tile")));
>  extern void test_no_amx_int8 (void)
> __attribute__((__target__("no-amx-int8")));
>  extern void test_no_amx_bf16 (void)
> __attribute__((__target__("no-amx-bf16")));
> -extern void test_no_uintr (void)   
> __attribute__((__target__("no-uintr")));
>  extern void test_no_hreset (void)  
> __attribute__((__target__("no-hreset")));
>  extern void test_no_keylocker (void)   
> __attribute__((__target__("no-kl")));
>  extern void test_no

Re: [PATCH] [x86_64]: Zhaoxin shijidadao enablement

2024-06-19 Thread Uros Bizjak
On Tue, Jun 18, 2024 at 9:21 AM mayshao-oc  wrote:
>
>
>
> On 5/28/24 14:15, Uros Bizjak wrote:
> >
> >
> >
> > On Mon, May 27, 2024 at 10:33 AM MayShao  wrote:
> >>
> >> From: mayshao 
> >>
> >> Hi all:
> >>  This patch enables -march/-mtune=shijidadao, costs and tunings are 
> >> set according to the characteristics of the processor.
> >>
> >>  Bootstrapped/regtested on x86_64.
> >>
> >>  Ok for trunk?
> >
> > OK.
> >
> > Thanks,
> > Uros.
>
> Thanks for your review, please help me commit.

Done, committed as r15-1454 [1].

[1] https://gcc.gnu.org/pipermail/gcc-cvs/2024-June/404474.html

Thanks,
Uros.


Re: [PATCH] [APX CCMP] Use ctestcc when comparing to const 0

2024-06-12 Thread Uros Bizjak
On Thu, Jun 13, 2024 at 3:44 AM Hongyu Wang  wrote:
>
> Thanks for the advice, updated patch in attachment.
>
> Bootstrapped/regtested on x86-64-pc-linux-gnu. Ok for trunk?
>
> Uros Bizjak  wrote on Wednesday, June 12, 2024 at 18:12:
> >
> > On Wed, Jun 12, 2024 at 12:00 PM Uros Bizjak  wrote:
> > >
> > > On Wed, Jun 12, 2024 at 5:12 AM Hongyu Wang  wrote:
> > > >
> > > > Hi,
> > > >
> > > > For CTEST, we don't have conditional AND so there's no optimization
> > > > opportunity to write a new ctest pattern. Emit ctest when ccmp did
> > > > comparison to const 0 to save bytes.
> > > >
> > > > Bootstrapped & regtested under x86-64-pc-linux-gnu.
> > > >
> > > > Ok for trunk?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * config/i386/i386.md (@ccmp): Use ctestcc when
> > > > operands[3] is const0_rtx.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > * gcc.target/i386/apx-ccmp-1.c: Adjust output to scan ctest.
> > > > * gcc.target/i386/apx-ccmp-2.c: Adjust some condition to
> > > > compare with 0.

LGTM.

+  (minus:SWI (match_operand:SWI 2 "nonimmediate_operand" ",m,")
+ (match_operand:SWI 3 "" "C,,"))

Perhaps the constraint can be slightly optimized to avoid repeating
(,) pairs.

",m,"
"C  ,,"

Uros.

> > > > ---
> > > >  gcc/config/i386/i386.md|  6 +-
> > > >  gcc/testsuite/gcc.target/i386/apx-ccmp-1.c | 10 ++
> > > >  gcc/testsuite/gcc.target/i386/apx-ccmp-2.c |  4 ++--
> > > >  3 files changed, 13 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > > > index a64f2ad4f5f..014d48cddd6 100644
> > > > --- a/gcc/config/i386/i386.md
> > > > +++ b/gcc/config/i386/i386.md
> > > > @@ -1522,7 +1522,11 @@ (define_insn "@ccmp"
> > > >   [(match_operand:SI 4 "const_0_to_15_operand")]
> > > >   UNSPEC_APX_DFV)))]
> > > >   "TARGET_APX_CCMP"
> > > > - "ccmp%C1{}\t%G4 {%3, %2|%2, %3}"
> > > > + {
> > > > +   if (operands[3] == const0_rtx && !MEM_P (operands[2]))
> > > > + return "ctest%C1{}\t%G4 %2, %2";
> > > > +   return "ccmp%C1{}\t%G4 {%3, %2|%2, %3}";
> > > > + }
> > >
> > > This could be implemented as an alternative using "r,C" constraint as
> > > the first constraint for operands[2,3]. Then the register allocator
> > > will match the constraints for you.
> >
> > Like in the attached (lightly tested) patch.
> >
> > Uros.


Re: [PATCH] [APX CCMP] Use ctestcc when comparing to const 0

2024-06-12 Thread Uros Bizjak
On Wed, Jun 12, 2024 at 12:00 PM Uros Bizjak  wrote:
>
> On Wed, Jun 12, 2024 at 5:12 AM Hongyu Wang  wrote:
> >
> > Hi,
> >
> > For CTEST, we don't have conditional AND so there's no optimization
> > opportunity to write a new ctest pattern. Emit ctest when ccmp did
> > comparison to const 0 to save bytes.
> >
> > Bootstrapped & regtested under x86-64-pc-linux-gnu.
> >
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386.md (@ccmp): Use ctestcc when
> > operands[3] is const0_rtx.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/apx-ccmp-1.c: Adjust output to scan ctest.
> > * gcc.target/i386/apx-ccmp-2.c: Adjust some condition to
> > compare with 0.
> > ---
> >  gcc/config/i386/i386.md|  6 +-
> >  gcc/testsuite/gcc.target/i386/apx-ccmp-1.c | 10 ++
> >  gcc/testsuite/gcc.target/i386/apx-ccmp-2.c |  4 ++--
> >  3 files changed, 13 insertions(+), 7 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index a64f2ad4f5f..014d48cddd6 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -1522,7 +1522,11 @@ (define_insn "@ccmp"
> >   [(match_operand:SI 4 "const_0_to_15_operand")]
> >   UNSPEC_APX_DFV)))]
> >   "TARGET_APX_CCMP"
> > - "ccmp%C1{}\t%G4 {%3, %2|%2, %3}"
> > + {
> > +   if (operands[3] == const0_rtx && !MEM_P (operands[2]))
> > + return "ctest%C1{}\t%G4 %2, %2";
> > +   return "ccmp%C1{}\t%G4 {%3, %2|%2, %3}";
> > + }
>
> This could be implemented as an alternative using "r,C" constraint as
> the first constraint for operands[2,3]. Then the register allocator
> will match the constraints for you.

Like in the attached (lightly tested) patch.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index a64f2ad4f5f..14d4d8cddca 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1515,14 +1515,17 @@ (define_insn "@ccmp"
 (match_operator 1 "comparison_operator"
  [(reg:CC FLAGS_REG) (const_int 0)])
(compare:CC
- (minus:SWI (match_operand:SWI 2 "nonimmediate_operand" "m,")
-(match_operand:SWI 3 "" ","))
+ (minus:SWI (match_operand:SWI 2 "nonimmediate_operand" ",m,")
+            (match_operand:SWI 3 "" "C,,"))
  (const_int 0))
(unspec:SI
  [(match_operand:SI 4 "const_0_to_15_operand")]
  UNSPEC_APX_DFV)))]
  "TARGET_APX_CCMP"
- "ccmp%C1{}\t%G4 {%3, %2|%2, %3}"
+ "@
+  ctest%C1{}\t%G4 %2, %2
+  ccmp%C1{}\t%G4 {%3, %2|%2, %3}
+  ccmp%C1{}\t%G4 {%3, %2|%2, %3}"
  [(set_attr "type" "icmp")
   (set_attr "mode" "")
   (set_attr "length_immediate" "1")


Re: [PATCH] [APX CCMP] Use ctestcc when comparing to const 0

2024-06-12 Thread Uros Bizjak
On Wed, Jun 12, 2024 at 5:12 AM Hongyu Wang  wrote:
>
> Hi,
>
> For CTEST, we don't have conditional AND so there's no optimization
> opportunity to write a new ctest pattern. Emit ctest when ccmp did
> comparison to const 0 to save bytes.
>
> Bootstrapped & regtested under x86-64-pc-linux-gnu.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.md (@ccmp): Use ctestcc when
> operands[3] is const0_rtx.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/apx-ccmp-1.c: Adjust output to scan ctest.
> * gcc.target/i386/apx-ccmp-2.c: Adjust some condition to
> compare with 0.
> ---
>  gcc/config/i386/i386.md|  6 +-
>  gcc/testsuite/gcc.target/i386/apx-ccmp-1.c | 10 ++
>  gcc/testsuite/gcc.target/i386/apx-ccmp-2.c |  4 ++--
>  3 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index a64f2ad4f5f..014d48cddd6 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -1522,7 +1522,11 @@ (define_insn "@ccmp"
>   [(match_operand:SI 4 "const_0_to_15_operand")]
>   UNSPEC_APX_DFV)))]
>   "TARGET_APX_CCMP"
> - "ccmp%C1{}\t%G4 {%3, %2|%2, %3}"
> + {
> +   if (operands[3] == const0_rtx && !MEM_P (operands[2]))
> + return "ctest%C1{}\t%G4 %2, %2";
> +   return "ccmp%C1{}\t%G4 {%3, %2|%2, %3}";
> + }

This could be implemented as an alternative using "r,C" constraint as
the first constraint for operands[2,3]. Then the register allocator
will match the constraints for you.

Uros.

>   [(set_attr "type" "icmp")
>(set_attr "mode" "")
>(set_attr "length_immediate" "1")
> diff --git a/gcc/testsuite/gcc.target/i386/apx-ccmp-1.c 
> b/gcc/testsuite/gcc.target/i386/apx-ccmp-1.c
> index e4e112f07e0..a8b70576760 100644
> --- a/gcc/testsuite/gcc.target/i386/apx-ccmp-1.c
> +++ b/gcc/testsuite/gcc.target/i386/apx-ccmp-1.c
> @@ -96,9 +96,11 @@ f15 (double a, double b, int c, int d)
>
>  /* { dg-final { scan-assembler-times "ccmpg" 2 } } */
>  /* { dg-final { scan-assembler-times "ccmple" 2 } } */
> -/* { dg-final { scan-assembler-times "ccmpne" 4 } } */
> -/* { dg-final { scan-assembler-times "ccmpe" 3 } } */
> +/* { dg-final { scan-assembler-times "ccmpne" 2 } } */
> +/* { dg-final { scan-assembler-times "ccmpe" 1 } } */
>  /* { dg-final { scan-assembler-times "ccmpbe" 1 } } */
> +/* { dg-final { scan-assembler-times "ctestne" 2 } } */
> +/* { dg-final { scan-assembler-times "cteste" 2 } } */
>  /* { dg-final { scan-assembler-times "ccmpa" 1 } } */
> -/* { dg-final { scan-assembler-times "ccmpbl" 2 } } */
> -
> +/* { dg-final { scan-assembler-times "ccmpbl" 1 } } */
> +/* { dg-final { scan-assembler-times "ctestbl" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/i386/apx-ccmp-2.c 
> b/gcc/testsuite/gcc.target/i386/apx-ccmp-2.c
> index 0123a686d2c..4a0784394c3 100644
> --- a/gcc/testsuite/gcc.target/i386/apx-ccmp-2.c
> +++ b/gcc/testsuite/gcc.target/i386/apx-ccmp-2.c
> @@ -12,7 +12,7 @@ int foo_apx(int a, int b, int c, int d)
>c += d;
>a += b;
>sum += a + c;
> -  if (b != d && sum < c || sum > d)
> +  if (b > d && sum != 0 || sum > d)
> {
>   b += d;
>   sum += b;
> @@ -32,7 +32,7 @@ int foo_noapx(int a, int b, int c, int d)
>c += d;
>a += b;
>sum += a + c;
> -  if (b != d && sum < c || sum > d)
> +  if (b > d && sum != 0 || sum > d)
> {
>   b += d;
>   sum += b;
> --
> 2.31.1
>


Re: [PATCH] rust: Do not link with libdl and libpthread unconditionally

2024-06-12 Thread Uros Bizjak
On Tue, Jun 11, 2024 at 11:21 AM Arthur Cohen  wrote:
>
> Thanks Richi!
>
> Tested again and pushed on trunk.


This patch introduced a couple of errors during ./configure:

checking for library containing dlopen... none required
checking for library containing pthread_create... none required
/git/gcc/configure: line 8997: test: too many arguments
/git/gcc/configure: line 8999: test: too many arguments
/git/gcc/configure: line 9003: test: too many arguments
/git/gcc/configure: line 9005: test: =: unary operator expected

You have to wrap the arguments of test in double quotes.

Uros.

> Best,
>
> Arthur
>
> On 5/31/24 15:02, Richard Biener wrote:
> > On Fri, May 31, 2024 at 12:24 PM Arthur Cohen  
> > wrote:
> >>
> >> Hi Richard,
> >>
> >> On 4/30/24 09:55, Richard Biener wrote:
> >>> On Fri, Apr 19, 2024 at 11:49 AM Arthur Cohen  
> >>> wrote:
> 
>  Hi everyone,
> 
>  This patch checks for the presence of dlopen and pthread_create in libc. 
>  If that is not the
>  case, we check for the existence of -ldl and -lpthread, as these 
>  libraries are required to
>  link the Rust runtime to our Rust frontend.
> 
>  If these libs are not present on the system, then we disable the Rust 
>  frontend.
> 
>  This was tested on x86_64, in an environment with a recent GLIBC and in 
>  a container with GLIBC
>  2.27.
> 
>  Apologies for sending it in so late.
> >>>
> >>> For example GCC_ENABLE_PLUGINS simply does
> >>>
> >>># Check -ldl
> >>>saved_LIBS="$LIBS"
> >>>AC_SEARCH_LIBS([dlopen], [dl])
> >>>if test x"$ac_cv_search_dlopen" = x"-ldl"; then
> >>>  pluginlibs="$pluginlibs -ldl"
> >>>fi
> >>>LIBS="$saved_LIBS"
> >>>
> >>> which I guess would also work for pthread_create?  This would simplify
> >>> the code a bit.
> >>
> >> Thanks a lot for the review. I've udpated the patch's content in
> >> configure.ac per your suggestion. Tested similarly on x86_64 and in a
> >> container with libc 2.27
> >
> > LGTM.
> >
> > Thanks,
> > Richard.
> >
> >>   From 00669b600a75743523c358ee41ab999b6e9fa0f6 Mon Sep 17 00:00:00 2001
> >> From: Arthur Cohen 
> >> Date: Fri, 12 Apr 2024 13:52:18 +0200
> >> Subject: [PATCH] rust: Do not link with libdl and libpthread 
> >> unconditionally
> >>
> >> ChangeLog:
> >>
> >>  * Makefile.tpl: Add CRAB1_LIBS variable.
> >>  * Makefile.in: Regenerate.
> >>  * configure: Regenerate.
> >>  * configure.ac: Check if -ldl and -lpthread are needed, and if 
> >> so, add
> >>  them to CRAB1_LIBS.
> >>
> >> gcc/rust/ChangeLog:
> >>
> >>  * Make-lang.in: Remove overazealous LIBS = -ldl -lpthread line, 
> >> link
> >>  crab1 against CRAB1_LIBS.
> >> ---
> >>Makefile.in   |   3 +
> >>Makefile.tpl  |   3 +
> >>configure | 154 ++
> >>configure.ac  |  41 +++
> >>gcc/rust/Make-lang.in |   6 +-
> >>5 files changed, 203 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/Makefile.in b/Makefile.in
> >> index edb0c8a9a42..1753fb6b862 100644
> >> --- a/Makefile.in
> >> +++ b/Makefile.in
> >> @@ -197,6 +197,7 @@ HOST_EXPORTS = \
> >>  $(BASE_EXPORTS) \
> >>  CC="$(CC)"; export CC; \
> >>  ADA_CFLAGS="$(ADA_CFLAGS)"; export ADA_CFLAGS; \
> >> +   CRAB1_LIBS="$(CRAB1_LIBS)"; export CRAB1_LIBS; \
> >>  CFLAGS="$(CFLAGS)"; export CFLAGS; \
> >>  CONFIG_SHELL="$(SHELL)"; export CONFIG_SHELL; \
> >>  CXX="$(CXX)"; export CXX; \
> >> @@ -450,6 +451,8 @@ GOCFLAGS = $(CFLAGS)
> >>GDCFLAGS = @GDCFLAGS@
> >>GM2FLAGS = $(CFLAGS)
> >>
> >> +CRAB1_LIBS = @CRAB1_LIBS@
> >> +
> >>PKG_CONFIG_PATH = @PKG_CONFIG_PATH@
> >>
> >>GUILE = guile
> >> diff --git a/Makefile.tpl b/Makefile.tpl
> >> index adbcbdd1d57..4aeaad3c1a5 100644
> >> --- a/Makefile.tpl
> >> +++ b/Makefile.tpl
> >> @@ -200,6 +200,7 @@ HOST_EXPORTS = \
> >>  $(BASE_EXPORTS) \
> >>  CC="$(CC)"; export CC; \
> >>  ADA_CFLAGS="$(ADA_CFLAGS)"; export ADA_CFLAGS; \
> >> +   CRAB1_LIBS="$(CRAB1_LIBS)"; export CRAB1_LIBS; \
> >>  CFLAGS="$(CFLAGS)"; export CFLAGS; \
> >>  CONFIG_SHELL="$(SHELL)"; export CONFIG_SHELL; \
> >>  CXX="$(CXX)"; export CXX; \
> >> @@ -453,6 +454,8 @@ GOCFLAGS = $(CFLAGS)
> >>GDCFLAGS = @GDCFLAGS@
> >>GM2FLAGS = $(CFLAGS)
> >>
> >> +CRAB1_LIBS = @CRAB1_LIBS@
> >> +
> >>PKG_CONFIG_PATH = @PKG_CONFIG_PATH@
> >>
> >>GUILE = guile
> >> diff --git a/configure b/configure
> >> index 02b435c1163..a9ea5258f0f 100755
> >> --- a/configure
> >> +++ b/configure
> >> @@ -690,6 +690,7 @@ extra_host_zlib_configure_flags
> >>extra_host_libiberty_configure_flags
> >>stage1_languages
> >>host_libs_picflag
> >> +CRAB1_LIBS
> >>PICFLAG
> >>host_shared
> >>gcc_host_pie
> >> @@ -8826,6 +8827,139 @@ fi
> >>
> >>
> >>
> >> +# Rust req

[committed] i386: Use CMOV in .SAT_{ADD|SUB} expansion for TARGET_CMOV [PR112600]

2024-06-11 Thread Uros Bizjak
For TARGET_CMOV targets emit insn sequence involving conditional move.

.SAT_ADD:

        addl    %esi, %edi
        movl    $-1, %eax
        cmovnc  %edi, %eax
        ret

.SAT_SUB:

        subl    %esi, %edi
        movl    $0, %eax
        cmovnc  %edi, %eax
        ret

PR target/112600

gcc/ChangeLog:

* config/i386/i386.md (usadd3): Emit insn sequence
involving conditional move for TARGET_CMOVE targets.
(ussub3): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr112600-a.c: Also scan for cmov.
* gcc.target/i386/pr112600-b.c: Ditto.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index d69bc8d6e48..a64f2ad4f5f 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -9885,13 +9885,35 @@ (define_expand "usadd3"
   ""
 {
   rtx res = gen_reg_rtx (mode);
-  rtx msk = gen_reg_rtx (mode);
   rtx dst;
 
   emit_insn (gen_add3_cc_overflow_1 (res, operands[1], operands[2]));
-  emit_insn (gen_x86_movcc_0_m1_neg (msk));
-  dst = expand_simple_binop (mode, IOR, res, msk,
-operands[0], 1, OPTAB_WIDEN);
+
+  if (TARGET_CMOVE)
+{
+  rtx cmp = gen_rtx_GEU (VOIDmode, gen_rtx_REG (CCCmode, FLAGS_REG),
+const0_rtx);
+
+  if ( < GET_MODE_SIZE (SImode))
+   {
+ dst = force_reg (mode, operands[0]);
+ emit_insn (gen_movsicc (gen_lowpart (SImode, dst), cmp,
+ gen_lowpart (SImode, res), constm1_rtx));
+   }
+   else
+   {
+ dst = operands[0];
+ emit_insn (gen_movcc (dst, cmp, res, constm1_rtx));
+   }
+}
+  else
+{
+  rtx msk = gen_reg_rtx (mode);
+
+  emit_insn (gen_x86_movcc_0_m1_neg (msk));
+  dst = expand_simple_binop (mode, IOR, res, msk,
+operands[0], 1, OPTAB_WIDEN);
+}
 
   if (!rtx_equal_p (dst, operands[0]))
 emit_move_insn (operands[0], dst);
@@ -9905,14 +9927,36 @@ (define_expand "ussub3"
   ""
 {
   rtx res = gen_reg_rtx (mode);
-  rtx msk = gen_reg_rtx (mode);
   rtx dst;
 
   emit_insn (gen_sub_3 (res, operands[1], operands[2]));
-  emit_insn (gen_x86_movcc_0_m1_neg (msk));
-  msk = expand_simple_unop (mode, NOT, msk, NULL, 1);
-  dst = expand_simple_binop (mode, AND, res, msk,
-operands[0], 1, OPTAB_WIDEN);
+
+  if (TARGET_CMOVE)
+{
+  rtx cmp = gen_rtx_GEU (VOIDmode, gen_rtx_REG (CCCmode, FLAGS_REG),
+const0_rtx);
+
+  if ( < GET_MODE_SIZE (SImode))
+   {
+ dst = force_reg (mode, operands[0]);
+ emit_insn (gen_movsicc (gen_lowpart (SImode, dst), cmp,
+ gen_lowpart (SImode, res), const0_rtx));
+   }
+   else
+   {
+ dst = operands[0];
+ emit_insn (gen_movcc (dst, cmp, res, const0_rtx));
+   }
+}
+  else
+{
+  rtx msk = gen_reg_rtx (mode);
+
+  emit_insn (gen_x86_movcc_0_m1_neg (msk));
+  msk = expand_simple_unop (mode, NOT, msk, NULL, 1);
+  dst = expand_simple_binop (mode, AND, res, msk,
+operands[0], 1, OPTAB_WIDEN);
+}
 
   if (!rtx_equal_p (dst, operands[0]))
 emit_move_insn (operands[0], dst);
diff --git a/gcc/testsuite/gcc.target/i386/pr112600-a.c 
b/gcc/testsuite/gcc.target/i386/pr112600-a.c
index fa122bc7a3f..2b084860451 100644
--- a/gcc/testsuite/gcc.target/i386/pr112600-a.c
+++ b/gcc/testsuite/gcc.target/i386/pr112600-a.c
@@ -1,7 +1,7 @@
 /* PR target/112600 */
 /* { dg-do compile } */
 /* { dg-options "-O2" } */
-/* { dg-final { scan-assembler-times "sbb" 4 } } */
+/* { dg-final { scan-assembler-times "sbb|cmov" 4 } } */
 
 unsigned char
 add_sat_char (unsigned char x, unsigned char y)
diff --git a/gcc/testsuite/gcc.target/i386/pr112600-b.c 
b/gcc/testsuite/gcc.target/i386/pr112600-b.c
index ea14bb9738b..ac4e26423b6 100644
--- a/gcc/testsuite/gcc.target/i386/pr112600-b.c
+++ b/gcc/testsuite/gcc.target/i386/pr112600-b.c
@@ -1,7 +1,7 @@
 /* PR target/112600 */
 /* { dg-do compile } */
 /* { dg-options "-O2" } */
-/* { dg-final { scan-assembler-times "sbb" 4 } } */
+/* { dg-final { scan-assembler-times "sbb|cmov" 4 } } */
 
 unsigned char
 sub_sat_char (unsigned char x, unsigned char y)


[committed] i386: Implement .SAT_SUB for unsigned scalar integers [PR112600]

2024-06-09 Thread Uros Bizjak
The following testcase:

unsigned
sub_sat (unsigned x, unsigned y)
{
  unsigned res;
  res = x - y;
  res &= -(x >= y);
  return res;
}

currently compiles (-O2) to:

sub_sat:
        movl    %edi, %edx
        xorl    %eax, %eax
        subl    %esi, %edx
        cmpl    %esi, %edi
        setnb   %al
        negl    %eax
        andl    %edx, %eax
        ret

We can expand through ussub{m}3 optab to use carry flag from the subtraction
and generate code using SBB instruction implementing:

unsigned res = x - y;
res &= ~(-(x < y));

sub_sat:
        subl    %esi, %edi
        sbbl    %eax, %eax
        notl    %eax
        andl    %edi, %eax
        ret

PR target/112600

gcc/ChangeLog:

* config/i386/i386.md (ussub3): New expander.
(sub_3): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr112600-b.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index bc2ef819df6..d69bc8d6e48 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -8436,6 +8436,14 @@ (define_expand "usubv4"
   "ix86_fixup_binary_operands_no_copy (MINUS, mode, operands,
   TARGET_APX_NDD);")
 
+(define_expand "sub_3"
+  [(parallel [(set (reg:CC FLAGS_REG)
+  (compare:CC
+(match_operand:SWI 1 "nonimmediate_operand")
+(match_operand:SWI 2 "")))
+ (set (match_operand:SWI 0 "register_operand")
+  (minus:SWI (match_dup 1) (match_dup 2)))])])
+
 (define_insn "*sub_3"
   [(set (reg FLAGS_REG)
(compare (match_operand:SWI 1 "nonimmediate_operand" "0,0,rm,r")
@@ -9883,7 +9891,28 @@ (define_expand "usadd3"
   emit_insn (gen_add3_cc_overflow_1 (res, operands[1], operands[2]));
   emit_insn (gen_x86_movcc_0_m1_neg (msk));
   dst = expand_simple_binop (mode, IOR, res, msk,
-operands[0], 1, OPTAB_DIRECT);
+operands[0], 1, OPTAB_WIDEN);
+
+  if (!rtx_equal_p (dst, operands[0]))
+emit_move_insn (operands[0], dst);
+  DONE;
+})
+
+(define_expand "ussub3"
+  [(set (match_operand:SWI 0 "register_operand")
+   (us_minus:SWI (match_operand:SWI 1 "register_operand")
+ (match_operand:SWI 2 "")))]
+  ""
+{
+  rtx res = gen_reg_rtx (mode);
+  rtx msk = gen_reg_rtx (mode);
+  rtx dst;
+
+  emit_insn (gen_sub_3 (res, operands[1], operands[2]));
+  emit_insn (gen_x86_movcc_0_m1_neg (msk));
+  msk = expand_simple_unop (mode, NOT, msk, NULL, 1);
+  dst = expand_simple_binop (mode, AND, res, msk,
+operands[0], 1, OPTAB_WIDEN);
 
   if (!rtx_equal_p (dst, operands[0]))
 emit_move_insn (operands[0], dst);


Re: [committed] i386: Implement .SAT_ADD for unsigned scalar integers [PR112600]

2024-06-08 Thread Uros Bizjak
On Sat, Jun 8, 2024 at 2:09 PM Gerald Pfeifer  wrote:
>
> On Sat, 8 Jun 2024, Uros Bizjak wrote:
> > gcc/ChangeLog:
> >
> > * config/i386/i386.md (usadd3): New expander.
> > (x86_movcc_0_m1_neg): Use SWI mode iterator.
>
> When you write "committed", did you actually push?

Yes, IIRC, the request was to mark a pushed change with the word "committed".

> If so, us being on Git now it might be good to adjust terminology.

No problem, I can say "pushed" if that is more descriptive.

Thanks,
Uros.


[committed] i386: Implement .SAT_ADD for unsigned scalar integers [PR112600]

2024-06-08 Thread Uros Bizjak
The following testcase:

unsigned
add_sat(unsigned x, unsigned y)
{
unsigned z;
return __builtin_add_overflow(x, y, &z) ? -1u : z;
}

currently compiles (-O2) to:

add_sat:
        addl    %esi, %edi
        jc      .L3
        movl    %edi, %eax
        ret
.L3:
        orl     $-1, %eax
        ret

We can expand through usadd{m}3 optab to use carry flag from the addition
and generate branchless code using SBB instruction implementing:

unsigned res = x + y;
res |= -(res < x);

add_sat:
        addl    %esi, %edi
        sbbl    %eax, %eax
        orl     %edi, %eax
        ret

PR target/112600

gcc/ChangeLog:

* config/i386/i386.md (usadd3): New expander.
(x86_movcc_0_m1_neg): Use SWI mode iterator.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr112600-a.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index ffcf63e1cba..bc2ef819df6 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -9870,6 +9870,26 @@ (define_insn_and_split "*sub3_ne_0"
 operands[1] = force_reg (mode, operands[1]);
 })
 
+(define_expand "usadd3"
+  [(set (match_operand:SWI 0 "register_operand")
+   (us_plus:SWI (match_operand:SWI 1 "register_operand")
+(match_operand:SWI 2 "")))]
+  ""
+{
+  rtx res = gen_reg_rtx (mode);
+  rtx msk = gen_reg_rtx (mode);
+  rtx dst;
+
+  emit_insn (gen_add3_cc_overflow_1 (res, operands[1], operands[2]));
+  emit_insn (gen_x86_movcc_0_m1_neg (msk));
+  dst = expand_simple_binop (mode, IOR, res, msk,
+operands[0], 1, OPTAB_DIRECT);
+
+  if (!rtx_equal_p (dst, operands[0]))
+emit_move_insn (operands[0], dst);
+  DONE;
+})
+
 ;; The patterns that match these are at the end of this file.
 
 (define_expand "xf3"
@@ -24945,8 +24965,8 @@ (define_insn "*x86_movcc_0_m1_neg"
 
 (define_expand "x86_movcc_0_m1_neg"
   [(parallel
-[(set (match_operand:SWI48 0 "register_operand")
- (neg:SWI48 (ltu:SWI48 (reg:CCC FLAGS_REG) (const_int 0
+[(set (match_operand:SWI 0 "register_operand")
+ (neg:SWI (ltu:SWI (reg:CCC FLAGS_REG) (const_int 0
  (clobber (reg:CC FLAGS_REG))])])
 
 (define_split
diff --git a/gcc/testsuite/gcc.target/i386/pr112600-a.c 
b/gcc/testsuite/gcc.target/i386/pr112600-a.c
new file mode 100644
index 000..fa122bc7a3f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr112600-a.c
@@ -0,0 +1,32 @@
+/* PR target/112600 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-final { scan-assembler-times "sbb" 4 } } */
+
+unsigned char
+add_sat_char (unsigned char x, unsigned char y)
+{
+  unsigned char z;
+  return __builtin_add_overflow(x, y, &z) ? -1u : z;
+}
+
+unsigned short
+add_sat_short (unsigned short x, unsigned short y)
+{
+  unsigned short z;
+  return __builtin_add_overflow(x, y, &z) ? -1u : z;
+}
+
+unsigned int
+add_sat_int (unsigned int x, unsigned int y)
+{
+  unsigned int z;
+  return __builtin_add_overflow(x, y, &z) ? -1u : z;
+}
+
+unsigned long
+add_sat_long (unsigned long x, unsigned long y)
+{
+  unsigned long z;
+  return __builtin_add_overflow(x, y, &z) ? -1ul : z;
+}


Re: [PATCH v2 2/6] Extract ix86 dllimport implementation to mingw

2024-06-07 Thread Uros Bizjak
On Fri, Jun 7, 2024 at 11:48 AM Evgeny Karpov
 wrote:
>
> This patch extracts the ix86 implementation for expanding a SYMBOL
> into its corresponding dllimport, far-address, or refptr symbol.
> It will be reused in the aarch64-w64-mingw32 target.
> The implementation is copied as is from i386/i386.cc with
> minor changes to follow to the code style.
>
> Also this patch replaces the original DLL import/export
> implementation in ix86 with mingw.
>
> gcc/ChangeLog:
>
> * config.gcc: Add winnt-dll.o, which contains the DLL
> import/export implementation.
> * config/i386/cygming.h (SUB_TARGET_RECORD_STUB): Remove the
> old implementation. Rename the required function to MinGW.
> Use MinGW implementation for COFF and nothing otherwise.
> (GOT_ALIAS_SET): Likewise.
> * config/i386/i386-expand.cc (ix86_expand_move): Likewise.
> * config/i386/i386-expand.h (ix86_GOT_alias_set): Likewise.
> (legitimize_pe_coff_symbol): Likewise.
> * config/i386/i386-protos.h (i386_pe_record_stub): Likewise.
> * config/i386/i386.cc (is_imported_p): Likewise.
> (legitimate_pic_address_disp_p): Likewise.
> (ix86_GOT_alias_set): Likewise.
> (legitimize_pic_address): Likewise.
> (legitimize_tls_address): Likewise.
> (struct dllimport_hasher): Likewise.
> (GTY): Likewise.
> (get_dllimport_decl): Likewise.
> (legitimize_pe_coff_extern_decl): Likewise.
> (legitimize_dllimport_symbol): Likewise.
> (legitimize_pe_coff_symbol): Likewise.
> (ix86_legitimize_address): Likewise.
> * config/i386/i386.h (GOT_ALIAS_SET): Likewise.
> * config/mingw/winnt.cc (i386_pe_record_stub): Likewise.
> (mingw_pe_record_stub): Likewise.
> * config/mingw/winnt.h (mingw_pe_record_stub): Likewise.
> * config/mingw/t-cygming: Add the winnt-dll.o compilation.
> * config/mingw/winnt-dll.cc: New file.
> * config/mingw/winnt-dll.h: New file.

LGTM for generic x86 changes.

Thanks,
Uros.

> ---
>  gcc/config.gcc |  12 +-
>  gcc/config/i386/cygming.h  |   5 +-
>  gcc/config/i386/i386-expand.cc |   4 +-
>  gcc/config/i386/i386-expand.h  |   2 -
>  gcc/config/i386/i386-protos.h  |   1 -
>  gcc/config/i386/i386.cc| 205 ++---
>  gcc/config/i386/i386.h |   2 +
>  gcc/config/mingw/t-cygming |   6 +
>  gcc/config/mingw/winnt-dll.cc  | 231 +
>  gcc/config/mingw/winnt-dll.h   |  30 +
>  gcc/config/mingw/winnt.cc  |   2 +-
>  gcc/config/mingw/winnt.h   |   1 +
>  12 files changed, 298 insertions(+), 203 deletions(-)
>  create mode 100644 gcc/config/mingw/winnt-dll.cc
>  create mode 100644 gcc/config/mingw/winnt-dll.h
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 553a310f4bd..d053b98efa8 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -2177,11 +2177,13 @@ i[4567]86-wrs-vxworks*|x86_64-wrs-vxworks7*)
>  i[34567]86-*-cygwin*)
> tm_file="${tm_file} i386/unix.h i386/bsd.h i386/gas.h i386/cygming.h 
> i386/cygwin.h i386/cygwin-stdint.h"
> tm_file="${tm_file} mingw/winnt.h"
> +   tm_file="${tm_file} mingw/winnt-dll.h"
> xm_file=i386/xm-cygwin.h
> tmake_file="${tmake_file} mingw/t-cygming t-slibgcc"
> target_gtfiles="$target_gtfiles \$(srcdir)/config/mingw/winnt.cc"
> +   target_gtfiles="$target_gtfiles \$(srcdir)/config/mingw/winnt-dll.cc"
> extra_options="${extra_options} mingw/cygming.opt i386/cygwin.opt"
> -   extra_objs="${extra_objs} winnt.o winnt-stubs.o"
> +   extra_objs="${extra_objs} winnt.o winnt-stubs.o winnt-dll.o"
> c_target_objs="${c_target_objs} msformat-c.o"
> cxx_target_objs="${cxx_target_objs} winnt-cxx.o msformat-c.o"
> d_target_objs="${d_target_objs} cygwin-d.o"
> @@ -2196,11 +2198,13 @@ x86_64-*-cygwin*)
> need_64bit_isa=yes
> tm_file="${tm_file} i386/unix.h i386/bsd.h i386/gas.h i386/cygming.h 
> i386/cygwin.h i386/cygwin-w64.h i386/cygwin-stdint.h"
> tm_file="${tm_file} mingw/winnt.h"
> +   tm_file="${tm_file} mingw/winnt-dll.h"
> xm_file=i386/xm-cygwin.h
> tmake_file="${tmake_file} mingw/t-cygming t-slibgcc"
> target_gtfiles="$target_gtfiles \$(srcdir)/config/mingw/winnt.cc"
> +   target_gtfiles="$target_gtfiles \$(srcdir)/config/mingw/winnt-dll.cc"
> extra_options="${extra_options} mingw/cygming.opt i386/cygwin.opt"
> -   extra_objs="${extra_objs} winnt.o winnt-stubs.o"
> +   extra_objs="${extra_objs} winnt.o winnt-stubs.o winnt-dll.o"
> c_target_objs="${c_target_objs} msformat-c.o"
> cxx_target_objs="${cxx_target_objs} winnt-cxx.o msformat-c.o"
> d_target_objs="${d_target_objs} cygwin-d.o"
> @@ -2266,6 +2270,7 @@ i[34567]86-*-mingw* | x86_64-*-mingw*)
> esac
> tm_file="${tm_file} mingw/mingw-stdint.h"
> 

Re: [x86 PATCH] PR target/115351: RTX costs for *concatditi3 and *insvti_highpart.

2024-06-07 Thread Uros Bizjak
On Fri, Jun 7, 2024 at 11:21 AM Roger Sayle  wrote:
>
>
> This patch addresses PR target/115351, which is a code quality regression
> on x86 when passing floating point complex numbers.  The ABI considers
> these arguments to have TImode, requiring interunit moves to place the
> FP values (which are actually passed in SSE registers) into the upper
> and lower parts of a TImode pseudo, and then similar moves back again
> before they can be used.
>
> The cause of the regression is that changes in how TImode initialization
> is represented in RTL now prevents the RTL optimizers from eliminating
> these redundant moves.  The specific cause is that the *concatditi3
> pattern, (zext(hi)<<64)|zext(lo), has an inappropriately high (default)
> rtx_cost, preventing fwprop1 from propagating it.  This pattern just
> sets the hipart and lopart of a double-word register, typically two
> instructions (less if reload can allocate things appropriately) but
> the current ix86_rtx_costs actually returns INSN_COSTS(13), i.e. 52.
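
(In C terms, an illustrative translation of the value this pattern
computes — this function is mine, not from the patch:)

unsigned __int128
concat (unsigned long long hi, unsigned long long lo)
{
  return ((unsigned __int128) hi << 64) | lo;
}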
>
> propagating insn 5 into insn 6, replacing:
> (set (reg:TI 110)
> (ior:TI (and:TI (reg:TI 110)
> (const_wide_int 0x0))
> (ashift:TI (zero_extend:TI (subreg:DI (reg:DF 112 [ zD.2796+8 ]) 0))
> (const_int 64 [0x40]
> successfully matched this instruction to *concatditi3_3:
> (set (reg:TI 110)
> (ior:TI (ashift:TI (zero_extend:TI (subreg:DI (reg:DF 112 [ zD.2796+8 ])
> 0))
> (const_int 64 [0x40]))
> (zero_extend:TI (subreg:DI (reg:DF 111 [ zD.2796 ]) 0
> change not profitable (cost 50 -> cost 52)
>
> This issue is resolved by having ix86_rtx_costs return more reasonable
> values for these (place-holder) patterns.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-06-07  Roger Sayle  
>
> gcc/ChangeLog
> PR target/115351
> * config/i386/i386.cc (ix86_rtx_costs): Provide estimates for the
> *concatditi3 and *insvti_highpart patterns, about two insns.
>
> gcc/testsuite/ChangeLog
> PR target/115351
> * g++.target/i386/pr115351.C: New test case.

LGTM.

Thanks,
Uros.

>
>
> Thanks in advance (and sorry for any inconvenience),
> Roger
> --
>


[committed] testsuite/i386: Add vector sat_sub testcases [PR112600]

2024-06-06 Thread Uros Bizjak
PR middle-end/112600

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr112600-2a.c: New test.
* gcc.target/i386/pr112600-2b.c: New test.

Tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/testsuite/gcc.target/i386/pr112600-2a.c 
b/gcc/testsuite/gcc.target/i386/pr112600-2a.c
new file mode 100644
index 000..4df38e5a720
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr112600-2a.c
@@ -0,0 +1,15 @@
+/* PR middle-end/112600 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -msse2" } */
+
+typedef unsigned char T;
+
+void foo (T *out, T *x, T *y, int n)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+out[i] = (x[i] - y[i]) & (-(T)(x[i] >= y[i]));
+}
+
+/* { dg-final { scan-assembler "psubusb" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr112600-2b.c 
b/gcc/testsuite/gcc.target/i386/pr112600-2b.c
new file mode 100644
index 000..0f6345de704
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr112600-2b.c
@@ -0,0 +1,15 @@
+/* PR middle-end/112600 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -msse2" } */
+
+typedef unsigned short T;
+
+void foo (T *out, T *x, T *y, int n)
+{
+  int i;
+
+  for (i = 0; i < n; i++)
+out[i] = (x[i] - y[i]) & (-(T)(x[i] >= y[i]));
+}
+
+/* { dg-final { scan-assembler "psubusw" } } */


Re: [PATCH v1] Internal-fn: Support new IFN SAT_SUB for unsigned scalar int

2024-06-05 Thread Uros Bizjak
On Wed, Jun 5, 2024 at 10:52 AM Li, Pan2  wrote:
>
> Thanks for explaining. I see, cmove is well designed for such cases.

If the question is whether it is worth converting via
__builtin_sub_overflow here when the target doesn't provide a scalar
saturating optab, I think the answer is yes. For x86, the compare will
be eliminated.

Please consider this testcase:

--cut here--
unsigned int
__attribute__((noinline))
foo (unsigned int x, unsigned int y)
{
  return x > y ? x - y : 0;
}

unsigned int
__attribute__((noinline))
bar (unsigned int x, unsigned int y)
{
  unsigned int z;

  return __builtin_sub_overflow (x, y, &z) ? 0 : z;
}
--cut here--

This will compile to:

 :
   0:   89 f8                   mov    %edi,%eax
   2:   31 d2                   xor    %edx,%edx
   4:   29 f0                   sub    %esi,%eax
   6:   39 fe                   cmp    %edi,%esi
   8:   0f 43 c2                cmovae %edx,%eax
   b:   c3                      ret
   c:   0f 1f 40 00             nopl   0x0(%rax)

0010 :
  10:   29 f7                   sub    %esi,%edi
  12:   72 03                   jb     17
  14:   89 f8                   mov    %edi,%eax
  16:   c3                      ret
  17:   31 c0                   xor    %eax,%eax
  19:   c3                      ret

Please note that the compare was eliminated in the latter test. So, if
the target does not provide a saturating optab but provides
__builtin_sub_overflow, I think it is worth emitting .SAT_SUB via
__builtin_sub_overflow (and in a similar way for saturated add).
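
For completeness, the saturating-add analogue via __builtin_add_overflow
would look like this (essentially the form used in the PR112600
testcases):

unsigned int
add_sat (unsigned int x, unsigned int y)
{
  unsigned int z;
  return __builtin_add_overflow (x, y, &z) ? -1u : z;
}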

Uros.


>
> Pan
>
> -Original Message-
> From: Uros Bizjak 
> Sent: Wednesday, June 5, 2024 4:46 PM
> To: Li, Pan2 
> Cc: Richard Biener ; gcc-patches@gcc.gnu.org; 
> juzhe.zh...@rivai.ai; kito.ch...@gmail.com; tamar.christ...@arm.com
> Subject: Re: [PATCH v1] Internal-fn: Support new IFN SAT_SUB for unsigned 
> scalar int
>
> On Wed, Jun 5, 2024 at 10:38 AM Li, Pan2  wrote:
> >
> > > I see. x86 doesn't have scalar saturating instructions, so the scalar
> > > version indeed can't be converted.
> >
> > > I will amend x86 testcases after the vector part of your patch is 
> > > committed.
> >
> > Thanks for the confirmation. Just curious: the scalar .SAT_SUB has
> > several forms, like the branch version below.
> >
> > .SAT_SUB (x, y) = x > y ? x - y : 0. // or leverage __builtin_sub_overflow here
> >
> > Is it reasonable to implement the scalar .SAT_SUB for x86, given that we
> > can somehow eliminate the branch here?
>
> x86 will emit cmove in the above case:
>
>    movl    %edi, %eax
>    xorl    %edx, %edx
>    subl    %esi, %eax
>    cmpl    %edi, %esi
>    cmovnb  %edx, %eax
>
> Maybe we can reuse flags from the subtraction here to avoid the compare.
>
> Uros.


Re: [PATCH v1] Internal-fn: Support new IFN SAT_SUB for unsigned scalar int

2024-06-05 Thread Uros Bizjak
On Wed, Jun 5, 2024 at 10:38 AM Li, Pan2  wrote:
>
> > I see. x86 doesn't have scalar saturating instructions, so the scalar
> > version indeed can't be converted.
>
> > I will amend x86 testcases after the vector part of your patch is committed.
>
> Thanks for the confirmation. Just curious: the scalar .SAT_SUB has
> several forms, like the branch version below.
>
> .SAT_SUB (x, y) = x > y ? x - y : 0. // or leverage __builtin_sub_overflow here
>
> Is it reasonable to implement the scalar .SAT_SUB for x86, given that we
> can somehow eliminate the branch here?

x86 will emit cmove in the above case:

    movl    %edi, %eax
    xorl    %edx, %edx
    subl    %esi, %eax
    cmpl    %edi, %esi
    cmovnb  %edx, %eax

Maybe we can reuse flags from the subtraction here to avoid the compare.
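
(For reference, the branchless C shape this would start from — the same
form as the sub_sat testcase elsewhere in the thread:)

unsigned int
sub_sat (unsigned int x, unsigned int y)
{
  unsigned int res = x - y;
  res &= -(x >= y);
  return res;
}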

Uros.


Re: [PATCH v1] Internal-fn: Support new IFN SAT_SUB for unsigned scalar int

2024-06-05 Thread Uros Bizjak
On Wed, Jun 5, 2024 at 10:22 AM Li, Pan2  wrote:
>
> > Is the above testcase correct? You need "(x + y)" as the first term.
>
> Thanks for the comments; that should be a copy issue here. You can take
> SAT_SUB (x, y) => (x - y) & (-(TYPE)(x >= y)) or the template below for
> reference.
>
> +#define DEF_SAT_U_SUB_FMT_1(T) \
> +T __attribute__((noinline))\
> +sat_u_sub_##T##_fmt_1 (T x, T y)   \
> +{  \
> +  return (x - y) & (-(T)(x >= y)); \
> +}
> +
> +#define DEF_SAT_U_SUB_FMT_2(T)\
> +T __attribute__((noinline))   \
> +sat_u_sub_##T##_fmt_2 (T x, T y)  \
> +{ \
> +  return (x - y) & (-(T)(x > y)); \
> +}
>
> > BTW: After applying your patch, I'm not able to produce .SAT_SUB with
> > x86_64 and the following testcase:
>
> You mean the vectorize part? This patch is only for unsigned scalar int
> (see title); the below is the vect part.
> Could you please help double-confirm whether you see .SAT_SUB after the
> widen_mul pass on x86 for unsigned scalar int?
> Of course, I will give it a try later, as I am in the middle of something.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/653024.html

I see. x86 doesn't have scalar saturating instructions, so the scalar
version indeed can't be converted.

I will amend x86 testcases after the vector part of your patch is committed.

Thanks,
Uros.


  1   2   3   4   5   6   7   8   9   10   >