Re: [PATCH] i386: Disallow long address mode in the x32 mode. [PR 117418]

2024-11-07 Thread Hongtao Liu
On Fri, Nov 8, 2024 at 12:18 PM H.J. Lu  wrote:
>
> On Fri, Nov 8, 2024 at 10:41 AM Hu, Lin1  wrote:
> >
> > Hi, all
> >
> > -maddress-mode=long will let Pmode = DImode, but -mx32 requests the x32 ABI.
> > So raise an error to avoid an ICE.
> >
> > Bootstrapped and regtested, OK for trunk?
> >
> > BRs,
> > Lin
> >
> > gcc/ChangeLog:
> >
> > PR target/117418
> > * config/i386/i386-options.cc (ix86_option_override_internal): 
> > Raise an
> > error with option -mx32 -maddress-mode=long.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR target/117418
> > * gcc.target/i386/pr117418-1.c: New test.
> > ---
> >  gcc/config/i386/i386-options.cc|  4 
> >  gcc/testsuite/gcc.target/i386/pr117418-1.c | 13 +
> >  2 files changed, 17 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117418-1.c
> >
> > diff --git a/gcc/config/i386/i386-options.cc 
> > b/gcc/config/i386/i386-options.cc
> > index 239269ecbdd..ba1abea2537 100644
> > --- a/gcc/config/i386/i386-options.cc
> > +++ b/gcc/config/i386/i386-options.cc
> > @@ -2190,6 +2190,10 @@ ix86_option_override_internal (bool main_args_p,
> > error ("address mode %qs not supported in the %s bit mode",
> >TARGET_64BIT_P (opts->x_ix86_isa_flags) ? "short" : "long",
> >TARGET_64BIT_P (opts->x_ix86_isa_flags) ? "64" : "32");
> > +
> > +  if (TARGET_X32_P (opts->x_ix86_isa_flags)
> > + && opts_set->x_ix86_pmode == PMODE_DI)
> > +   error ("address mode 'long' not supported in the x32 ABI");
>
> This looks wrong.   Try the encoded patch.
>
So that means -maddress-mode=long will override x32 to use 64-bit pointers?
> >  }
> >else
> >  opts->x_ix86_pmode = TARGET_LP64_P (opts->x_ix86_isa_flags)
> > diff --git a/gcc/testsuite/gcc.target/i386/pr117418-1.c 
> > b/gcc/testsuite/gcc.target/i386/pr117418-1.c
> > new file mode 100644
> > index 000..08430ef9d4b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr117418-1.c
> > @@ -0,0 +1,13 @@
> > +/* PR target/117418 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-maddress-mode=long -mwidekl -mx32" } */
> > +/* { dg-error "address mode 'long' not supported in the x32 ABI" "" { 
> > target *-*-* } 0 } */
> > +
> > +typedef __attribute__((__vector_size__(16))) long long V;
> > +V a;
> > +
> > +void
> > +foo()
> > +{
> > +__builtin_ia32_encodekey256_u32(0, a, a, &a);
> > +}
> > --
> > 2.31.1
> >
>
>
> --
> H.J.



-- 
BR,
Hongtao


Re: [PATCH] [x86_64] Add microarchitecture tunable for pass_align_tight_loops

2024-11-07 Thread Hongtao Liu
On Fri, Nov 8, 2024 at 10:21 AM Mayshao-oc  wrote:
>
> > > -Original Message-
> > > From: Xi Ruoyao 
> > > Sent: Thursday, November 7, 2024 1:12 PM
> > > To: Liu, Hongtao ; Mayshao-oc  > > o...@zhaoxin.com>; Hongtao Liu 
> > > Cc: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; ubiz...@gmail.com;
> > > richard.guent...@gmail.com; Tim Hu(WH-RD) ; Silvia
> > > Zhao(BJ-RD) ; Louis Qi(BJ-RD)
> > > ; Cobe Chen(BJ-RD) 
> > > Subject: Re: [PATCH] [x86_64] Add microarchitecture tunable for
> > > pass_align_tight_loops
> > > On Thu, 2024-11-07 at 04:58 +, Liu, Hongtao wrote:
> > > > > > > Hi all:
> > > > > > > For zhaoxin, I find no improvement when enable
> > > > > > > pass_align_tight_loops, and have performance drop in some cases.
> > > > > > > This patch add a new tunable to bypass
> > > > > > > pass_align_tight_loops in
> > > > > zhaoxin.
> > > > > > >
> > > > > > > Bootstrapped X86_64.
> > > > > > > Ok for trunk?
> > > > LGTM.
> > >
> > > I'd suggest adding a reference to PR 117438 in the subject and
> > > ChangeLog.
> > Yes, thanks.
> Add PR 117438 into the subject and ChangeLog.
PR target/117438
Others LGTM.
> > >
> > > --
> > > Xi Ruoyao 
> > > School of Aerospace Science and Technology, Xidian University
> BR
> Mayshao



-- 
BR,
Hongtao


Re: [PATCH v4 7/8] i386: Add zero maskload else operand.

2024-11-07 Thread Hongtao Liu
On Fri, Nov 8, 2024 at 1:58 AM Robin Dapp  wrote:
>
> From: Robin Dapp 
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (maskload):
> Call maskload..._1.
> (maskload_1): Rename.
Ok for x86 part.
> ---
>  gcc/config/i386/sse.md | 21 ++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 22c6c817dd7..1523e2c4d75 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -28641,7 +28641,7 @@ (define_insn 
> "_maskstore"
> (set_attr "btver2_decode" "vector")
> (set_attr "mode" "")])
>
> -(define_expand "maskload"
> +(define_expand "maskload_1"
>[(set (match_operand:V48_128_256 0 "register_operand")
> (unspec:V48_128_256
>   [(match_operand: 2 "register_operand")
> @@ -28649,13 +28649,28 @@ (define_expand "maskload"
>   UNSPEC_MASKMOV))]
>"TARGET_AVX")
>
> +(define_expand "maskload"
> +  [(set (match_operand:V48_128_256 0 "register_operand")
> +   (unspec:V48_128_256
> + [(match_operand: 2 "register_operand")
> +  (match_operand:V48_128_256 1 "memory_operand")
> +  (match_operand:V48_128_256 3 "const0_operand")]
> + UNSPEC_MASKMOV))]
> +  "TARGET_AVX"
> +{
> +  emit_insn (gen_maskload_1 (operands[0],
> +  operands[1],
> +  operands[2]));
> +  DONE;
> +})
> +
>  (define_expand "maskload"
>[(set (match_operand:V48_AVX512VL 0 "register_operand")
> (vec_merge:V48_AVX512VL
>   (unspec:V48_AVX512VL
> [(match_operand:V48_AVX512VL 1 "memory_operand")]
> UNSPEC_MASKLOAD)
> - (match_dup 0)
> +  (match_operand:V48_AVX512VL 3 "const0_operand")
>   (match_operand: 2 "register_operand")))]
>"TARGET_AVX512F")
>
> @@ -28665,7 +28680,7 @@ (define_expand "maskload"
>   (unspec:VI12HFBF_AVX512VL
> [(match_operand:VI12HFBF_AVX512VL 1 "memory_operand")]
> UNSPEC_MASKLOAD)
> - (match_dup 0)
> +  (match_operand:VI12HFBF_AVX512VL 3 "const0_operand")
>   (match_operand: 2 "register_operand")))]
>"TARGET_AVX512BW")
>
> --
> 2.47.0
>


-- 
BR,
Hongtao


Re: [PATCH 1/2] [x86] Support vector float_truncate for SF to BF.

2024-11-07 Thread Hongtao Liu
On Thu, Nov 7, 2024 at 3:52 PM Jakub Jelinek  wrote:
>
> On Thu, Nov 07, 2024 at 01:57:21PM +0800, Hongtao Liu wrote:
> > > Does it turn the sNaNs into infinities or qNaNs silently?
> > Yes.
>
> Into infinities?
Into qNaNs (sorry, I didn't read it carefully).
>
> > > Given the rounding, flag_rounding_math should avoid the hw instructions,
> > The default rounding mode for flag_rounding_math is rounding to
> > nearest, so I assume !flag_rounding_math is not needed for the
> > condition.
>
> flag_rounding_math is about it being ok to change the rounding mode at
> runtime.  So, with flag_rounding_math you can't rely on the rounding mode
> being to nearest; with !flag_rounding_math we do rely on it.
I see.
> So !flag_rounding_math is needed.  It is not on by default, so it isn't
> a big deal...
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Add -mavx512vl for pr117304-1.c

2024-11-06 Thread Hongtao Liu
On Thu, Nov 7, 2024 at 2:04 PM Hu, Lin1  wrote:
>
> > -Original Message-
> > From: Liu, Hongtao 
> > Sent: Thursday, November 7, 2024 11:41 AM
> > To: Hu, Lin1 ; gcc-patches@gcc.gnu.org
> > Cc: ubiz...@gmail.com
> > Subject: RE: [PATCH] i386: Add -mavx512vl for pr117304-1.c
> >
> >
> >
> > > -Original Message-
> > > From: Hu, Lin1 
> > > Sent: Thursday, November 7, 2024 11:03 AM
> > > To: gcc-patches@gcc.gnu.org
> > > Cc: Liu, Hongtao ; ubiz...@gmail.com
> > > Subject: [PATCH] i386: Add -mavx512vl for pr117304-1.c
> > >
> > > Hi, all
> > >
> > > Testing pr117304-1.c on a machine with only avx2 generates some
> > > different hints, so add -mavx512vl to its option list.
> > I didn't quite understand: what kind of hint is it, and why is avx512vl needed?
>
> When I cherry-picked this patch onto releases/gcc-14, I found that without
> -mavx512vl the hint is __builtin_ia32_cvtdq2ps256 rather than
> __builtin_ia32_cvtudq2ps128_mask. Based on lookup_name_fuzzy's comment "Look
> for the closest match for NAME within the currently valid scopes.", I think
> that hint is right, and the trunk hint is wrong only with -mavx512f
> -mevex512. To keep anyone from changing the hint output back, I want to add
> the -mavx512vl option so the hint stays right for now.
Can we use a regexp in the hint to avoid any changes in the future?
>
> BRs,
> Lin
>
> > >
> > > Bootstrapped and regtested on x86-64-pc-linux-gnu.
> > > I think it is an obvious commit, but I will still wait a while
> > > in case someone has other suggestions.
> > >
> > > BRs,
> > > Lin
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.target/i386/pr117304-1.c: Add -mavx512vl.
> > > ---
> > >  gcc/testsuite/gcc.target/i386/pr117304-1.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr117304-1.c
> > > b/gcc/testsuite/gcc.target/i386/pr117304-1.c
> > > index fc1c5bfd3e3..da26f4bd1b7 100644
> > > --- a/gcc/testsuite/gcc.target/i386/pr117304-1.c
> > > +++ b/gcc/testsuite/gcc.target/i386/pr117304-1.c
> > > @@ -1,6 +1,6 @@
> > >  /* PR target/117304 */
> > >  /* { dg-do compile } */
> > > -/* { dg-options "-O2 -mavx512f -mno-evex512" } */
> > > +/* { dg-options "-O2 -mavx512f -mno-evex512 -mavx512vl" } */
> > >
> > >  typedef __attribute__((__vector_size__(32))) int __v8si;  typedef
> > > __attribute__((__vector_size__(32))) unsigned int __v8su;
> > > --
> > > 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH 1/2] [x86] Support vector float_truncate for SF to BF.

2024-11-06 Thread Hongtao Liu
On Tue, Nov 5, 2024 at 5:19 PM Jakub Jelinek  wrote:
>
> On Tue, Nov 05, 2024 at 05:12:56PM +0800, Hongtao Liu wrote:
> > Yes, there's a mismatch between scalar and vector code, I assume users
> > may not care much about precision/NAN/INF/denormal behaviors for
> > vector code.
> > Just like we support
> > #define RECIP_MASK_DEFAULT (RECIP_MASK_VEC_DIV | RECIP_MASK_VEC_SQRT)
> >  but turn off
> > RECIP_MASK_DIV | RECIP_MASK_SQRT.
>
> Users who don't care should be using -ffast-math.  Users who do care
> should get proper behavior.
>
> > > I don't know what exactly the hw instructions do, whether they perform
> > > everything needed properly or just subset of it or none of it,
> >
> > Subset of it, hw instruction doesn't raise exceptions and always round
> > to nearest (even). Output denormals are always flushed to zero and
> > input denormals are always treated as zero. MXCSR is not consulted nor
> > updated.
>
> Does it turn the sNaNs into infinities or qNaNs silently?
Yes.
> Given the rounding, flag_rounding_math should avoid the hw instructions,
The default rounding mode for flag_rounding_math is rounding to
nearest, so I assume !flag_rounding_math is not needed for the
condition.

> and either HONOR_NANS or HONOR_SNANS should be used to predicate that.
>
> > > but the permutation fallback IMHO definitely needs to be guarded with
> > > the same flags as scalar code.
> > > For HONOR_NANS case or flag_rounding_math, the generic code (see expr.cc)
> > > uses the libgcc fallback.  Otherwise, generic code has
> > >   /* If we don't expect qNaNs nor sNaNs and can assume rounding
> > >  to nearest, we can expand the conversion inline as
> > >  (fromi + 0x7fff + ((fromi >> 16) & 1)) >> 16.  */
> > > and the backend has
> > > TARGET_SSE2 && flag_unsafe_math_optimizations && !HONOR_NANS (BFmode)
> > > shift (i.e. just the permutation).
> > > Note, even that (fromi + 0x7fff + ((fromi >> 16) & 1)) >> 16
> > > is doable in vectors.
> >
> > If you're concerned about that, I'll commit another patch to align the
> > condition of the vector expander with scalar ones for both extendmn2
> > and truncmn2.
>
> For the fallback, for HONOR_NANS or flag_rounding_math we just shouldn't
> use the fallback at all.  For flag_unsafe_math_optimizations, we can just
> use the simple permutation, i.ew. fromi >> 16, otherwise can use that
> (fromi + 0x7fff + ((fromi >> 16) & 1) followed by the permutation.
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] [x86_64] Add microarchitecture tunable for pass_align_tight_loops

2024-11-06 Thread Hongtao Liu
On Thu, Nov 7, 2024 at 10:29 AM MayShao-oc  wrote:
>
> Hi all:
>For Zhaoxin, I find no improvement when enabling pass_align_tight_loops,
> and a performance drop in some cases.
>This patch adds a new tunable to bypass pass_align_tight_loops on Zhaoxin.
>
>Bootstrapped X86_64.
>Ok for trunk?
> BR
> Mayshao
> gcc/ChangeLog:
>
> * config/i386/i386-features.cc (TARGET_ALIGN_TIGHT_LOOPS):
> Default to true on all processors except Zhaoxin.
> * config/i386/i386.h (TARGET_ALIGN_TIGHT_LOOPS): New Macro.
> * config/i386/x86-tune.def (X86_TUNE_ALIGN_TIGHT_LOOPS):
> New tune
> ---
>  gcc/config/i386/i386-features.cc | 4 +++-
>  gcc/config/i386/i386.h   | 3 +++
>  gcc/config/i386/x86-tune.def | 4 
>  3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/i386/i386-features.cc 
> b/gcc/config/i386/i386-features.cc
> index e2e85212a4f..d9fd92964fe 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -3620,7 +3620,9 @@ public:
>/* opt_pass methods: */
>bool gate (function *) final override
>  {
> -  return optimize && optimize_function_for_speed_p (cfun);
> +  return TARGET_ALIGN_TIGHT_LOOPS
> +&& optimize
> +&& optimize_function_for_speed_p (cfun);
>  }
>
>unsigned int execute (function *) final override
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 2dcd8803a08..7f9010246c2 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -466,6 +466,9 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
>  #define TARGET_USE_RCR ix86_tune_features[X86_TUNE_USE_RCR]
>  #define TARGET_SSE_MOVCC_USE_BLENDV \
> ix86_tune_features[X86_TUNE_SSE_MOVCC_USE_BLENDV]
> +#define TARGET_ALIGN_TIGHT_LOOPS \
> +ix86_tune_features[X86_TUNE_ALIGN_TIGHT_LOOPS]
> +
>
>  /* Feature tests against the various architecture variations.  */
>  enum ix86_arch_indices {
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 6ebb2fd3414..bd4fa8b3eee 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -542,6 +542,10 @@ DEF_TUNE (X86_TUNE_V2DF_REDUCTION_PREFER_HADDPD,
>  DEF_TUNE (X86_TUNE_SSE_MOVCC_USE_BLENDV,
>   "sse_movcc_use_blendv", ~m_CORE_ATOM)
>
> +/* X86_TUNE_ALIGN_TIGHT_LOOPS: if false, tight loops are not aligned. */
> +DEF_TUNE (X86_TUNE_ALIGN_TIGHT_LOOPS, "align_tight_loops",
> +~(m_ZHAOXIN))
Please also add ~(m_ZHAOXIN | m_CASCADELAKE | m_SKYLAKE_AVX512))
And could you put it under the section of

 /*/
-/* Branch predictor tuning  */
+/* Branch predictor and front-end tuning  */
 /*/
> +
>  
> /*/
>  /* AVX instruction selection tuning (some of SSE flags affects AVX, too) 
> */
>  
> /*/
> --
> 2.27.0
>


--
BR,
Hongtao


Re: [PATCH] testsuite: Fix up pr116725.c test [PR116725]

2024-11-06 Thread Hongtao Liu
On Wed, Nov 6, 2024 at 4:59 PM Jakub Jelinek  wrote:
>
> On Fri, Oct 18, 2024 at 02:05:59PM -0400, Antoni Boucher wrote:
> > PR target/116725
> > * gcc.target/i386/pr116725.c: Add test using those AVX builtins.
>
> This test FAILs for me, as I don't have the latest gas around and the test
> is dg-do assemble, so doesn't need just fixed compiler, but also assembler
> which supports those instructions.
>
> The following patch adds effective target directives to ensure assembler
> supports those too.
>
> Tested on x86_64-linux, ok for trunk?
Ok.
>
> 2024-11-06  Jakub Jelinek  
>
> PR target/116725
> * gcc.target/i386/pr116725.c: Add dg-require-effective-target
> avx512{dq,fp16,vl}.
>
> --- gcc/testsuite/gcc.target/i386/pr116725.c.jj 2024-11-05 22:07:12.588795051 
> +0100
> +++ gcc/testsuite/gcc.target/i386/pr116725.c2024-11-06 09:54:41.545064629 
> +0100
> @@ -2,6 +2,9 @@
>  /* { dg-do assemble } */
>  /* { dg-options "-masm=intel -mavx512dq -mavx512fp16 -mavx512vl" } */
>  /* { dg-require-effective-target masm_intel } */
> +/* { dg-require-effective-target avx512dq } */
> +/* { dg-require-effective-target avx512fp16 } */
> +/* { dg-require-effective-target avx512vl } */
>
>  #include 
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Add OPTION_MASK_ISA2_EVEX512 for some AVX512 instructions.

2024-11-05 Thread Hongtao Liu
On Wed, Nov 6, 2024 at 10:35 AM Hu, Lin1  wrote:
>
> Hi, all
>
> This patch aims to add OPTION_MASK_ISA2_EVEX512 for all AVX512 512-bit
> builtin functions, and raises an error when these builtins are used with
> -mno-evex512.
>
> Bootstrapped and Regtested on x86-64-pc-linux-gnu, OK for trunk and backport 
> to
> GCC14?
>
> BRs,
> Lin
>
> gcc/ChangeLog:
>
> PR target/117304
> * config/i386/i386-builtin.def: Add OPTION_MASK_ISA2_EVEX512 for some
> AVX512 512-bits instructions.
>
> gcc/testsuite/ChangeLog:
>
> PR target/117304
> * gcc.target/i386/pr117304-1.c: New test.
> ---
>  gcc/config/i386/i386-builtin.def   | 10 
>  gcc/testsuite/gcc.target/i386/pr117304-1.c | 28 ++
>  2 files changed, 33 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr117304-1.c
>
> diff --git a/gcc/config/i386/i386-builtin.def 
> b/gcc/config/i386/i386-builtin.def
> index c484e6dc29e..26c23780b1c 100644
> --- a/gcc/config/i386/i386-builtin.def
> +++ b/gcc/config/i386/i386-builtin.def
> @@ -3357,11 +3357,11 @@ BDESC (OPTION_MASK_ISA_AVX512F, 0, 
> CODE_FOR_sse_cvtsi2ss_round, "__builtin_ia32_
>  BDESC (OPTION_MASK_ISA_AVX512F | OPTION_MASK_ISA_64BIT, 0, 
> CODE_FOR_sse_cvtsi2ssq_round, "__builtin_ia32_cvtsi2ss64", 
> IX86_BUILTIN_CVTSI2SS64, UNKNOWN, (int) V4SF_FTYPE_V4SF_INT64_INT)
>  BDESC (OPTION_MASK_ISA_AVX512F, 0, CODE_FOR_sse2_cvtss2sd_round, 
> "__builtin_ia32_cvtss2sd_round", IX86_BUILTIN_CVTSS2SD_ROUND, UNKNOWN, (int) 
> V2DF_FTYPE_V2DF_V4SF_INT)
>  BDESC (OPTION_MASK_ISA_AVX512F, 0, CODE_FOR_sse2_cvtss2sd_mask_round, 
> "__builtin_ia32_cvtss2sd_mask_round", IX86_BUILTIN_CVTSS2SD_MASK_ROUND, 
> UNKNOWN, (int) V2DF_FTYPE_V2DF_V4SF_V2DF_UQI_INT)
> -BDESC (OPTION_MASK_ISA_AVX512F, 0, 
> CODE_FOR_unspec_fix_truncv8dfv8si2_mask_round, 
> "__builtin_ia32_cvttpd2dq512_mask", IX86_BUILTIN_CVTTPD2DQ512, UNKNOWN, (int) 
> V8SI_FTYPE_V8DF_V8SI_QI_INT)
> -BDESC (OPTION_MASK_ISA_AVX512F, 0, 
> CODE_FOR_unspec_fixuns_truncv8dfv8si2_mask_round, 
> "__builtin_ia32_cvttpd2udq512_mask", IX86_BUILTIN_CVTTPD2UDQ512, UNKNOWN, 
> (int) V8SI_FTYPE_V8DF_V8SI_QI_INT)
> -BDESC (OPTION_MASK_ISA_AVX512F, 0, 
> CODE_FOR_unspec_fix_truncv16sfv16si2_mask_round, 
> "__builtin_ia32_cvttps2dq512_mask", IX86_BUILTIN_CVTTPS2DQ512, UNKNOWN, (int) 
> V16SI_FTYPE_V16SF_V16SI_HI_INT)
> -BDESC (OPTION_MASK_ISA_AVX512F, 0, 
> CODE_FOR_unspec_fixuns_truncv16sfv16si2_mask_round, 
> "__builtin_ia32_cvttps2udq512_mask", IX86_BUILTIN_CVTTPS2UDQ512, UNKNOWN, 
> (int) V16SI_FTYPE_V16SF_V16SI_HI_INT)
> -BDESC (OPTION_MASK_ISA_AVX512F, 0, CODE_FOR_floatunsv16siv16sf2_mask_round, 
> "__builtin_ia32_cvtudq2ps512_mask", IX86_BUILTIN_CVTUDQ2PS512, UNKNOWN, (int) 
> V16SF_FTYPE_V16SI_V16SF_HI_INT)
> +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, 
> CODE_FOR_unspec_fix_truncv8dfv8si2_mask_round, 
> "__builtin_ia32_cvttpd2dq512_mask", IX86_BUILTIN_CVTTPD2DQ512, UNKNOWN, (int) 
> V8SI_FTYPE_V8DF_V8SI_QI_INT)
> +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, 
> CODE_FOR_unspec_fixuns_truncv8dfv8si2_mask_round, 
> "__builtin_ia32_cvttpd2udq512_mask", IX86_BUILTIN_CVTTPD2UDQ512, UNKNOWN, 
> (int) V8SI_FTYPE_V8DF_V8SI_QI_INT)
> +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, 
> CODE_FOR_unspec_fix_truncv16sfv16si2_mask_round, 
> "__builtin_ia32_cvttps2dq512_mask", IX86_BUILTIN_CVTTPS2DQ512, UNKNOWN, (int) 
> V16SI_FTYPE_V16SF_V16SI_HI_INT)
> +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, 
> CODE_FOR_unspec_fixuns_truncv16sfv16si2_mask_round, 
> "__builtin_ia32_cvttps2udq512_mask", IX86_BUILTIN_CVTTPS2UDQ512, UNKNOWN, 
> (int) V16SI_FTYPE_V16SF_V16SI_HI_INT)
> +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, 
> CODE_FOR_floatunsv16siv16sf2_mask_round, "__builtin_ia32_cvtudq2ps512_mask", 
> IX86_BUILTIN_CVTUDQ2PS512, UNKNOWN, (int) V16SF_FTYPE_V16SI_V16SF_HI_INT)
>  BDESC (OPTION_MASK_ISA_AVX512F | OPTION_MASK_ISA_64BIT, 0, 
> CODE_FOR_cvtusi2sd64_round, "__builtin_ia32_cvtusi2sd64", 
> IX86_BUILTIN_CVTUSI2SD64, UNKNOWN, (int) V2DF_FTYPE_V2DF_UINT64_INT)
>  BDESC (OPTION_MASK_ISA_AVX512F, 0, CODE_FOR_cvtusi2ss32_round, 
> "__builtin_ia32_cvtusi2ss32", IX86_BUILTIN_CVTUSI2SS32, UNKNOWN, (int) 
> V4SF_FTYPE_V4SF_UINT_INT)
>  BDESC (OPTION_MASK_ISA_AVX512F | OPTION_MASK_ISA_64BIT, 0, 
> CODE_FOR_cvtusi2ss64_round, "__builtin_ia32_cvtusi2ss64", 
> IX86_BUILTIN_CVTUSI2SS64, UNKNOWN, (int) V4SF_FTYPE_V4SF_UINT64_INT)
> diff --git a/gcc/testsuite/gcc.target/i386/pr117304-1.c 
> b/gcc/testsuite/gcc.target/i386/pr117304-1.c
> new file mode 100644
> index 000..68419338524
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr117304-1.c
> @@ -0,0 +1,28 @@
> +/* PR target/117304 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx10.1 -mno-evex512" } */
Please use -mavx512f -mno-evex512 to avoid a warning when gcc is
configured with --with-arch=native on an avx512 machine.
Otherwise LGTM.
> +
> +typedef __attribute__((__vecto

Re: [PATCH] [x86_64] Add flag to control tight loops alignment opt

2024-11-05 Thread Hongtao Liu
On Tue, Nov 5, 2024 at 5:50 PM Mayshao-oc  wrote:
>
>
> >
> >
> > On Tue, Nov 5, 2024 at 2:34 PM Liu, Hongtao  wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: MayShao-oc 
> > > > Sent: Tuesday, November 5, 2024 11:20 AM
> > > > To: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; Liu, Hongtao
> > > > ; ubiz...@gmail.com
> > > > Cc: ti...@zhaoxin.com; silviaz...@zhaoxin.com; loui...@zhaoxin.com;
> > > > cobec...@zhaoxin.com
> > > > Subject: [PATCH] [x86_64] Add flag to control tight loops alignment opt
> > > >
> > > > Hi all:
> > > > This patch adds the -malign-tight-loops flag to control
> > > > pass_align_tight_loops.
> > > > The motivation is that pass_align_tight_loops may cause performance
> > > > regression in nested loops.
> > > >
> > > > The example code as follows:
> > > >
> > > > #define ITER 2
> > > > #define ITER_O 10
> > > >
> > > > int i, j,k;
> > > > int array[ITER];
> > > >
> > > > void loop()
> > > > {
> > > >   int i;
> > > >   for(k = 0; k < ITER_O; k++)
> > > >   for(j = 0; j < ITER; j++)
> > > >   for(i = 0; i < ITER; i++)
> > > >   {
> > > > array[i] += j;
> > > > array[i] += i;
> > > > array[i] += 2*j;
> > > > array[i] += 2*i;
> > > >   }
> > > > }
> > > >
> > > > When I compile it with gcc -O1 loop.c, the output assembly is as
> > > > follows.
> > > > It is not optimal, because too many nops are inserted in the outer loop.
> > > >
> > > > 00400540 :
> > > >   400540: 48 83 ec 08 sub$0x8,%rsp
> > > >   400544: bf 0a 00 00 00  mov$0xa,%edi
> > > >   400549: b9 00 00 00 00  mov$0x0,%ecx
> > > >   40054e: 8d 34 09lea(%rcx,%rcx,1),%esi
> > > >   400551: b8 00 00 00 00  mov$0x0,%eax
> > > >   400556: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> > > >   40055d: 00 00 00 00
> > > >   400561: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> > > >   400568: 00 00 00 00
> > > >   40056c: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> > > >   400573: 00 00 00 00
> > > >   400577: 66 0f 1f 84 00 00 00nopw   0x0(%rax,%rax,1)
> > > >   40057e: 00 00
> > > >   400580: 89 ca   mov%ecx,%edx
> > > >   400582: 03 14 85 60 10 60 00add0x601060(,%rax,4),%edx
> > > >   400589: 01 c2   add%eax,%edx
> > > >   40058b: 01 f2   add%esi,%edx
> > > >   40058d: 8d 14 42lea(%rdx,%rax,2),%edx
> > > >   400590: 89 14 85 60 10 60 00mov%edx,0x601060(,%rax,4)
> > > >   400597: 48 83 c0 01 add$0x1,%rax
> > > >   40059b: 48 3d 20 4e 00 00   cmp$0x4e20,%rax
> > > >   4005a1: 75 dd   jne400580 
> > > >
> > > >I benchmarked this program on an Intel Xeon, and found the
> > > > optimization may
> > > > cause a 40% performance regression (6.6B cycles vs 9.3B cycles).
> > On SPR, align is 25% better than no_align case.
>
>   I found no_align is 10% better on Zhaoxin Yongfeng, so I tested this
> program on a Xeon 4210R and found a 40% performance regression. So I thought
> this might be a general regression that needs a flag to control. As you say,
> on SPR align is better, so it is not a general regression.
>  Could you please benchmark this on a Xeon 4210R, or a similar arch?
>  I have not delved into the Intel 4210R arch, and I think a 40% drop is
> hard to explain; maybe I made a mistake.
>  I can confirm that on Zhaoxin Yongfeng, no_align is 10% better.
>  I attach the Xeon 4210R cpuinfo and the test binary for your
> reference. Thanks.

I reproduced a 30% regression on CLX; the aligned case is more
frontend-bound. It is uarch-specific, so I will make it a uarch tune.

> >
> > > >So I propose to add the -malign-tight-loops flag to control the tight
> > > > loop optimization and avoid this; we could disable this optimization by
> > > > default.
> > > >Bootstrapped X86_64.
> > > >Ok for trunk?
> > > >
> > > > BR
> > > > Mayshao
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >   * config/i386/i386-features.cc (ix86_align_tight_loops): New flag.
> > > >   * config/i386/i386.opt (malign-tight-loops): New option.
> > > >   * doc/invoke.texi (-malign-tight-loops): Document.
> > > > ---
> > > >  gcc/config/i386/i386-features.cc | 4 +++-
> > > >  gcc/config/i386/i386.opt | 4 
> > > >  gcc/doc/invoke.texi  | 7 ++-
> > > >  3 files changed, 13 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-
> > > > features.cc
> > > > index e2e85212a4f..f9546e00b07 100644
> > > > --- a/gcc/config/i386/i386-features.cc
> > > > +++ b/gcc/config/i386/i386-features.cc
> > > > @@ -3620,7 +3620,9 @@ public:
> > > >/* opt_pass methods: */
> > > >bool gate (funct

Re: [PATCH] gcc.target/i386/apx-ndd.c: Also scan (%edi)

2024-11-05 Thread Hongtao Liu
On Wed, Nov 6, 2024 at 8:19 AM H.J. Lu  wrote:
>
> Since x32 uses (%edi) instead of (%rdi), also scan (%edi).
>
> * gcc.target/i386/apx-ndd.c: Also scan (%edi).
Ok.
>
> --
> H.J.



-- 
BR,
Hongtao


Re: [PATCH] Intel MOVRS tests: Also scan (%e.x)

2024-11-05 Thread Hongtao Liu
On Wed, Nov 6, 2024 at 8:21 AM H.J. Lu  wrote:
>
> Since x32 uses (%reg32) instead of (%r.x), also scan (%e.x).
>
> * gcc.target/i386/avx10_2-512-movrs-1.c: Also scan (%e.x).
> * gcc.target/i386/avx10_2-movrs-1.c: Likewise.
> * gcc.target/i386/movrs-1.c: Likewise.
Ok.
>
> --
> H.J.



-- 
BR,
Hongtao


Re: [PATCH] [x86_64] Add flag to control tight loops alignment opt

2024-11-05 Thread Hongtao Liu
On Tue, Nov 5, 2024 at 5:33 PM Richard Biener
 wrote:
>
> On Tue, Nov 5, 2024 at 8:12 AM Hongtao Liu  wrote:
> >
> > On Tue, Nov 5, 2024 at 2:34 PM Liu, Hongtao  wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: MayShao-oc 
> > > > Sent: Tuesday, November 5, 2024 11:20 AM
> > > > To: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; Liu, Hongtao
> > > > ; ubiz...@gmail.com
> > > > Cc: ti...@zhaoxin.com; silviaz...@zhaoxin.com; loui...@zhaoxin.com;
> > > > cobec...@zhaoxin.com
> > > > Subject: [PATCH] [x86_64] Add flag to control tight loops alignment opt
> > > >
> > > > Hi all:
> > > > This patch adds the -malign-tight-loops flag to control
> > > > pass_align_tight_loops.
> > > > The motivation is that pass_align_tight_loops may cause performance
> > > > regression in nested loops.
> > > >
> > > > The example code as follows:
> > > >
> > > > #define ITER 2
> > > > #define ITER_O 10
> > > >
> > > > int i, j,k;
> > > > int array[ITER];
> > > >
> > > > void loop()
> > > > {
> > > >   int i;
> > > >   for(k = 0; k < ITER_O; k++)
> > > >   for(j = 0; j < ITER; j++)
> > > >   for(i = 0; i < ITER; i++)
> > > >   {
> > > > array[i] += j;
> > > > array[i] += i;
> > > > array[i] += 2*j;
> > > > array[i] += 2*i;
> > > >   }
> > > > }
> > > >
> > > > When I compile it with gcc -O1 loop.c, the output assembly is as
> > > > follows.
> > > > It is not optimal, because too many nops are inserted in the outer loop.
> > > >
> > > > 00400540 :
> > > >   400540: 48 83 ec 08 sub$0x8,%rsp
> > > >   400544: bf 0a 00 00 00  mov$0xa,%edi
> > > >   400549: b9 00 00 00 00  mov$0x0,%ecx
> > > >   40054e: 8d 34 09lea(%rcx,%rcx,1),%esi
> > > >   400551: b8 00 00 00 00  mov$0x0,%eax
> > > >   400556: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> > > >   40055d: 00 00 00 00
> > > >   400561: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> > > >   400568: 00 00 00 00
> > > >   40056c: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> > > >   400573: 00 00 00 00
> > > >   400577: 66 0f 1f 84 00 00 00nopw   0x0(%rax,%rax,1)
> > > >   40057e: 00 00
> > > >   400580: 89 ca   mov%ecx,%edx
> > > >   400582: 03 14 85 60 10 60 00add0x601060(,%rax,4),%edx
> > > >   400589: 01 c2   add%eax,%edx
> > > >   40058b: 01 f2   add%esi,%edx
> > > >   40058d: 8d 14 42lea(%rdx,%rax,2),%edx
> > > >   400590: 89 14 85 60 10 60 00mov%edx,0x601060(,%rax,4)
> > > >   400597: 48 83 c0 01 add$0x1,%rax
> > > >   40059b: 48 3d 20 4e 00 00   cmp$0x4e20,%rax
> > > >   4005a1: 75 dd   jne400580 
> > > >
> > > >I benchmarked this program on an Intel Xeon, and found the
> > > > optimization may
> > > > cause a 40% performance regression (6.6B cycles vs 9.3B cycles).
> > On SPR, align is 25% better than no_align case.
>
> that would ask for a tunable rather than a new flag then?
Good idea.
>
> Not knowing much about the pass and how it affects -falign-loops=N - is that
> flag still honored when the pass is switched off?
The pass will override -falign-loops=N whenever the loop size is
smaller than a cache line; otherwise -falign-loops=N is still honored.
>
> >
> > > >So I propose to add the -malign-tight-loops flag to control the tight
> > > > loop optimization and avoid this; we could disable this optimization by
> > > > default.
> > > >Bootstrapped X86_64.
> > > >Ok for trunk?
> > > >
> > > > BR
> > > > Mayshao
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >   * config/i386/i386-features.cc (ix86_align_tight_loops): New flag.
> > > >   * config/i386/i386.opt (malign-tight-loop

Re: [PATCH 1/2] [x86] Support vector float_truncate for SF to BF.

2024-11-05 Thread Hongtao Liu
On Tue, Nov 5, 2024 at 4:46 PM Jakub Jelinek  wrote:
>
> On Tue, Oct 29, 2024 at 07:19:38PM -0700, liuhongt wrote:
> > Generate native instruction whenever possible, otherwise use vector
> > permutation with odd indices.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ready push to trunk.
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386-expand.cc
> >   (ix86_expand_vector_sf2bf_with_vec_perm): New function.
> >   * config/i386/i386-protos.h
> >   (ix86_expand_vector_sf2bf_with_vec_perm): New declare.
> >   * config/i386/mmx.md (truncv2sfv2bf2): New expander.
> >   * config/i386/sse.md (truncv4sfv4bf2): Ditto.
> >   (truncv8sfv8bf2): Ditto.
> >   (truncv16sfv16bf2): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/i386/avx512bf16-truncsfbf.c: New test.
> >   * gcc.target/i386/avx512bw-truncsfbf.c: New test.
> >   * gcc.target/i386/ssse3-truncsfbf.c: New test.
>
> Is that correct for non-ffast-math?
> I mean, truncation from SF to BFmode e.g. when honoring NaNs definitely
> isn't a simple permutation.
> A SFmode sNaN which has non-zero bits in the mantissa only in the lower
> 16-bits would be silently turned into +-Inf rather than raise exception
> and turn it into a qNaN.
> Similarly, the result when not using -ffast-math needs to be correctly
> rounded (according to the current rounding mode, at least with
> -frounding-math, otherwise at least for round to even), permutation
> definitely doesn't achieve that.

Yes, there's a mismatch between scalar and vector code; I assume users
may not care much about precision/NaN/Inf/denormal behavior for
vector code.
Just like we support
#define RECIP_MASK_DEFAULT (RECIP_MASK_VEC_DIV | RECIP_MASK_VEC_SQRT)
 but turn off
RECIP_MASK_DIV | RECIP_MASK_SQRT.

>
> I don't know what exactly the hw instructions do, whether they perform
> everything needed properly or just subset of it or none of it,

A subset of it: the hw instruction doesn't raise exceptions and always rounds
to nearest (even). Output denormals are always flushed to zero and
input denormals are always treated as zero. MXCSR is neither consulted nor
updated.

> but the permutation fallback IMHO definitely needs to be guarded with
> the same flags as scalar code.
> For HONOR_NANS case or flag_rounding_math, the generic code (see expr.cc)
> uses the libgcc fallback.  Otherwise, generic code has
>   /* If we don't expect qNaNs nor sNaNs and can assume rounding
>  to nearest, we can expand the conversion inline as
>  (fromi + 0x7fff + ((fromi >> 16) & 1)) >> 16.  */
> and the backend has
> TARGET_SSE2 && flag_unsafe_math_optimizations && !HONOR_NANS (BFmode)
> shift (i.e. just the permutation).
> Note, even that (fromi + 0x7fff + ((fromi >> 16) & 1)) >> 16
> is doable in vectors.

If you're concerned about that, I'll commit another patch to align the
conditions of the vector expanders with the scalar ones for both
extendmn2 and truncmn2.
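For reference, both inline strategies discussed above are easy to model on the raw bit patterns, and both map Jakub's example — an sNaN whose mantissa bits live only in the low 16 bits (0x7f800001) — to 0x7f80, i.e. +Inf in BFmode. That is exactly why the generic expander only uses them when NaNs need not be honored. A minimal sketch:

```c
#include <stdint.h>

/* What the odd-index permutation computes: drop the low 16 bits.  */
static uint16_t
sf_to_bf_trunc (uint32_t sf_bits)
{
  return sf_bits >> 16;
}

/* The round-to-nearest-even trick quoted from expr.cc above; only
   valid when qNaNs/sNaNs need not be honored and rounding is RNE.  */
static uint16_t
sf_to_bf_rne (uint32_t sf_bits)
{
  return (sf_bits + 0x7fff + ((sf_bits >> 16) & 1)) >> 16;
}
```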

>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] [x86_64] Add flag to control tight loops alignment opt

2024-11-04 Thread Hongtao Liu
On Tue, Nov 5, 2024 at 2:34 PM Liu, Hongtao  wrote:
>
>
>
> > -Original Message-
> > From: MayShao-oc 
> > Sent: Tuesday, November 5, 2024 11:20 AM
> > To: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; Liu, Hongtao
> > ; ubiz...@gmail.com
> > Cc: ti...@zhaoxin.com; silviaz...@zhaoxin.com; loui...@zhaoxin.com;
> > cobec...@zhaoxin.com
> > Subject: [PATCH] [x86_64] Add flag to control tight loops alignment opt
> >
> > Hi all:
> > This patch adds the -malign-tight-loops flag to control
> > pass_align_tight_loops.  The motivation is that pass_align_tight_loops
> > may cause a performance regression in nested loops.
> >
> > The example code is as follows:
> >
> > #define ITER 2
> > #define ITER_O 10
> >
> > int i, j,k;
> > int array[ITER];
> >
> > void loop()
> > {
> >   int i;
> >   for(k = 0; k < ITER_O; k++)
> >   for(j = 0; j < ITER; j++)
> >   for(i = 0; i < ITER; i++)
> >   {
> > array[i] += j;
> > array[i] += i;
> > array[i] += 2*j;
> > array[i] += 2*i;
> >   }
> > }
> >
> > When I compile it with gcc -O1 loop.c, the output assembly is as follows.
> > It is not optimal because too many nops are inserted in the outer loop.
> >
> > 00400540 :
> >   400540: 48 83 ec 08 sub$0x8,%rsp
> >   400544: bf 0a 00 00 00  mov$0xa,%edi
> >   400549: b9 00 00 00 00  mov$0x0,%ecx
> >   40054e: 8d 34 09lea(%rcx,%rcx,1),%esi
> >   400551: b8 00 00 00 00  mov$0x0,%eax
> >   400556: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> >   40055d: 00 00 00 00
> >   400561: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> >   400568: 00 00 00 00
> >   40056c: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1)
> >   400573: 00 00 00 00
> >   400577: 66 0f 1f 84 00 00 00nopw   0x0(%rax,%rax,1)
> >   40057e: 00 00
> >   400580: 89 ca   mov%ecx,%edx
> >   400582: 03 14 85 60 10 60 00add0x601060(,%rax,4),%edx
> >   400589: 01 c2   add%eax,%edx
> >   40058b: 01 f2   add%esi,%edx
> >   40058d: 8d 14 42lea(%rdx,%rax,2),%edx
> >   400590: 89 14 85 60 10 60 00mov%edx,0x601060(,%rax,4)
> >   400597: 48 83 c0 01 add$0x1,%rax
> >   40059b: 48 3d 20 4e 00 00   cmp$0x4e20,%rax
> >   4005a1: 75 dd   jne400580 
> >
> > I benchmarked this program on an Intel Xeon and found the optimization may
> > cause a 40% performance regression (6.6B cycles vs. 9.3B cycles).
On SPR, the aligned case is 25% better than the non-aligned case.

> > So I propose adding the -malign-tight-loops flag to control the tight-loop
> > alignment optimization and avoid this; we could disable the optimization
> > by default.
> > Bootstrapped on x86_64.
> > Ok for trunk?
> >
> > BR
> > Mayshao
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386-features.cc (ix86_align_tight_loops): New flag.
> >   * config/i386/i386.opt (malign-tight-loops): New option.
> >   * doc/invoke.texi (-malign-tight-loops): Document.
> > ---
> >  gcc/config/i386/i386-features.cc | 4 +++-
> >  gcc/config/i386/i386.opt | 4 
> >  gcc/doc/invoke.texi  | 7 ++-
> >  3 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-
> > features.cc
> > index e2e85212a4f..f9546e00b07 100644
> > --- a/gcc/config/i386/i386-features.cc
> > +++ b/gcc/config/i386/i386-features.cc
> > @@ -3620,7 +3620,9 @@ public:
> >/* opt_pass methods: */
> >bool gate (function *) final override
> >  {
> > -  return optimize && optimize_function_for_speed_p (cfun);
> > +  return ix86_align_tight_loops
> > +&& optimize
> > +&& optimize_function_for_speed_p (cfun);
> >  }
> >
> >unsigned int execute (function *) final override diff --git
> > a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt index
> > 64c295d344c..ec41de192bc 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -1266,6 +1266,10 @@ mlam=
> >  Target RejectNegative Joined Enum(lam_type) Var(ix86_lam_type)
> > Init(lam_none)  -mlam=[none|u48|u57] Instrument meta data position in
> > user data pointers.
> >
> > +malign-tight-loops
> > +Target Var(ix86_align_tight_loops) Init(0) Optimization Enable align
> > +tight loops.
>
> I'd like it to be on by default, so Init (1)?
>
> > +
> >  Enum
> >  Name(lam_type) Type(enum lam_type) UnknownError(unknown lam
> > type %qs)
> >
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index
> > 07920e07b4d..9ec1e1f0095 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -1510,7 +1510,7 @@ See RS/6000 and PowerPC Options.
> >  -mindirect-branch=@var{choice}  -mfunction-return=@var{choice}  -
> > mindirect-branch-register -mharden-sls=@var

Re: [PATCH v2] i386: Handling exception input of __builtin_ia32_prefetch. [PR117416]

2024-11-04 Thread Hongtao Liu
On Tue, Nov 5, 2024 at 2:41 PM Hu, Lin1  wrote:
>
> > -Original Message-
> > From: Hu, Lin1 
> > Sent: Tuesday, November 5, 2024 1:34 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Liu, Hongtao ; ubiz...@gmail.com
> > Subject: [PATCH v2] i386: Handling exception input of
> > __builtin_ia32_prefetch. [PR117416]
> >
> > Added a handler for op3; the previously mentioned FAIL is a random
> > failure not related to this change. OK for trunk?
> >
>
> The FAIL mentioned here is gcc.dg/torture/convert-dfp.c, triggered by the
> test environment.
>
> Its output is "i386 architecture of input file 
> `./convert-dfp.ltrans0.ltrans.o' is incompatible with i386:x86-64 output."
>
> When I tested my patch in another test environment, the FAIL disappeared;
> judging from the test result and the output, the FAIL isn't related to this
> patch. I think this part of the change is safe.
Ok for the commit.
>
> BRs,
> Lin
>
> >
> > op1 should be between 0 and 2, so raise a warning when op1 is out of
> > range; op3 should be 0 or 1, so raise a warning when op3 is an invalid
> > value.
> >
> > gcc/ChangeLog:
> >
> >   PR target/117416
> >   * config/i386/i386-expand.cc (ix86_expand_builtin): Raise warning
> > when
> >   op1 isn't in range of [0, 2] and set op1 as const0_rtx, and raise
> >   warning when op3 isn't in range of [0, 1].
> >
> > gcc/testsuite/ChangeLog:
> >
> >   PR target/117416
> >   * gcc.target/i386/pr117416-1.c: New test.
> >   * gcc.target/i386/pr117416-2.c: Ditto.
> > ---
> >  gcc/config/i386/i386-expand.cc | 11 +++
> >  gcc/testsuite/gcc.target/i386/pr117416-1.c | 12 
> > gcc/testsuite/gcc.target/i386/pr117416-2.c | 12 
> >  3 files changed, 35 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117416-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117416-2.c
> >
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index 515334aa5a3..fcd4b3b67b7 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -14194,6 +14194,13 @@ ix86_expand_builtin (tree exp, rtx target, rtx
> > subtarget,
> >   return const0_rtx;
> > }
> >
> > + if (!IN_RANGE (INTVAL (op1), 0, 2))
> > +   {
> > + warning (0, "invalid second argument to"
> > +  " %<__builtin_ia32_prefetch%>; using zero");
> > + op1 = const0_rtx;
> > +   }
> > +
> >   if (INTVAL (op3) == 1)
> > {
> >   if (INTVAL (op2) < 2 || INTVAL (op2) > 3) @@ -14216,6 +14223,10
> > @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
> > }
> >   else
> > {
> > + if (INTVAL (op3) != 0)
> > +   warning (0, "invalid forth argument to"
> > +   " %<__builtin_ia32_prefetch%>; using zero");
> > +
> >   if (!address_operand (op0, VOIDmode))
> > {
> >   op0 = convert_memory_address (Pmode, op0); diff --git
> > a/gcc/testsuite/gcc.target/i386/pr117416-1.c
> > b/gcc/testsuite/gcc.target/i386/pr117416-1.c
> > new file mode 100644
> > index 000..7062f27e21a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr117416-1.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O0" } */
> > +
> > +#include 
> > +
> > +void* p;
> > +
> > +void extern
> > +prefetch_test (void)
> > +{
> > +  __builtin_ia32_prefetch (p, 5, 0, 0); /* { dg-warning "invalid second
> > +argument to '__builtin_ia32_prefetch'; using zero" } */ }
> > diff --git a/gcc/testsuite/gcc.target/i386/pr117416-2.c
> > b/gcc/testsuite/gcc.target/i386/pr117416-2.c
> > new file mode 100644
> > index 000..1397645cbfc
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr117416-2.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O0" } */
> > +
> > +#include 
> > +
> > +void* p;
> > +
> > +void extern
> > +prefetch_test (void)
> > +{
> > +  __builtin_ia32_prefetch (p, 0, 0, 2); /* { dg-warning "invalid forth
> > +argument to '__builtin_ia32_prefetch'; using zero" } */ }
> > --
> > 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Handling exception input of __builtin_ia32_prefetch. [PR117416]

2024-11-04 Thread Hongtao Liu
On Tue, Nov 5, 2024 at 10:52 AM Hu, Lin1  wrote:
>
> Hi, all
>
> __builtin_ia32_prefetch's op1 should be between 0 and 2, so add a check
> that warns about out-of-range values.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu; there is an unrelated
> FAIL whose root cause has yet to be found, so I'm just sending the patch
> for review.
>
> BRs,
> Lin
>
> gcc/ChangeLog:
>
> PR target/117416
> * config/i386/i386-expand.cc (ix86_expand_builtin): Raise a warning when
> op1 isn't in the range [0, 2] and set op1 to const0_rtx.
>
> gcc/testsuite/ChangeLog:
>
> PR target/117416
> * gcc.target/i386/pr117416-1.c: New test.
> ---
>  gcc/config/i386/i386-expand.cc |  7 +++
>  gcc/testsuite/gcc.target/i386/pr117416-1.c | 12 
>  2 files changed, 19 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr117416-1.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 515334aa5a3..5dab5859463 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -14194,6 +14194,13 @@ ix86_expand_builtin (tree exp, rtx target, rtx 
> subtarget,
> return const0_rtx;
>   }
>
> +   if (!IN_RANGE (INTVAL (op1), 0, 2))
> + {
> +   warning (0, "invalid second argument to"
> +" %<__builtin_ia32_prefetch%>; using zero");
> +   op1 = const0_rtx;
> + }
> +
op3 should be handled similarly: 1 indicates instruction prefetch,
0 data prefetch.
> if (INTVAL (op3) == 1)
>   {
> if (INTVAL (op2) < 2 || INTVAL (op2) > 3)
> diff --git a/gcc/testsuite/gcc.target/i386/pr117416-1.c 
> b/gcc/testsuite/gcc.target/i386/pr117416-1.c
> new file mode 100644
> index 000..7062f27e21a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr117416-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O0" } */
> +
> +#include 
> +
> +void* p;
> +
> +void extern
> +prefetch_test (void)
> +{
> +  __builtin_ia32_prefetch (p, 5, 0, 0); /* { dg-warning "invalid second 
> argument to '__builtin_ia32_prefetch'; using zero" } */
> +}
> --
> 2.31.1
>


--
BR,
Hongtao


Re: [PATCH 0/2] Add arch support for Intel CPUs

2024-11-04 Thread Hongtao Liu
On Fri, Nov 1, 2024 at 11:24 AM Haochen Jiang  wrote:
>
> Hi all,
>
> I have just landed the new ISA patches on trunk. The next step will
> be arch support for the CPUs mentioned in ISE055.
>
> There are two CPU-related changes in ISE055:
>
>   - A new model number is added for Arrow Lake.
>   - Diamond Rapids Support is added.
>
> The following two patches will reflect those changes.
>
> Bootstrapped and tested on x86_64-pc-linux-gnu. Ok for trunk, and for the
> ARL patch backport to GCC 14?
Ok.
>
> Ref: https://cdrdv2.intel.com/v1/dl/getContent/671368
>
> Thx,
> Haochen
>
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Utilize VCOMSBF16 for BF16 Comparisons with AVX10.2

2024-11-03 Thread Hongtao Liu
On Fri, Nov 1, 2024 at 8:33 AM Hongyu Wang  wrote:
>
> From: Levy Hsu 
>
> This patch enables the use of the VCOMSBF16 instruction from AVX10.2 for
> efficient BF16 comparisons.
>
> Bootstrapped & regtested on x86-64-pc-linux-gnu.
> Ok for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/i386-expand.cc (ix86_expand_branch): Handle BFmode
> when TARGET_AVX10_2_256 is enabled.
> (ix86_prepare_fp_compare_args): Use SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P.
> (ix86_expand_fp_movcc): Ditto.
> (ix86_expand_fp_compare): Handle BFmode under IX86_FPCMP_COMI.
> * config/i386/i386.cc (ix86_multiplication_cost): Use
> SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P.
> (ix86_division_cost): Ditto.
> (ix86_rtx_costs): Ditto.
> (ix86_vector_costs::add_stmt_cost): Ditto.
> * config/i386/i386.h (SSE_FLOAT_MODE_SSEMATH_OR_HF_P): Rename to ...
> (SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P): ...this, and add BFmode.
> * config/i386/i386.md (*cmpibf): New define_insn.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_2-comibf-1.c: New test.
> * gcc.target/i386/avx10_2-comibf-2.c: Ditto.
> ---
>  gcc/config/i386/i386-expand.cc|  22 ++--
>  gcc/config/i386/i386.cc   |  22 ++--
>  gcc/config/i386/i386.h|   7 +-
>  gcc/config/i386/i386.md   |  33 +++--
>  .../gcc.target/i386/avx10_2-comibf-1.c|  40 ++
>  .../gcc.target/i386/avx10_2-comibf-2.c| 118 ++
>  6 files changed, 214 insertions(+), 28 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-comibf-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-comibf-2.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 0de0e842731..96e4659da10 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -2531,6 +2531,10 @@ ix86_expand_branch (enum rtx_code code, rtx op0, rtx 
> op1, rtx label)
>emit_jump_insn (gen_rtx_SET (pc_rtx, tmp));
>return;
>
> +case E_BFmode:
> +  gcc_assert (TARGET_AVX10_2_256 && !flag_trapping_math);
> +  goto simple;
> +
>  case E_DImode:
>if (TARGET_64BIT)
> goto simple;
> @@ -2797,9 +2801,9 @@ ix86_prepare_fp_compare_args (enum rtx_code code, rtx 
> *pop0, rtx *pop1)
>bool unordered_compare = ix86_unordered_fp_compare (code);
>rtx op0 = *pop0, op1 = *pop1;
>machine_mode op_mode = GET_MODE (op0);
> -  bool is_sse = SSE_FLOAT_MODE_SSEMATH_OR_HF_P (op_mode);
> +  bool is_sse = SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (op_mode);
>
> -  if (op_mode == BFmode)
> +  if (op_mode == BFmode && (!TARGET_AVX10_2_256 || flag_trapping_math))
>  {
>rtx op = gen_lowpart (HImode, op0);
>if (CONST_INT_P (op))
> @@ -2918,10 +2922,14 @@ ix86_expand_fp_compare (enum rtx_code code, rtx op0, 
> rtx op1)
>  {
>  case IX86_FPCMP_COMI:
>tmp = gen_rtx_COMPARE (CCFPmode, op0, op1);
> -  if (TARGET_AVX10_2_256 && (code == EQ || code == NE))
> -   tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), UNSPEC_OPTCOMX);
> -  if (unordered_compare)
> -   tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), UNSPEC_NOTRAP);
> +  /* We only have vcomsbf16, No vcomubf16 nor vcomxbf16 */
> +  if (GET_MODE (op0) != E_BFmode)
> +   {
> + if (TARGET_AVX10_2_256 && (code == EQ || code == NE))
> +   tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), 
> UNSPEC_OPTCOMX);
> + if (unordered_compare)
> +   tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), 
> UNSPEC_NOTRAP);
> +   }
>cmp_mode = CCFPmode;
>emit_insn (gen_rtx_SET (gen_rtx_REG (CCFPmode, FLAGS_REG), tmp));
>break;
> @@ -4636,7 +4644,7 @@ ix86_expand_fp_movcc (rtx operands[])
>&& !ix86_fp_comparison_operator (operands[1], VOIDmode))
>  return false;
>
> -  if (SSE_FLOAT_MODE_SSEMATH_OR_HF_P (mode))
> +  if (SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
>  {
>machine_mode cmode;
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 473e4cbf10e..6ac3a5d55f2 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -21324,7 +21324,7 @@ ix86_multiplication_cost (const struct 
> processor_costs *cost,
>if (VECTOR_MODE_P (mode))
>  inner_mode = GET_MODE_INNER (mode);
>
> -  if (SSE_FLOAT_MODE_SSEMATH_OR_HF_P (mode))
> +  if (SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
>  return inner_mode == DFmode ? cost->mulsd : cost->mulss;
>else if (X87_FLOAT_MODE_P (mode))
>  return cost->fmul;
> @@ -21449,7 +21449,7 @@ ix86_division_cost (const struct processor_costs 
> *cost,
>if (VECTOR_MODE_P (mode))
>  inner_mode = GET_MODE_INNER (mode);
>
> -  if (SSE_FLOAT_MODE_SSEMATH_OR_HF_P (mode))
> +  if (SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
>  return inner_mode == DFmode ? cost->divsd : cost->divss;
>  

Re: [PATCH v3 7/8] i386: Add else operand to masked loads.

2024-11-03 Thread Hongtao Liu
On Sat, Nov 2, 2024 at 8:58 PM Robin Dapp  wrote:
>
> From: Robin Dapp 
>
> This patch adds a zero else operand to masked loads, in particular the
> masked gather load builtins that are used for gather vectorization.
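The semantics the else operand pins down can be modeled in scalar C: inactive lanes no longer read memory or hold an undefined value, they take the else value (zero here). This is a sketch of the semantics only, not the expander itself:

```c
#include <stddef.h>

/* Scalar model of a masked load with an explicit else operand:
   active lanes read memory, inactive lanes take the else value.
   With a zero else operand every masked-out lane is well-defined.  */
void
maskload_else (const int *mem, const unsigned char *mask,
               const int *els, int *dst, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = mask[i] ? mem[i] : els[i];
}
```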
>
> gcc/ChangeLog:
>
> * config/i386/i386-expand.cc (ix86_expand_special_args_builtin):
> Add else-operand handling.
> (ix86_expand_builtin): Ditto.
> * config/i386/predicates.md (vcvtne2ps2bf_parallel): New
> predicate.
> (maskload_else_operand): Ditto.
> * config/i386/sse.md: Use predicate.
> ---
>  gcc/config/i386/i386-expand.cc |  26 ++--
>  gcc/config/i386/predicates.md  |   4 ++
>  gcc/config/i386/sse.md | 112 +
>  3 files changed, 97 insertions(+), 45 deletions(-)
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 0de0e842731..6c61f9f87c2 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -12995,10 +12995,11 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>  {
>tree arg;
>rtx pat, op;
> -  unsigned int i, nargs, arg_adjust, memory;
> +  unsigned int i, nargs, arg_adjust, memory = -1;
>unsigned int constant = 100;
>bool aligned_mem = false;
> -  rtx xops[4];
> +  rtx xops[4] = {};
> +  bool add_els = false;
>enum insn_code icode = d->icode;
>const struct insn_data_d *insn_p = &insn_data[icode];
>machine_mode tmode = insn_p->operand[0].mode;
> @@ -13125,6 +13126,9 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>  case V4DI_FTYPE_PCV4DI_V4DI:
>  case V4SI_FTYPE_PCV4SI_V4SI:
>  case V2DI_FTYPE_PCV2DI_V2DI:
> +  /* Two actual args but an additional else operand.  */
> +  add_els = true;
> +  /* Fallthru.  */
>  case VOID_FTYPE_INT_INT64:
>nargs = 2;
>klass = load;
> @@ -13397,6 +13401,12 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>xops[i]= op;
>  }
>
> +  if (add_els)
> +{
> +  xops[i] = CONST0_RTX (GET_MODE (xops[0]));
> +  nargs++;
> +}
> +
>switch (nargs)
>  {
>  case 0:
> @@ -13653,7 +13663,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx 
> subtarget,
>enum insn_code icode, icode2;
>tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
>tree arg0, arg1, arg2, arg3, arg4;
> -  rtx op0, op1, op2, op3, op4, pat, pat2, insn;
> +  rtx op0, op1, op2, op3, op4, opels, pat, pat2, insn;
>machine_mode mode0, mode1, mode2, mode3, mode4;
>unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl);
>HOST_WIDE_INT bisa, bisa2;
> @@ -15560,12 +15570,15 @@ rdseed_step:
>   op3 = copy_to_reg (op3);
>   op3 = lowpart_subreg (mode3, op3, GET_MODE (op3));
> }
> +
>if (!insn_data[icode].operand[5].predicate (op4, mode4))
> {
> -  error ("the last argument must be scale 1, 2, 4, 8");
> -  return const0_rtx;
> + error ("the last argument must be scale 1, 2, 4, 8");
> + return const0_rtx;
> }
>
> +  opels = CONST0_RTX (GET_MODE (subtarget));
> +
>/* Optimize.  If mask is known to have all high bits set,
>  replace op0 with pc_rtx to signal that the instruction
>  overwrites the whole destination and doesn't use its
> @@ -15634,7 +15647,8 @@ rdseed_step:
> }
> }
>
> -  pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4);
> +  pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4, opels);
> +
>if (! pat)
> return const0_rtx;
>emit_insn (pat);
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index 053312bbe27..7c7d8f61f11 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -2346,3 +2346,7 @@ (define_predicate "apx_evex_add_memory_operand"
>
>return true;
>  })
> +
> +(define_predicate "maskload_else_operand"
> +  (and (match_code "const_int,const_vector")
> +   (match_test "op == CONST0_RTX (GET_MODE (op))")))
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 36f8567b66f..41c1badbc00 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -28632,7 +28632,7 @@ (define_insn 
> "_maskstore"
> (set_attr "btver2_decode" "vector")
> (set_attr "mode" "")])
>
> -(define_expand "maskload"
> +(define_expand "maskload_1"
>[(set (match_operand:V48_128_256 0 "register_operand")
> (unspec:V48_128_256
>   [(match_operand: 2 "register_operand")
> @@ -28640,13 +28640,28 @@ (define_expand "maskload"
>   UNSPEC_MASKMOV))]
>"TARGET_AVX")
>
> +(define_expand "maskload"
> +  [(set (match_operand:V48_128_256 0 "register_operand")
> +   (unspec:V48_128_256
> + [(match_operand: 2 "register_operand")
> +  (match_operand:V48_128_256 1 "memory_operand")
> +  (match_operand:V48_128_256 3 "const0_operand"

Re: [PATCH] [APX PPX] Avoid generating unmatched pushp/popp in pro/epilogue

2024-10-30 Thread Hongtao Liu
On Thu, Jul 4, 2024 at 11:00 AM Hongtao Liu  wrote:
>
> On Tue, Jul 2, 2024 at 11:24 AM Hongyu Wang  wrote:
> >
> > Hi,
> >
> > According to the APX spec, pushp/popp pairs should be matched; otherwise
> > the PPX hint cannot take effect, causing a performance loss.
> >
> > In ix86_expand_epilogue, there are several optimizations that may cause
> > the epilogue to use mov to restore the regs. Check whether PPX is applied
> > and prevent the use of mov/leave in the epilogue.
> >
> > Bootstrapped/regtested on x86_64-pc-linux-gnu.
> >
> > Ok for trunk?
> Ok.
Please backport the fix to GCC14 branch.
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386.cc (ix86_expand_prologue): Set apx_ppx_used
> > flag in m.fs with TARGET_APX_PPX && !crtl->calls_eh_return.
> > (ix86_emit_save_regs): Emit ppx is available only when
> > TARGET_APX_PPX && !crtl->calls_eh_return.
> > (ix86_expand_epilogue): Don't restore reg using mov when
> > apx_ppx_used flag is true.
> > * config/i386/i386.h (struct machine_frame_state):
> > Add apx_ppx_used flag.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/apx-ppx-2.c: New test.
> > * gcc.target/i386/apx-ppx-3.c: Likewise.
> > ---
> >  gcc/config/i386/i386.cc   | 13 +
> >  gcc/config/i386/i386.h|  4 
> >  gcc/testsuite/gcc.target/i386/apx-ppx-2.c | 14 ++
> >  gcc/testsuite/gcc.target/i386/apx-ppx-3.c |  7 +++
> >  4 files changed, 34 insertions(+), 4 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ppx-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ppx-3.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index bd7411190af..99def8d4a77 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -7429,6 +7429,7 @@ ix86_emit_save_regs (void)
> >  {
> >int regno;
> >rtx_insn *insn;
> > +  bool use_ppx = TARGET_APX_PPX && !crtl->calls_eh_return;
> >
> >if (!TARGET_APX_PUSH2POP2
> >|| !ix86_can_use_push2pop2 ()
> > @@ -7438,7 +7439,7 @@ ix86_emit_save_regs (void)
> > if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
> >   {
> > insn = emit_insn (gen_push (gen_rtx_REG (word_mode, regno),
> > -   TARGET_APX_PPX));
> > +   use_ppx));
> > RTX_FRAME_RELATED_P (insn) = 1;
> >   }
> >  }
> > @@ -7469,7 +7470,7 @@ ix86_emit_save_regs (void)
> >   
> > regno_list[0]),
> >  gen_rtx_REG (word_mode,
> >   
> > regno_list[1]),
> > -TARGET_APX_PPX));
> > +use_ppx));
> > RTX_FRAME_RELATED_P (insn) = 1;
> > rtx dwarf = gen_rtx_SEQUENCE (VOIDmode, rtvec_alloc 
> > (3));
> >
> > @@ -7502,7 +7503,7 @@ ix86_emit_save_regs (void)
> > else
> >   {
> > insn = emit_insn (gen_push (gen_rtx_REG (word_mode, regno),
> > -   TARGET_APX_PPX));
> > +   use_ppx));
> > RTX_FRAME_RELATED_P (insn) = 1;
> > aligned = true;
> >   }
> > @@ -7511,7 +7512,7 @@ ix86_emit_save_regs (void)
> > {
> >   insn = emit_insn (gen_push (gen_rtx_REG (word_mode,
> >regno_list[0]),
> > - TARGET_APX_PPX));
> > + use_ppx));
> >   RTX_FRAME_RELATED_P (insn) = 1;
> > }
> >  }
> > @@ -8985,6 +8986,7 @@ ix86_expand_prologue (void)
> >if (!frame.save_regs_using_mov)
> > {
> >   ix86_emit_save_regs ();
> > + m->fs.apx_ppx_used = TARGET_APX_PPX && !crtl->calls_eh_return;
> >   int_registers_saved = true;
> >   gcc_assert (m->fs.sp_offset == frame.reg_save_offset);
> > }
> > @@ -9870,6 +9872,9 @@ ix86_expand_epilogue (int style)
> >   

Re: [PATCH v2 7/8] i386: Add else operand to masked loads.

2024-10-29 Thread Hongtao Liu
On Fri, Oct 18, 2024 at 10:23 PM Robin Dapp  wrote:
>
> This patch adds a zero else operand to masked loads, in particular the
> masked gather load builtins that are used for gather vectorization.
>
> gcc/ChangeLog:
>
> * config/i386/i386-expand.cc (ix86_expand_special_args_builtin):
> Add else-operand handling.
> (ix86_expand_builtin): Ditto.
> * config/i386/predicates.md (vcvtne2ps2bf_parallel): New
> predicate.
> (maskload_else_operand): Ditto.
> * config/i386/sse.md: Use predicate.
> ---
>  gcc/config/i386/i386-expand.cc |  26 +--
>  gcc/config/i386/predicates.md  |   4 ++
>  gcc/config/i386/sse.md | 124 -
>  3 files changed, 101 insertions(+), 53 deletions(-)
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 63f5e348d64..f6a2c2d65b8 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -12994,10 +12994,11 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>  {
>tree arg;
>rtx pat, op;
> -  unsigned int i, nargs, arg_adjust, memory;
> +  unsigned int i, nargs, arg_adjust, memory = -1;
>unsigned int constant = 100;
>bool aligned_mem = false;
> -  rtx xops[4];
> +  rtx xops[4] = {};
> +  bool add_els = false;
>enum insn_code icode = d->icode;
>const struct insn_data_d *insn_p = &insn_data[icode];
>machine_mode tmode = insn_p->operand[0].mode;
> @@ -13124,6 +13125,9 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>  case V4DI_FTYPE_PCV4DI_V4DI:
>  case V4SI_FTYPE_PCV4SI_V4SI:
>  case V2DI_FTYPE_PCV2DI_V2DI:
> +  /* Two actual args but an additional else operand.  */
> +  add_els = true;
> +  /* Fallthru.  */
>  case VOID_FTYPE_INT_INT64:
>nargs = 2;
>klass = load;
> @@ -13396,6 +13400,12 @@ ix86_expand_special_args_builtin (const struct 
> builtin_description *d,
>xops[i]= op;
>  }
>
> +  if (add_els)
> +{
> +  xops[i] = CONST0_RTX (GET_MODE (xops[0]));
> +  nargs++;
> +}
> +
>switch (nargs)
>  {
>  case 0:
> @@ -13652,7 +13662,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx 
> subtarget,
>enum insn_code icode, icode2;
>tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
>tree arg0, arg1, arg2, arg3, arg4;
> -  rtx op0, op1, op2, op3, op4, pat, pat2, insn;
> +  rtx op0, op1, op2, op3, op4, opels, pat, pat2, insn;
>machine_mode mode0, mode1, mode2, mode3, mode4;
>unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl);
>HOST_WIDE_INT bisa, bisa2;
> @@ -15559,12 +15569,15 @@ rdseed_step:
>   op3 = copy_to_reg (op3);
>   op3 = lowpart_subreg (mode3, op3, GET_MODE (op3));
> }
> +
>if (!insn_data[icode].operand[5].predicate (op4, mode4))
> {
> -  error ("the last argument must be scale 1, 2, 4, 8");
> -  return const0_rtx;
> + error ("the last argument must be scale 1, 2, 4, 8");
> + return const0_rtx;
> }
>
> +  opels = CONST0_RTX (GET_MODE (subtarget));
> +
>/* Optimize.  If mask is known to have all high bits set,
>  replace op0 with pc_rtx to signal that the instruction
>  overwrites the whole destination and doesn't use its
> @@ -15633,7 +15646,8 @@ rdseed_step:
> }
> }
>
> -  pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4);
> +  pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4, opels);
> +
>if (! pat)
> return const0_rtx;
>emit_insn (pat);
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index 053312bbe27..7c7d8f61f11 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -2346,3 +2346,7 @@ (define_predicate "apx_evex_add_memory_operand"
>
>return true;
>  })
> +
> +(define_predicate "maskload_else_operand"
> +  (and (match_code "const_int,const_vector")
> +   (match_test "op == CONST0_RTX (GET_MODE (op))")))
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index a45b50ad732..83955eee5a0 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1575,7 +1575,8 @@ (define_expand "_load_mask"
>  }
>else if (MEM_P (operands[1]))
>  operands[1] = gen_rtx_UNSPEC (mode,
> -gen_rtvec(1, operands[1]),
> +gen_rtvec(2, operands[1],
> +  CONST0_RTX (mode)),
>  UNSPEC_MASKLOAD);
>  })
>
> @@ -1583,7 +1584,8 @@ (define_insn "*_load_mask"
>[(set (match_operand:V48_AVX512VL 0 "register_operand" "=v")
> (vec_merge:V48_AVX512VL
>   (unspec:V48_AVX512VL
> -   [(match_operand:V48_AVX512VL 1 "memory_operand" "m")]
> +   [(match_operand:V48_AVX512VL 1 "memory_operand" "m")
> +(match_operand:V48_A

Re: [PATCH] testsuite: Adjust AVX10.2 check_effective_target

2024-10-29 Thread Hongtao Liu
On Tue, Oct 29, 2024 at 5:04 PM Haochen Jiang  wrote:
>
> Hi all,
>
> Since Binutils hasn't fully merged all AVX10.2 insns, testing only one
> insn/intrinsic is not sufficient for check_effective_target.
> As with APX_F, use inline asm to do the target check.
>
> Tested w/ and w/o Binutils with full AVX10.2 support. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
> gcc/testsuite/ChangeLog:
>
> PR target/117301
> * lib/target-supports.exp (check_effective_target_avx10_2):
> Use inline asm instead of intrin for check_effective_target.
> (check_effective_target_avx10_2_512): Ditto.
> ---
>  gcc/testsuite/lib/target-supports.exp | 34 +++
>  1 file changed, 14 insertions(+), 20 deletions(-)
>
> diff --git a/gcc/testsuite/lib/target-supports.exp 
> b/gcc/testsuite/lib/target-supports.exp
> index 70f74d1e288..9c65fd0fd7b 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -10748,17 +10748,14 @@ proc check_effective_target_apxf { } {
>  # Return 1 if avx10.2 instructions can be compiled.
>  proc check_effective_target_avx10_2 { } {
>  return [check_no_compiler_messages avx10.2 object {
> -   typedef int __v8si __attribute__ ((__vector_size__ (32)));
> -   typedef char __mmask8;
> -
> -   __v8si
> -   _mm256_mask_vpdpbssd_epi32 (__v8si __A, __mmask8 __U,
> -   __v8si __B, __v8si __C)
> +   void
> +   foo ()
> {
> - return (__v8si) __builtin_ia32_vpdpbssd_v8si_mask ((__v8si)__A,
> -(__v8si)__B,
> -(__v8si)__C,
> -(__mmask8)__U);
> + __asm__ volatile ("vdpphps\t%ymm4, %ymm5, %ymm6");
> + __asm__ volatile ("vcvthf82ph\t%xmm5, %ymm6");
> + __asm__ volatile ("vaddnepbf16\t%ymm4, %ymm5, %ymm6");
> + __asm__ volatile ("vcvtph2ibs\t%ymm5, %ymm6");
> + __asm__ volatile ("vminmaxpd\t$123, %ymm4, %ymm5, %ymm6");
> }
>  } "-mavx10.2" ]
>  }
> @@ -10766,17 +10763,14 @@ proc check_effective_target_avx10_2 { } {
>  # Return 1 if avx10.2-512 instructions can be compiled.
>  proc check_effective_target_avx10_2_512 { } {
>  return [check_no_compiler_messages avx10.2-512 object {
> -   typedef int __v16si __attribute__ ((__vector_size__ (64)));
> -   typedef short __mmask16;
> -
> -   __v16si
> -   _mm512_vpdpbssd_epi32 (__v16si __A, __mmask16 __U,
> -  __v16si __B, __v16si __C)
> +   void
> +   foo ()
> {
> - return (__v16si) __builtin_ia32_vpdpbssd_v16si_mask ((__v16si)__A,
> -  (__v16si)__B,
> -  (__v16si)__C,
> -  
> (__mmask16)__U);
> + __asm__ volatile ("vdpphps\t%zmm4, %zmm5, %zmm6");
> + __asm__ volatile ("vcvthf82ph\t%ymm5, %zmm6");
> + __asm__ volatile ("vaddnepbf16\t%zmm4, %zmm5, %zmm6");
> + __asm__ volatile ("vcvtph2ibs\t%zmm5, %zmm6");
> + __asm__ volatile ("vminmaxpd\t$123, %zmm4, %zmm5, %zmm6");
> }
>  } "-mavx10.2-512" ]
>  }
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH 0/7] Support Intel Diamond Rapid new features

2024-10-28 Thread Hongtao Liu
On Tue, Oct 22, 2024 at 2:31 PM Haochen Jiang  wrote:
>
> Hi all,
>
> ISE054 has just been released and you can find the doc here:
>
> https://cdrdv2.intel.com/v1/dl/getContent/671368
>
> Diamond Rapids features are added in this ISE, including AMX-related
> instructions, the SM4 EVEX extension, and MOVRS/PREFETCHRST2.
>
> The following seven patches will add all the new features into GCC.
>
> After these patches, we will add the Diamond Rapids arch option to
> GCC 15.
>
> Bootstrapped and tested on x86_64-pc-linux-gnu. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
>


-- 
BR,
Hongtao


Re: [PATCH] target: Fix asm codegen for vfpclasss* and vcvtph2* instructions

2024-10-24 Thread Hongtao Liu
On Fri, Oct 25, 2024 at 12:19 AM Antoni Boucher  wrote:
>
> Thanks.
> Did you review the new patch?
> Can I push it to master?
Ok.
>
> Le 2024-10-20 à 22 h 01, Hongtao Liu a écrit :
> > On Sat, Oct 19, 2024 at 2:06 AM Antoni Boucher  wrote:
> >>
> >> Thanks for the review.
> >> Here's the updated patch.
> >>
> >> Le 2024-10-17 à 21 h 50, Hongtao Liu a écrit :
> >>> On Fri, Oct 18, 2024 at 9:08 AM Antoni Boucher  wrote:
> >>>>
> >>>> Hi.
> >>>> This is a patch for the bug 116725.
> >>>> I'm not sure if it is a good fix, but it seems to do the job.
> >>>> If you have suggestions for better comments than what I wrote that would
> >>>> explain what's happening, I'm open to suggestions.
> >>>
> >>>> @@ -7548,7 +7548,8 @@ (define_insn 
> >>>> "avx512fp16_vcvtph2_<
> >>>>   [(match_operand: 1 "" 
> >>>> "")]
> >>>>   UNSPEC_US_FIX_NOTRUNC))]
> >>>> "TARGET_AVX512FP16 && "
> >>>> -  
> >>>> "vcvtph2\t{%1, 
> >>>> %0|%0, %1}"
> >>>> +;; %X1 so that we don't emit any *WORD PTR for -masm=intel.
> >>>> +  
> >>>> "vcvtph2\t{%1, 
> >>>> %0|%0, %X1}"
> >>> Could you define something like
> >>>
> >>>;; Pointer size override for 16-bit upper-convert modes (Intel asm 
> >>> dialect)
> >>>(define_mode_attr iptrh
> >>> [(V32HI "") (V16SI "") (V8DI "")
> >>>  (V16HI "") (V8SI "") (V4DI "q")
> >>>  (V8HI "") (V4SI "q") (V2DI "k")])
> >>
> >> For my own understanding, was my usage of %X equivalent to a mode_attr
> >> with an empty string for all cases?
> >> How did you know which one needed an empty string?
> >
> > It's in ix86_print_operand
> > 14155  else if (MEM_P (x))
> > 14156{
> > 14157  rtx addr = XEXP (x, 0);
> > 14158
> > 14159  /* No `byte ptr' prefix for call instructions ... */
> > 14160  if (ASSEMBLER_DIALECT == ASM_INTEL && code != 'X' && code != 'P')
> > 14161{
> > 14162  machine_mode mode = GET_MODE (x);
> > 14163  const char *size;
> > 14164
> > 14165  /* Check for explicit size override codes.  */
> > 14166  if (code == 'b')
> > 14167size = "BYTE";
> > 14168  else if (code == 'w')
> > 14169size = "WORD";
> > 14170  else if (code == 'k')
> > 14171size = "DWORD";
> > 14172  else if (code == 'q')
> > 14173size = "QWORD";
> > 14174  else if (code == 'x')
> > 14175size = "XMMWORD";
> > 14176  else if (code == 't')
> > 14177size = "YMMWORD";
> > 14178  else if (code == 'g')
> > 14179size = "ZMMWORD";
> > 14180  else if (mode == BLKmode)
> > 14181/* ... or BLKmode operands, when not overridden.  */
> > 14182size = NULL;
> > 14183  else
> > 14184switch (GET_MODE_SIZE (mode))
> > 14185  {
> > 14186  case 1: size = "BYTE"; break;
> >
> >>
> >>>
> >>> And use
> >>> +  "vcvtph2\t{%1,
> >>> %0|%0, %1}"
> >>>
> >>>> [(set_attr "type" "ssecvt")
> >>>>  (set_attr "prefix" "evex")
> >>>>  (set_attr "mode" "")])
> >>>> @@ -29854,7 +29855,8 @@ (define_insn 
> >>>> "avx512dq_vmfpclass"
> >>>>UNSPEC_FPCLASS)
> >>>>  (const_int 1)))]
> >>>>  "TARGET_AVX512DQ || VALID_AVX512FP16_REG_MODE(mode)"
> >>>> -   "vfpclass\t{%2, %1, 
> >>>> %0|%0, %1, %2}";
> >>>> +;; %X1 so that we don't emit any *WORD PTR for -masm=intel.
> >>>> +   "vfpclass\t{%2, %1, 
> >>>> %0|%0, %X1, %2}";
> >>>
> >>> For scalar memory operand rewrite, we usually use , so
> >>>  "vfpclass\t{%2, %1,
> >>> %0|%0,
> >>> %1, %2}";
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
>


-- 
BR,
Hongtao


Re: [PATCH] target: Fix asm codegen for vfpclasss* and vcvtph2* instructions

2024-10-20 Thread Hongtao Liu
On Sat, Oct 19, 2024 at 2:06 AM Antoni Boucher  wrote:
>
> Thanks for the review.
> Here's the updated patch.
>
> Le 2024-10-17 à 21 h 50, Hongtao Liu a écrit :
> > On Fri, Oct 18, 2024 at 9:08 AM Antoni Boucher  wrote:
> >>
> >> Hi.
> >> This is a patch for the bug 116725.
> >> I'm not sure if it is a good fix, but it seems to do the job.
> >> If you have suggestions for better comments than what I wrote that would
> >> explain what's happening, I'm open to suggestions.
> >
> >> @@ -7548,7 +7548,8 @@ (define_insn 
> >> "avx512fp16_vcvtph2_<
> >>  [(match_operand: 1 "" 
> >> "")]
> >>  UNSPEC_US_FIX_NOTRUNC))]
> >>"TARGET_AVX512FP16 && "
> >> -  "vcvtph2\t{%1, 
> >> %0|%0, %1}"
> >> +;; %X1 so that we don't emit any *WORD PTR for -masm=intel.
> >> +  "vcvtph2\t{%1, 
> >> %0|%0, %X1}"
> > Could you define something like
> >
> >   ;; Pointer size override for 16-bit upper-convert modes (Intel asm 
> > dialect)
> >   (define_mode_attr iptrh
> >[(V32HI "") (V16SI "") (V8DI "")
> > (V16HI "") (V8SI "") (V4DI "q")
> > (V8HI "") (V4SI "q") (V2DI "k")])
>
> For my own understanding, was my usage of %X equivalent to a mode_attr
> with an empty string for all cases?
> How did you know which one needed an empty string?

It's in ix86_print_operand
14155  else if (MEM_P (x))
14156{
14157  rtx addr = XEXP (x, 0);
14158
14159  /* No `byte ptr' prefix for call instructions ... */
14160  if (ASSEMBLER_DIALECT == ASM_INTEL && code != 'X' && code != 'P')
14161{
14162  machine_mode mode = GET_MODE (x);
14163  const char *size;
14164
14165  /* Check for explicit size override codes.  */
14166  if (code == 'b')
14167size = "BYTE";
14168  else if (code == 'w')
14169size = "WORD";
14170  else if (code == 'k')
14171size = "DWORD";
14172  else if (code == 'q')
14173size = "QWORD";
14174  else if (code == 'x')
14175size = "XMMWORD";
14176  else if (code == 't')
14177size = "YMMWORD";
14178  else if (code == 'g')
14179size = "ZMMWORD";
14180  else if (mode == BLKmode)
14181/* ... or BLKmode operands, when not overridden.  */
14182size = NULL;
14183  else
14184switch (GET_MODE_SIZE (mode))
14185  {
14186  case 1: size = "BYTE"; break;
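As a side note, the prefix-selection logic above can be modeled in a few lines of Python (an illustrative sketch only — the authoritative logic is the ix86_print_operand code quoted above):

```python
def intel_size_prefix(code, mode_size):
    """Pick the Intel-dialect memory size prefix, mirroring the
    ix86_print_operand excerpt above.  The 'X' (and 'P') operand
    modifiers suppress the prefix entirely, which is why %X1 avoids
    emitting any *WORD PTR under -masm=intel."""
    if code in ('X', 'P'):
        return None  # no "* PTR" size prefix at all
    # Explicit size override codes.
    overrides = {'b': 'BYTE', 'w': 'WORD', 'k': 'DWORD', 'q': 'QWORD',
                 'x': 'XMMWORD', 't': 'YMMWORD', 'g': 'ZMMWORD'}
    if code in overrides:
        return overrides[code]
    # Otherwise the prefix is derived from the operand's mode size.
    by_size = {1: 'BYTE', 2: 'WORD', 4: 'DWORD', 8: 'QWORD',
               16: 'XMMWORD', 32: 'YMMWORD', 64: 'ZMMWORD'}
    return by_size.get(mode_size)
```

So a mode_attr expanding to "q" behaves like passing code 'q' (forcing QWORD), while an empty attr string falls through to the mode-size default.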

>
> >
> > And use
> > +  "vcvtph2\t{%1,
> > %0|%0, %1}"
> >
> >>[(set_attr "type" "ssecvt")
> >> (set_attr "prefix" "evex")
> >> (set_attr "mode" "")])
> >> @@ -29854,7 +29855,8 @@ (define_insn 
> >> "avx512dq_vmfpclass"
> >>   UNSPEC_FPCLASS)
> >> (const_int 1)))]
> >> "TARGET_AVX512DQ || VALID_AVX512FP16_REG_MODE(mode)"
> >> -   "vfpclass\t{%2, %1, 
> >> %0|%0, %1, %2}";
> >> +;; %X1 so that we don't emit any *WORD PTR for -masm=intel.
> >> +   "vfpclass\t{%2, %1, 
> >> %0|%0, %X1, %2}";
> >
> > For scalar memory operand rewrite, we usually use , so
> > "vfpclass\t{%2, %1,
> > %0|%0,
> > %1, %2}";
> >
> >
> >
> >



-- 
BR,
Hongtao


Re: [PATCH] target: Fix asm codegen for vfpclasss* and vcvtph2* instructions

2024-10-17 Thread Hongtao Liu
On Fri, Oct 18, 2024 at 9:08 AM Antoni Boucher  wrote:
>
> Hi.
> This is a patch for the bug 116725.
> I'm not sure if it is a good fix, but it seems to do the job.
> If you have suggestions for better comments than what I wrote that would
> explain what's happening, I'm open to suggestions.

>@@ -7548,7 +7548,8 @@ (define_insn 
>"avx512fp16_vcvtph2_<
> [(match_operand: 1 "" 
> "")]
> UNSPEC_US_FIX_NOTRUNC))]
>   "TARGET_AVX512FP16 && "
>-  "vcvtph2\t{%1, 
>%0|%0, %1}"
>+;; %X1 so that we don't emit any *WORD PTR for -masm=intel.
>+  "vcvtph2\t{%1, 
>%0|%0, %X1}"
Could you define something like

 ;; Pointer size override for 16-bit upper-convert modes (Intel asm dialect)
 (define_mode_attr iptrh
  [(V32HI "") (V16SI "") (V8DI "")
   (V16HI "") (V8SI "") (V4DI "q")
   (V8HI "") (V4SI "q") (V2DI "k")])

And use
+  "vcvtph2\t{%1,
%0|%0, %1}"

>   [(set_attr "type" "ssecvt")
>(set_attr "prefix" "evex")
>(set_attr "mode" "")])
>@@ -29854,7 +29855,8 @@ (define_insn 
>"avx512dq_vmfpclass"
>  UNSPEC_FPCLASS)
>(const_int 1)))]
>"TARGET_AVX512DQ || VALID_AVX512FP16_REG_MODE(mode)"
>-   "vfpclass\t{%2, %1, 
>%0|%0, %1, %2}";
>+;; %X1 so that we don't emit any *WORD PTR for -masm=intel.
>+   "vfpclass\t{%2, %1, 
>%0|%0, %X1, %2}";

For scalar memory operand rewrite, we usually use , so
   "vfpclass\t{%2, %1,
%0|%0,
%1, %2}";




-- 
BR,
Hongtao


Re: [PATCH] testsuite: Fix typos for AVX10.2 convert testcases

2024-10-17 Thread Hongtao Liu
On Thu, Oct 17, 2024 at 3:17 PM Haochen Jiang  wrote:
>
> From: Victor Rodriguez 
>
> Hi all,
>
> There are some typos in AVX10.2 vcvtne[,2]ph[b,h]f8[,s] testcases.
> They will lead to type mismatches.
>
> Previously they were not found because the binutils support had not
> been checked in.
>
> Ok for trunk?
Ok.
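For the record, the MASK_MERGE/MASK_ZERO helpers select each destination element based on the corresponding mask bit, so passing the wrong element-type suffix (h instead of i_b) makes them operate on the wrong element count. A rough Python model of the semantics (illustrative only, not the actual testsuite macros):

```python
def mask_merge(dst, src, mask):
    """Writemask merge: take src[i] where mask bit i is set, else keep dst[i]."""
    return [s if (mask >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dst, src))]

def mask_zero(src, mask):
    """Zero-mask: take src[i] where mask bit i is set, else zero the element."""
    return [s if (mask >> i) & 1 else 0 for i, s in enumerate(src)]
```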
>
> Thx,
> Haochen
>
> ---
>
> Fix typos related to types for vcvtne[,2]ph[b,h]f8[,s] testcases.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c: Fix typo.
> * gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtne2ph2hf8s-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtneph2bf8-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtneph2bf8s-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtneph2hf8-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtneph2hf8s-2.c: Ditto.
> ---
>  .../gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c  | 10 +-
>  .../gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c | 10 +-
>  .../gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c  | 10 +-
>  .../gcc.target/i386/avx10_2-512-vcvtne2ph2hf8s-2.c | 10 +-
>  .../gcc.target/i386/avx10_2-512-vcvtneph2bf8-2.c   | 10 +-
>  .../gcc.target/i386/avx10_2-512-vcvtneph2bf8s-2.c  | 10 +-
>  .../gcc.target/i386/avx10_2-512-vcvtneph2hf8-2.c   | 10 +-
>  .../gcc.target/i386/avx10_2-512-vcvtneph2hf8s-2.c  | 10 +-
>  8 files changed, 40 insertions(+), 40 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c
> index 0dd58ee710e..7e7865d64fe 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c
> @@ -65,16 +65,16 @@ TEST (void)
>CALC(res_ref, src1.a, src2.a);
>
>res1.x = INTRINSIC (_cvtne2ph_pbf8) (src1.x, src2.x);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> +  if (UNION_CHECK (AVX512F_LEN, i_b) (res1, res_ref))
>  abort ();
>
>res2.x = INTRINSIC (_mask_cvtne2ph_pbf8) (res2.x, mask, src1.x, src2.x);
> -  MASK_MERGE (h) (res_ref, mask, SIZE);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> +  MASK_MERGE (i_b) (res_ref, mask, SIZE);
> +  if (UNION_CHECK (AVX512F_LEN, i_b) (res2, res_ref))
>  abort ();
>
>res3.x = INTRINSIC (_maskz_cvtne2ph_pbf8) (mask, src1.x, src2.x);
> -  MASK_ZERO (h) (res_ref, mask, SIZE);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> +  MASK_ZERO (i_b) (res_ref, mask, SIZE);
> +  if (UNION_CHECK (AVX512F_LEN, i_b) (res3, res_ref))
>  abort ();
>  }
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c
> index 5e3ea3e37a4..0ca0c420ff7 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c
> @@ -65,16 +65,16 @@ TEST (void)
>CALC(res_ref, src1.a, src2.a);
>
>res1.x = INTRINSIC (_cvtnes2ph_pbf8) (src1.x, src2.x);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> +  if (UNION_CHECK (AVX512F_LEN, i_b) (res1, res_ref))
>  abort ();
>
>res2.x = INTRINSIC (_mask_cvtnes2ph_pbf8) (res2.x, mask, src1.x, src2.x);
> -  MASK_MERGE (h) (res_ref, mask, SIZE);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> +  MASK_MERGE (i_b) (res_ref, mask, SIZE);
> +  if (UNION_CHECK (AVX512F_LEN, i_b) (res2, res_ref))
>  abort ();
>
>res3.x = INTRINSIC (_maskz_cvtnes2ph_pbf8) (mask, src1.x, src2.x);
> -  MASK_ZERO (h) (res_ref, mask, SIZE);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> +  MASK_ZERO (i_b) (res_ref, mask, SIZE);
> +  if (UNION_CHECK (AVX512F_LEN, i_b) (res3, res_ref))
>  abort ();
>  }
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c
> index aa928b582b3..97afd395bb5 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c
> @@ -65,16 +65,16 @@ TEST (void)
>CALC(res_ref, src1.a, src2.a);
>
>res1.x = INTRINSIC (_cvtne2ph_phf8) (src1.x, src2.x);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> +  if (UNION_CHECK (AVX512F_LEN, i_b) (res1, res_ref))
>  abort ();
>
>res2.x = INTRINSIC (_mask_cvtne2ph_phf8) (res2.x, mask, src1.x, src2.x);
> -  MASK_MERGE (h) (res_ref, mask, SIZE);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> +  MASK_MERGE (i_b) (res_ref, mask, SIZE);
> +  if (UNION_CHECK (AVX512F_LEN, i_b) (res2, res_ref))
>  abort ();
>
>res3.x = INTRINSIC (_maskz_cvtne2ph_phf8) (mask, src1.x, src2.x);
> -  MASK_ZERO (h) (res_ref, mask, SIZE);
> -  if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref))
> 

Re: [PATCH] [RFC] target/117072 - more RTL FMA canonicalization

2024-10-14 Thread Hongtao Liu
On Mon, Oct 14, 2024 at 1:50 PM Richard Biener  wrote:
>
> On Mon, 14 Oct 2024, Hongtao Liu wrote:
>
> > On Sun, Oct 13, 2024 at 8:02 PM Richard Biener  wrote:
> > >
> > > On Sun, 13 Oct 2024, Hongtao Liu wrote:
> > >
> > > > On Fri, Oct 11, 2024 at 8:33 PM Hongtao Liu  wrote:
> > > > >
> > > > > On Fri, Oct 11, 2024 at 8:22 PM Richard Biener  
> > > > > wrote:
> > > > > >
> > > > > > The following helps the x86 backend by canonicalizing FMAs to have
> > > > > > any negation done to one of the commutative multiplication operands
> > > > > > be done to a register (and not a memory operand).  Likewise to
> > > > > > put a register operand first and a memory operand second;
> > > > > > swap_commutative_operands_p seems to treat REG_P and MEM_P the
> > > > > > same but comments indicate "complex expressions should be first".
> > > > > >
> > > > > > In particular this does (fma MEM REG REG) -> (fma REG MEM REG) and
> > > > > > (fma (neg MEM) REG REG) -> (fma (neg REG) MEM REG) which are the
> > > > > > reasons for the testsuite regressions in 
> > > > > > gcc.target/i386/cond_op_fma*.c
> > > > > >
> > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > >
> > > > > > I'm not quite sure this is the correct approach - simplify-rtx
> > > > > > doesn't seem to do "only canonicalization" but the existing FMA
> > > > > > case looks odd in that context.
> > > > > >
> > > > > > Should the target simply reject cases with wrong "canonicalization"
> > > > > > or does it need to cope with all variants in the patterns that fail
> > > > > > matching during combine without the change?
> > > > > Let me try the backend fix first.
> > > > The backend fix requires at least 8 more patterns, so I think RTL
> > > > canonicalization would be better.
> > > > Please go ahead with the patch.
> > >
> > > I'm still looking for insights on how we usually canonicalize on RTL
> > > (and where we document canonicalizations) and how we maintain RTL
> > > in canonical form.
> > >
> > > I'm also still wondering why the backend accepts "non-canonical" RTL
> > > instead of rejecting it, giving the middle-end the chance to try
> > > an alternative variant?
> > So you mean the middle-end will always canonicalize (fma: reg mem reg)
> > to (fma: mem reg reg)?
> > I only saw that RTL canonicalizes (fma: a (neg: b) c) to (fma: (neg a) b c).
> > Or what do you mean by "non-canonical" RTL in the backend?
>
> "non-canonical" RTL in the backend is what the patterns in question
> for this bugreport do not accept.  But maybe I'm missing something here.
>
> IIRC there's code somewhere in combine to try several "canonical"
> varaints of an insns in recog_for_combine, but as said I'm not very
> familiar with how things work on RTL to decide what's conceptually the
> correct thing to do here.  I just discvered simplify_rtx already does some
> minor canonicalization for FMAs ...
I think FMA itself is fine, and reg or mem operands are fine too; the
problem is with masked FMA (with multiplication operands).
There are two forms:
(vec_merge: (fma: op1 op2 op3) op1 mask) or
(vec_merge: (fma: op2 op1 op3) op1 mask)
They're the same instruction since op1 and op2 are commutative.
Either the middle-end should canonicalize them, or the backend should
add an extra alternative to match the potential optimization.

So maybe always canonicalizing to (vec_merge: (fma: op1 op2 op3) op1
mask) in combine.cc's maybe_swap_commutative_operands, together with
the backend patch (which relaxes the predicates plus some other
potential changes for the pattern), should fix the issue.

>
> Richard.
>
> > >
> > > Richard.
> > >
> > > > > >
> > > > > > Thanks,
> > > > > > Richard.
> > > > > >
> > > > > > PR target/117072
> > > > > > * simplify-rtx.cc 
> > > > > > (simplify_context::simplify_ternary_operation):
> > > > > > Adjust FMA canonicalization.
> > > > > > ---
> > > > > >  gcc/simplify-rtx.cc | 15 +--
> > > > > >  1 file changed, 13 insertions(+), 2 deletions(-)

Re: [PATCH] [RFC] target/117072 - more RTL FMA canonicalization

2024-10-13 Thread Hongtao Liu
On Sun, Oct 13, 2024 at 8:02 PM Richard Biener  wrote:
>
> On Sun, 13 Oct 2024, Hongtao Liu wrote:
>
> > On Fri, Oct 11, 2024 at 8:33 PM Hongtao Liu  wrote:
> > >
> > > On Fri, Oct 11, 2024 at 8:22 PM Richard Biener  wrote:
> > > >
> > > > The following helps the x86 backend by canonicalizing FMAs to have
> > > > any negation done to one of the commutative multiplication operands
> > > > be done to a register (and not a memory operand).  Likewise to
> > > > put a register operand first and a memory operand second;
> > > > swap_commutative_operands_p seems to treat REG_P and MEM_P the
> > > > same but comments indicate "complex expressions should be first".
> > > >
> > > > In particular this does (fma MEM REG REG) -> (fma REG MEM REG) and
> > > > (fma (neg MEM) REG REG) -> (fma (neg REG) MEM REG) which are the
> > > > reasons for the testsuite regressions in gcc.target/i386/cond_op_fma*.c
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > >
> > > > I'm not quite sure this is the correct approach - simplify-rtx
> > > > doesn't seem to do "only canonicalization" but the existing FMA
> > > > case looks odd in that context.
> > > >
> > > > Should the target simply reject cases with wrong "canonicalization"
> > > > or does it need to cope with all variants in the patterns that fail
> > > > matching during combine without the change?
> > > Let me try the backend fix first.
> > The backend fix requires at least 8 more patterns, so I think RTL
> > canonicalization would be better.
> > Please go ahead with the patch.
>
> I'm still looking for insights on how we usually canonicalize on RTL
> (and where we document canonicalizations) and how we maintain RTL
> in canonical form.
>
> I'm also still wondering why the backend accepts "non-canonical" RTL
> instead of rejecting it, giving the middle-end the chance to try
> an alternative variant?
So you mean the middle-end will always canonicalize (fma: reg mem reg)
to (fma: mem reg reg)?
I only saw that RTL canonicalizes (fma: a (neg: b) c) to (fma: (neg a) b c).
Or what do you mean by "non-canonical" RTL in the backend?

>
> Richard.
>
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > PR target/117072
> > > > * simplify-rtx.cc 
> > > > (simplify_context::simplify_ternary_operation):
> > > > Adjust FMA canonicalization.
> > > > ---
> > > >  gcc/simplify-rtx.cc | 15 +--
> > > >  1 file changed, 13 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> > > > index e8e60404ef6..8b4fa0d7aa4 100644
> > > > --- a/gcc/simplify-rtx.cc
> > > > +++ b/gcc/simplify-rtx.cc
> > > > @@ -6830,10 +6830,21 @@ simplify_context::simplify_ternary_operation 
> > > > (rtx_code code, machine_mode mode,
> > > > op0 = tem, op1 = XEXP (op1, 0), any_change = true;
> > > > }
> > > >
> > > > -  /* Canonicalize the two multiplication operands.  */
> > > > +  /* Canonicalize the two multiplication operands.  A negation
> > > > +should go first and if possible the negation should be
> > > > +to a register.  */
> > > >/* a * -b + c  =>  -b * a + c.  */
> > > > -  if (swap_commutative_operands_p (op0, op1))
> > > > +  if (swap_commutative_operands_p (op0, op1)
> > > > + || (REG_P (op1) && !REG_P (op0) && GET_CODE (op0) != NEG))
> > > > std::swap (op0, op1), any_change = true;
> > > > +  else if (GET_CODE (op0) == NEG && !REG_P (XEXP (op0, 0))
> > > > +  && REG_P (op1))
> > > > +   {
> > > > + op0 = XEXP (op0, 0);
> > > > + op1 = simplify_gen_unary (NEG, mode, op1, mode);
> > > > + std::swap (op0, op1);
> > > > + any_change = true;
> > > > +   }
> > > >
> > > >if (any_change)
> > > > return gen_rtx_FMA (mode, op0, op1, op2);
> > > > --
> > > > 2.43.0
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> >
> >
>
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> HRB 36809 (AG Nuernberg)



-- 
BR,
Hongtao


Re: [PATCH] [RFC] target/117072 - more RTL FMA canonicalization

2024-10-13 Thread Hongtao Liu
On Fri, Oct 11, 2024 at 8:33 PM Hongtao Liu  wrote:
>
> On Fri, Oct 11, 2024 at 8:22 PM Richard Biener  wrote:
> >
> > The following helps the x86 backend by canonicalizing FMAs to have
> > any negation done to one of the commutative multiplication operands
> > be done to a register (and not a memory operand).  Likewise to
> > put a register operand first and a memory operand second;
> > swap_commutative_operands_p seems to treat REG_P and MEM_P the
> > same but comments indicate "complex expressions should be first".
> >
> > In particular this does (fma MEM REG REG) -> (fma REG MEM REG) and
> > (fma (neg MEM) REG REG) -> (fma (neg REG) MEM REG) which are the
> > reasons for the testsuite regressions in gcc.target/i386/cond_op_fma*.c
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > I'm not quite sure this is the correct approach - simplify-rtx
> > doesn't seem to do "only canonicalization" but the existing FMA
> > case looks odd in that context.
> >
> > Should the target simply reject cases with wrong "canonicalization"
> > or does it need to cope with all variants in the patterns that fail
> > matching during combine without the change?
> Let me try the backend fix first.
The backend fix requires at least 8 more patterns, so I think RTL
canonicalization would be better.
Please go ahead with the patch.
> >
> > Thanks,
> > Richard.
> >
> > PR target/117072
> > * simplify-rtx.cc (simplify_context::simplify_ternary_operation):
> > Adjust FMA canonicalization.
> > ---
> >  gcc/simplify-rtx.cc | 15 +--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> > index e8e60404ef6..8b4fa0d7aa4 100644
> > --- a/gcc/simplify-rtx.cc
> > +++ b/gcc/simplify-rtx.cc
> > @@ -6830,10 +6830,21 @@ simplify_context::simplify_ternary_operation 
> > (rtx_code code, machine_mode mode,
> > op0 = tem, op1 = XEXP (op1, 0), any_change = true;
> > }
> >
> > -  /* Canonicalize the two multiplication operands.  */
> > +  /* Canonicalize the two multiplication operands.  A negation
> > +should go first and if possible the negation should be
> > +to a register.  */
> >/* a * -b + c  =>  -b * a + c.  */
> > -  if (swap_commutative_operands_p (op0, op1))
> > +  if (swap_commutative_operands_p (op0, op1)
> > + || (REG_P (op1) && !REG_P (op0) && GET_CODE (op0) != NEG))
> > std::swap (op0, op1), any_change = true;
> > +  else if (GET_CODE (op0) == NEG && !REG_P (XEXP (op0, 0))
> > +  && REG_P (op1))
> > +   {
> > + op0 = XEXP (op0, 0);
> > + op1 = simplify_gen_unary (NEG, mode, op1, mode);
> > + std::swap (op0, op1);
> > + any_change = true;
> > +   }
> >
> >if (any_change)
> > return gen_rtx_FMA (mode, op0, op1, op2);
> > --
> > 2.43.0
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] [RFC] target/117072 - more RTL FMA canonicalization

2024-10-11 Thread Hongtao Liu
On Fri, Oct 11, 2024 at 8:22 PM Richard Biener  wrote:
>
> The following helps the x86 backend by canonicalizing FMAs to have
> any negation done to one of the commutative multiplication operands
> be done to a register (and not a memory operand).  Likewise to
> put a register operand first and a memory operand second;
> swap_commutative_operands_p seems to treat REG_P and MEM_P the
> same but comments indicate "complex expressions should be first".
>
> In particular this does (fma MEM REG REG) -> (fma REG MEM REG) and
> (fma (neg MEM) REG REG) -> (fma (neg REG) MEM REG) which are the
> reasons for the testsuite regressions in gcc.target/i386/cond_op_fma*.c
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> I'm not quite sure this is the correct approach - simplify-rtx
> doesn't seem to do "only canonicalization" but the existing FMA
> case looks odd in that context.
>
> Should the target simply reject cases with wrong "canonicalization"
> or does it need to cope with all variants in the patterns that fail
> matching during combine without the change?
Let me try the backend fix first.
>
> Thanks,
> Richard.
>
> PR target/117072
> * simplify-rtx.cc (simplify_context::simplify_ternary_operation):
> Adjust FMA canonicalization.
> ---
>  gcc/simplify-rtx.cc | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index e8e60404ef6..8b4fa0d7aa4 100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -6830,10 +6830,21 @@ simplify_context::simplify_ternary_operation 
> (rtx_code code, machine_mode mode,
> op0 = tem, op1 = XEXP (op1, 0), any_change = true;
> }
>
> -  /* Canonicalize the two multiplication operands.  */
> +  /* Canonicalize the two multiplication operands.  A negation
> +should go first and if possible the negation should be
> +to a register.  */
>/* a * -b + c  =>  -b * a + c.  */
> -  if (swap_commutative_operands_p (op0, op1))
> +  if (swap_commutative_operands_p (op0, op1)
> + || (REG_P (op1) && !REG_P (op0) && GET_CODE (op0) != NEG))
> std::swap (op0, op1), any_change = true;
> +  else if (GET_CODE (op0) == NEG && !REG_P (XEXP (op0, 0))
> +  && REG_P (op1))
> +   {
> + op0 = XEXP (op0, 0);
> + op1 = simplify_gen_unary (NEG, mode, op1, mode);
> + std::swap (op0, op1);
> + any_change = true;
> +   }
>
>if (any_change)
> return gen_rtx_FMA (mode, op0, op1, op2);
> --
> 2.43.0



-- 
BR,
Hongtao


Re: [PATCH] x86: Implement Fast-Math Float Truncation to BF16 via PSRLD Instruction

2024-10-09 Thread Hongtao Liu
On Tue, Oct 8, 2024 at 3:24 PM Levy Hsu  wrote:
>
> Bootstrapped and tested on x86_64-linux-gnu, OK for trunk?
Ok.
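For reference, the truncation being implemented just drops the low 16 bits of the IEEE-754 binary32 encoding, which is exactly what psrld does per lane under -ffast-math; a quick Python model of the bit manipulation (mirroring the testcase's CALC helper, no rounding or NaN handling):

```python
import struct

def trunc_sf_bf16_bits(f):
    """Truncate float32 -> bfloat16 by shifting out the low 16 bits of
    the encoding (the psrld $16 trick; valid only when NaNs need not be
    honored and unsafe-math is on)."""
    (bits,) = struct.unpack('<I', struct.pack('<f', f))
    return bits >> 16

def bf16_to_float(bf_bits):
    """Widen bfloat16 bits back to float32 (always exact)."""
    (f,) = struct.unpack('<f', struct.pack('<I', bf_bits << 16))
    return f
```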
>
> gcc/ChangeLog:
>
> * config/i386/i386.md: Rewrite insn truncsfbf2.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/truncsfbf-1.c: New test.
> * gcc.target/i386/truncsfbf-2.c: New test.
> ---
>  gcc/config/i386/i386.md | 16 ++---
>  gcc/testsuite/gcc.target/i386/truncsfbf-1.c |  9 +++
>  gcc/testsuite/gcc.target/i386/truncsfbf-2.c | 65 +
>  3 files changed, 83 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/truncsfbf-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/truncsfbf-2.c
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 9c2a0aa6112..d3fee0968d8 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -5672,16 +5672,18 @@
> (set_attr "mode" "HF")])
>
>  (define_insn "truncsfbf2"
> -  [(set (match_operand:BF 0 "register_operand" "=x, v")
> +  [(set (match_operand:BF 0 "register_operand" "=x,x,v,Yv")
> (float_truncate:BF
> - (match_operand:SF 1 "register_operand" "x,v")))]
> -  "((TARGET_AVX512BF16 && TARGET_AVX512VL) || TARGET_AVXNECONVERT)
> -   && !HONOR_NANS (BFmode) && flag_unsafe_math_optimizations"
> + (match_operand:SF 1 "register_operand" "0,x,v,Yv")))]
> +  "TARGET_SSE2 && flag_unsafe_math_optimizations && !HONOR_NANS (BFmode)"
>"@
> +  psrld\t{$16, %0|%0, 16}
>%{vex%} vcvtneps2bf16\t{%1, %0|%0, %1}
> -  vcvtneps2bf16\t{%1, %0|%0, %1}"
> -  [(set_attr "isa" "avxneconvert,avx512bf16vl")
> -   (set_attr "prefix" "vex,evex")])
> +  vcvtneps2bf16\t{%1, %0|%0, %1}
> +  vpsrld\t{$16, %1, %0|%0, %1, 16}"
> +  [(set_attr "isa" "noavx,avxneconvert,avx512bf16vl,avx")
> +   (set_attr "prefix" "orig,vex,evex,vex")
> +   (set_attr "type" "sseishft1,ssecvt,ssecvt,sseishft1")])
>
>  ;; Signed conversion to DImode.
>
> diff --git a/gcc/testsuite/gcc.target/i386/truncsfbf-1.c b/gcc/testsuite/gcc.target/i386/truncsfbf-1.c
> new file mode 100644
> index 000..dd3ff8a50b4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/truncsfbf-1.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-msse2 -O2 -ffast-math" } */
> +/* { dg-final { scan-assembler-times "psrld" 1 } } */
> +
> +__bf16
> +foo (float a)
> +{
> +  return a;
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/truncsfbf-2.c b/gcc/testsuite/gcc.target/i386/truncsfbf-2.c
> new file mode 100644
> index 000..f4952f88fc9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/truncsfbf-2.c
> @@ -0,0 +1,65 @@
> +/* { dg-do run } */
> +/* { dg-options "-msse2 -O2 -ffast-math" } */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +__bf16
> +foo (float a)
> +{
> +  return a;
> +}
> +
> +static __bf16
> +CALC (float *a)
> +{
> +  uint32_t bits;
> +  memcpy (&bits, a, sizeof (bits));
> +  bits >>= 16;
> +  uint16_t bfloat16_bits = (uint16_t) bits;
> +  __bf16 bf16;
> +  memcpy (&bf16, &bfloat16_bits, sizeof (bf16));
> +  return bf16;
> +}
> +
> +int
> +main (void)
> +{
> +  float test_values[] = { 0.0f, -0.0f, 1.0f, -1.0f, 0.5f, -0.5f, 1000.0f, 
> -1000.0f,
> +  3.1415926f, -3.1415926f, 1e-8f, -1e-8f,
> +  1.0e+38f, -1.0e+38f, 1.0e-38f, -1.0e-38f };
> +  size_t num_values = sizeof (test_values) / sizeof (test_values[0]);
> +
> +  for (size_t i = 0; i < num_values; ++i)
> +{
> +  float original = test_values[i];
> +  __bf16 hw_bf16 = foo (original);
> +  __bf16 sw_bf16 = CALC (&original);
> +
> +  /* Verify psrld $16, %0 == %0 >> 16 */
> +  if (memcmp (&hw_bf16, &sw_bf16, sizeof (__bf16)) != 0)
> +abort ();
> +
> +  /* Reconstruct the float value from the __bf16 bits */
> +  uint16_t bf16_bits;
> +  memcpy (&bf16_bits, &hw_bf16, sizeof (bf16_bits));
> +  uint32_t reconstructed_bits = ((uint32_t) bf16_bits) << 16;
> +  float converted;
> +  memcpy (&converted, &reconstructed_bits, sizeof (converted));
> +
> +  float diff = fabsf (original - converted);
> +
> +  /* Expected Maximum Precision Loss */
> +  uint32_t orig_bits;
> +  memcpy (&orig_bits, &original, sizeof (orig_bits));
> +  int exponent = ((orig_bits >> 23) & 0xFF) - 127;
> +  float expected_loss = (exponent == -127)
> +? ldexpf (1.0f, -126 - 7)
> +: ldexpf (1.0f, exponent - 7);
> +  if (diff > expected_loss)
> +abort ();
> +}
> +  return 0;
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH v2 2/2] Adjust testcase after relax O2 vectorization.

2024-10-08 Thread Hongtao Liu
On Tue, Oct 8, 2024 at 4:56 PM Richard Biener
 wrote:
>
> On Tue, Oct 8, 2024 at 10:36 AM liuhongt  wrote:
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/fstack-protector-strong.c: Adjust
> > scan-assembler-times.
> > * gcc.dg/graphite/scop-6.c: Add
> > -Wno-aggressive-loop-optimizations.
> > * gcc.dg/graphite/scop-9.c: Ditto.
> > * gcc.dg/tree-ssa/ivopts-lt-2.c: Add -fno-tree-vectorize.
> > * gcc.dg/tree-ssa/ivopts-lt.c: Ditto.
> > * gcc.dg/tree-ssa/loop-16.c: Ditto.
> > * gcc.dg/tree-ssa/loop-28.c: Ditto.
> > * gcc.dg/tree-ssa/loop-bound-2.c: Ditto.
> > * gcc.dg/tree-ssa/loop-bound-4.c: Ditto.
> > * gcc.dg/tree-ssa/loop-bound-6.c: Ditto.
> > * gcc.dg/tree-ssa/predcom-4.c: Ditto.
> > * gcc.dg/tree-ssa/predcom-5.c: Ditto.
> > * gcc.dg/tree-ssa/scev-11.c: Ditto.
> > * gcc.dg/tree-ssa/scev-9.c: Ditto.
> > * gcc.dg/tree-ssa/split-path-11.c: Ditto.
> > * gcc.dg/unroll-8.c: Ditto.
> > * gcc.dg/var-expand1.c: Ditto.
> > * gcc.dg/vect/vect-cost-model-6.c: Ditto.
> > * gcc.target/i386/pr86270.c: Ditto.
> > * gcc.target/i386/pr86722.c: Ditto.
> > * gcc.target/x86_64/abi/callabi/leaf-2.c: Ditto.
> > ---
> >  gcc/testsuite/gcc.dg/fstack-protector-strong.c   | 2 +-
> >  gcc/testsuite/gcc.dg/graphite/scop-6.c   | 1 +
> >  gcc/testsuite/gcc.dg/graphite/scop-9.c   | 1 +
> >  gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c  | 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c| 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/loop-16.c  | 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/loop-28.c  | 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/loop-bound-2.c | 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/loop-bound-4.c | 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/loop-bound-6.c | 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c| 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c| 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/scev-11.c  | 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/scev-9.c   | 2 +-
> >  gcc/testsuite/gcc.dg/tree-ssa/split-path-11.c| 2 +-
> >  gcc/testsuite/gcc.dg/unroll-8.c  | 3 +--
> >  gcc/testsuite/gcc.dg/var-expand1.c   | 2 +-
> >  gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c| 2 +-
> >  gcc/testsuite/gcc.target/i386/pr86270.c  | 2 +-
> >  gcc/testsuite/gcc.target/i386/pr86722.c  | 2 +-
> >  gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c | 2 +-
> >  21 files changed, 21 insertions(+), 20 deletions(-)
> >
> > diff --git a/gcc/testsuite/gcc.dg/fstack-protector-strong.c 
> > b/gcc/testsuite/gcc.dg/fstack-protector-strong.c
> > index 94dc3508f1a..b9f63966b7c 100644
> > --- a/gcc/testsuite/gcc.dg/fstack-protector-strong.c
> > +++ b/gcc/testsuite/gcc.dg/fstack-protector-strong.c
> > @@ -154,4 +154,4 @@ void foo12 ()
> >global3 ();
> >  }
> >
> > -/* { dg-final { scan-assembler-times "stack_chk_fail" 12 } } */
> > +/* { dg-final { scan-assembler-times "stack_chk_fail" 11 } } */
> > diff --git a/gcc/testsuite/gcc.dg/graphite/scop-6.c 
> > b/gcc/testsuite/gcc.dg/graphite/scop-6.c
> > index 9bc1d9f4ccd..6ea887d9041 100644
> > --- a/gcc/testsuite/gcc.dg/graphite/scop-6.c
> > +++ b/gcc/testsuite/gcc.dg/graphite/scop-6.c
> > @@ -26,4 +26,5 @@ int toto()
> >return a[3][5] + b[2];
> >  }
>
> The testcase looks bogus:
>
>b[i+k] = b[i+k-5] + 2;
>
> accesses b[-3], can you instead adjust the inner loop to start with k == 4?
>
> > +/* { dg-additional-options "-Wno-aggressive-loop-optimizations" } */
> >  /* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite"} } 
> > */
> > diff --git a/gcc/testsuite/gcc.dg/graphite/scop-9.c 
> > b/gcc/testsuite/gcc.dg/graphite/scop-9.c
> > index b19291be2f8..2a36bf92fd4 100644
> > --- a/gcc/testsuite/gcc.dg/graphite/scop-9.c
> > +++ b/gcc/testsuite/gcc.dg/graphite/scop-9.c
> > @@ -21,4 +21,5 @@ int toto()
> >return a[3][5] + b[2];
> >  }
>
> Likewise.
>
> > +/* { dg-additional-options "-Wno-aggressive-loop-optimizations" } */
> >  /* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite"} } 
> > */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c
> > index bdbdbff19ff..be325775fbb 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O2 -fno-tree-loop-distribute-patterns 
> > -fdump-tree-ivopts" } */
> > +/* { dg-options "-O2 -fno-tree-vectorize 
> > -fno-tree-loop-distribute-patterns -fdump-tree-ivopts" } */
> >  /* { dg-skip-if "PR68644" { hppa*-*-* powerpc*-*-* } } */
> >
> >  void
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/iv

Re: [PATCH v2] x86/{,V}AES: adjust when to force EVEX encoding

2024-10-08 Thread Hongtao Liu
On Tue, Oct 8, 2024 at 3:00 PM Jan Beulich  wrote:
>
> On 08.10.2024 08:54, Hongtao Liu wrote:
> > On Mon, Sep 30, 2024 at 3:33 PM Jan Beulich  wrote:
> >>
> >> Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly
> >> said "..., but we need to emit {evex} prefix in the assembly if AES ISA
> >> is not enabled". Yet it did so only for the TARGET_AES insns. Going from
> >> the alternative chosen in the TARGET_VAES insns isn't quite right: If
> >> AES is (also) enabled, EVEX encoding would needlessly be forced.
> >>
> >> gcc/
> >>
> >> * config/i386/sse.md (vaesdec_<mode>, vaesdeclast_<mode>,
> >> vaesenc_<mode>, vaesenclast_<mode>): Replace which_alternative
> >> check by TARGET_AES one.
> >> ---
> >> As an aside - {evex} (and other) pseudo-prefixes would better be avoided
> >> anyway whenever possible, as those are getting in the way of code
> >> putting in place macro overrides for certain insns: gas 2.43 rejects
> >> such bogus placement of pseudo-prefixes.
> >>
> >> Is it, btw, correct that none of these insns have a "prefix" attribute?
> > There's some automatic logic in i386.md to determine the prefix; it's
> > rough, not very accurate.
> >
> >   688;; Prefix used: original, VEX or maybe VEX.
> >   689(define_attr "prefix" "orig,vex,maybe_vex,evex,maybe_evex"
> >   690  (cond [(eq_attr "mode" "OI,V8SF,V4DF")
> >   691   (const_string "vex")
> >   692 (eq_attr "mode" "XI,V16SF,V8DF")
> >   693   (const_string "evex")
> >   694 (eq_attr "type" "ssemuladd")
> >   695   (if_then_else (eq_attr "isa" "fma4")
> >   696 (const_string "vex")
> >   697 (const_string "maybe_evex"))
> >   698 (eq_attr "type" "sse4arg")
> >   699   (const_string "vex")
> >   700]
> >   701(const_string "orig")))
> >   702
>
> I'm aware, and I raised the question because it seemed pretty clear to
> me that it wouldn't get things right here.
AFAIK it's mainly used by the length attribute to determine the code size
of instructions (ix86_min_insn_size), and the rough model should be
sufficient for most cases.
>
> >> ---
> >> v2: Adjust (shrink) description.
> > Ok for the patch.
>
> Thanks. What about the 14.x branch?
Also Ok.
>
> Jan



-- 
BR,
Hongtao


Re: [PATCH v2] x86/{,V}AES: adjust when to force EVEX encoding

2024-10-07 Thread Hongtao Liu
On Mon, Sep 30, 2024 at 3:33 PM Jan Beulich  wrote:
>
> Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly
> said "..., but we need to emit {evex} prefix in the assembly if AES ISA
> is not enabled". Yet it did so only for the TARGET_AES insns. Going from
> the alternative chosen in the TARGET_VAES insns isn't quite right: If
> AES is (also) enabled, EVEX encoding would needlessly be forced.
>
> gcc/
>
> * config/i386/sse.md (vaesdec_<mode>, vaesdeclast_<mode>,
> vaesenc_<mode>, vaesenclast_<mode>): Replace which_alternative
> check by TARGET_AES one.
> ---
> As an aside - {evex} (and other) pseudo-prefixes would better be avoided
> anyway whenever possible, as those are getting in the way of code
> putting in place macro overrides for certain insns: gas 2.43 rejects
> such bogus placement of pseudo-prefixes.
>
> Is it, btw, correct that none of these insns have a "prefix" attribute?
There's some automatic logic in i386.md to determine the prefix; it's
rough, not very accurate.

  688;; Prefix used: original, VEX or maybe VEX.
  689(define_attr "prefix" "orig,vex,maybe_vex,evex,maybe_evex"
  690  (cond [(eq_attr "mode" "OI,V8SF,V4DF")
  691   (const_string "vex")
  692 (eq_attr "mode" "XI,V16SF,V8DF")
  693   (const_string "evex")
  694 (eq_attr "type" "ssemuladd")
  695   (if_then_else (eq_attr "isa" "fma4")
  696 (const_string "vex")
  697 (const_string "maybe_evex"))
  698 (eq_attr "type" "sse4arg")
  699   (const_string "vex")
  700]
  701(const_string "orig")))
  702

> ---
> v2: Adjust (shrink) description.
Ok for the patch.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -30802,7 +30802,7 @@
>   UNSPEC_VAESDEC))]
>"TARGET_VAES"
>  {
> -  if (which_alternative == 0 && <MODE>mode == V16QImode)
> +  if (!TARGET_AES && <MODE>mode == V16QImode)
>  return "%{evex%} vaesdec\t{%2, %1, %0|%0, %1, %2}";
>else
>  return "vaesdec\t{%2, %1, %0|%0, %1, %2}";
> @@ -30816,7 +30816,7 @@
>   UNSPEC_VAESDECLAST))]
>"TARGET_VAES"
>  {
> -  if (which_alternative == 0 && <MODE>mode == V16QImode)
> +  if (!TARGET_AES && <MODE>mode == V16QImode)
>  return "%{evex%} vaesdeclast\t{%2, %1, %0|%0, %1, %2}";
>else
>  return "vaesdeclast\t{%2, %1, %0|%0, %1, %2}";
> @@ -30830,7 +30830,7 @@
>   UNSPEC_VAESENC))]
>"TARGET_VAES"
>  {
> -  if (which_alternative == 0 && <MODE>mode == V16QImode)
> +  if (!TARGET_AES && <MODE>mode == V16QImode)
>  return "%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}";
>else
>  return "vaesenc\t{%2, %1, %0|%0, %1, %2}";
> @@ -30844,7 +30844,7 @@
>   UNSPEC_VAESENCLAST))]
>"TARGET_VAES"
>  {
> -  if (which_alternative == 0 && <MODE>mode == V16QImode)
> +  if (!TARGET_AES && <MODE>mode == V16QImode)
>  return "%{evex%} vaesenclast\t{%2, %1, %0|%0, %1, %2}";
>else
>  return "vaesenclast\t{%2, %1, %0|%0, %1, %2}";



-- 
BR,
Hongtao


Re: [PATCH] x86: Extend AVX512 Vectorization for Popcount in Various Modes

2024-09-25 Thread Hongtao Liu
On Tue, Sep 24, 2024 at 10:16 AM Levy Hsu  wrote:
>
> This patch enables vectorization of the popcount operation for V2QI, V4QI,
> V8QI, V2HI, V4HI, and V2SI modes.
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/mmx.md:
> > (VI1_16_32_64): New mode iterator for 8-byte, 4-byte and 2-byte
> > QImode.
> > (popcount<mode>2): New pattern for popcount of V2QI/V4QI/V8QI mode.
> > (popcount<mode>2): New pattern for popcount of V2HI/V4HI mode.
> (popcountv2si2): New pattern for popcount of V2SI mode.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/part-vect-popcount-1.c: New test.
> ---
>  gcc/config/i386/mmx.md| 24 +
>  .../gcc.target/i386/part-vect-popcount-1.c| 49 +++
>  2 files changed, 73 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-popcount-1.c
>
> diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> index 4bc191b874b..147ae150bf3 100644
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -70,6 +70,9 @@
>  ;; 8-byte and 4-byte HImode vector modes
>  (define_mode_iterator VI2_32_64 [(V4HI "TARGET_MMX_WITH_SSE") V2HI])
>
> +;; 8-byte, 4-byte and 2-byte QImode vector modes
> +(define_mode_iterator VI1_16_32_64 [(V8QI "TARGET_MMX_WITH_SSE") V4QI V2QI])
> +
>  ;; 4-byte and 2-byte integer vector modes
>  (define_mode_iterator VI_16_32 [V4QI V2QI V2HI])
>
> @@ -6786,3 +6789,24 @@
>[(set_attr "type" "mmx")
> (set_attr "modrm" "0")
> (set_attr "memory" "none")])
> +
> +(define_insn "popcount<mode>2"
> +  [(set (match_operand:VI1_16_32_64 0 "register_operand" "=v")
> +   (popcount:VI1_16_32_64
> + (match_operand:VI1_16_32_64 1 "register_operand" "v")))]
> +  "TARGET_AVX512VL && TARGET_AVX512BITALG"
> +  "vpopcntb\t{%1, %0|%0, %1}")
> +
> +(define_insn "popcount<mode>2"
> +  [(set (match_operand:VI2_32_64 0 "register_operand" "=v")
> +   (popcount:VI2_32_64
> + (match_operand:VI2_32_64 1 "register_operand" "v")))]
> +  "TARGET_AVX512VL && TARGET_AVX512BITALG"
> +  "vpopcntw\t{%1, %0|%0, %1}")
> +
> +(define_insn "popcountv2si2"
> +  [(set (match_operand:V2SI 0 "register_operand" "=v")
> +   (popcount:V2SI
> + (match_operand:V2SI 1 "register_operand" "v")))]
> +  "TARGET_AVX512VPOPCNTDQ && TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
> +  "vpopcntd\t{%1, %0|%0, %1}")
> diff --git a/gcc/testsuite/gcc.target/i386/part-vect-popcount-1.c 
> b/gcc/testsuite/gcc.target/i386/part-vect-popcount-1.c
> new file mode 100644
> index 000..a30f6ec4726
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/part-vect-popcount-1.c
> @@ -0,0 +1,49 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512vpopcntdq -mavx512bitalg -mavx512vl" } */
> +/* { dg-final { scan-assembler-times "vpopcntd\[^\n\r\]*xmm\[0-9\]" 1 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpopcntw\[^\n\r\]*xmm\[0-9\]" 3 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vpopcntw\[^\n\r\]*xmm\[0-9\]" 2 { 
> target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vpopcntb\[^\n\r\]*xmm\[0-9\]" 4 { 
> target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vpopcntb\[^\n\r\]*xmm\[0-9\]" 3 { 
> target { ! ia32 } } } } */
> +
> +void
> +foo1 (int* a, int* __restrict b)
> +{
> +  for (int i = 0; i != 2; i++)
> +a[i] = __builtin_popcount (b[i]);
> +}
> +
> +void
> +foo2 (unsigned short* a, unsigned short* __restrict b)
> +{
> +  for (int i = 0; i != 4; i++)
> +a[i] = __builtin_popcount (b[i]);
> +}
> +
> +void
> +foo3 (unsigned short* a, unsigned short* __restrict b)
> +{
> +  for (int i = 0; i != 2; i++)
> +a[i] = __builtin_popcount (b[i]);
> +}
> +
> +void
> +foo4 (unsigned char* a, unsigned char* __restrict b)
> +{
> +  for (int i = 0; i != 8; i++)
> +a[i] = __builtin_popcount (b[i]);
> +}
> +
> +void
> +foo5 (unsigned char* a, unsigned char* __restrict b)
> +{
> +  for (int i = 0; i != 4; i++)
> +a[i] = __builtin_popcount (b[i]);
> +}
> +
> +void
> +foo6 (unsigned char* a, unsigned char* __restrict b)
> +{
> +  for (int i = 0; i != 2; i++)
> +a[i] = __builtin_popcount (b[i]);
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386, v2: Add GENERIC and GIMPLE folders of __builtin_ia32_{min,max}* [PR116738]

2024-09-25 Thread Hongtao Liu
On Wed, Sep 25, 2024 at 4:42 PM Jakub Jelinek  wrote:
>
> On Wed, Sep 25, 2024 at 10:17:50AM +0800, Hongtao Liu wrote:
> > > +   for (int i = 0; i < 2; ++i)
> > > + {
> > > +   unsigned count = vector_cst_encoded_nelts (args[i]), j;
> > > +   for (j = 0; j < count; ++j)
> > > + if (!tree_expr_nan_p (VECTOR_CST_ENCODED_ELT (args[i], 
> > > j)))
> > Is this a typo? I assume you want to check if the component is NAN, so
> > tree_expr_nan_p, not !tree_expr_nan_p?
> > > +   break;
> > > +   if (j < count)
> > > + break;
> > Also this break just breaks the outer loop (for (int i = 0; i < 2;
> > i++)), but according to the comments, it wants to break out of the outer switch?
>
> You're right, thanks for catching that.  Fortunately both meant just that
> it got NaNs optimized too and optimized the rest as it should.
>
> I just wanted to avoid return NULL_TREE; or goto and screwed it up.
>
> Here is a fixed version, tested additionally on looking at gimple dump on
> typedef float __v4sf __attribute__((vector_size (16)));
> __v4sf foo (void) { return __builtin_ia32_minss ((__v4sf) { __builtin_nanf 
> (""), 0.f, 0.f, 0.f }, (__v4sf) { __builtin_inff (), 1.0f, 2.0f, 3.0f }); }
> __v4sf bar (void) { return __builtin_ia32_minss ((__v4sf) { -__builtin_inff 
> (), 0.f, 0.f, 0.f }, (__v4sf) { __builtin_inff (), 1.0f, 2.0f, 3.0f }); }
>
> Ok for trunk if it passes bootstrap/regtest?
Ok.
>
> 2024-09-25  Jakub Jelinek  
>
> PR target/116738
> * config/i386/i386.cc (ix86_fold_builtin): Handle
> IX86_BUILTIN_M{IN,AX}{S,P}{S,H,D}*.
> (ix86_gimple_fold_builtin): Handle IX86_BUILTIN_M{IN,AX}P{S,H,D}*.
>
> * gcc.target/i386/avx512f-pr116738-1.c: New test.
> * gcc.target/i386/avx512f-pr116738-2.c: New test.
>
> --- gcc/config/i386/i386.cc.jj  2024-09-24 18:54:24.120313544 +0200
> +++ gcc/config/i386/i386.cc 2024-09-25 10:21:00.922417024 +0200
> @@ -18507,6 +18507,8 @@ ix86_fold_builtin (tree fndecl, int n_ar
> = (enum ix86_builtins) DECL_MD_FUNCTION_CODE (fndecl);
>enum rtx_code rcode;
>bool is_vshift;
> +  enum tree_code tcode;
> +  bool is_scalar;
>unsigned HOST_WIDE_INT mask;
>
>switch (fn_code)
> @@ -18956,6 +18958,131 @@ ix86_fold_builtin (tree fndecl, int n_ar
> }
>   break;
>
> +   case IX86_BUILTIN_MINSS:
> +   case IX86_BUILTIN_MINSH_MASK:
> + tcode = LT_EXPR;
> + is_scalar = true;
> + goto do_minmax;
> +
> +   case IX86_BUILTIN_MAXSS:
> +   case IX86_BUILTIN_MAXSH_MASK:
> + tcode = GT_EXPR;
> + is_scalar = true;
> + goto do_minmax;
> +
> +   case IX86_BUILTIN_MINPS:
> +   case IX86_BUILTIN_MINPD:
> +   case IX86_BUILTIN_MINPS256:
> +   case IX86_BUILTIN_MINPD256:
> +   case IX86_BUILTIN_MINPS512:
> +   case IX86_BUILTIN_MINPD512:
> +   case IX86_BUILTIN_MINPS128_MASK:
> +   case IX86_BUILTIN_MINPD128_MASK:
> +   case IX86_BUILTIN_MINPS256_MASK:
> +   case IX86_BUILTIN_MINPD256_MASK:
> +   case IX86_BUILTIN_MINPH128_MASK:
> +   case IX86_BUILTIN_MINPH256_MASK:
> +   case IX86_BUILTIN_MINPH512_MASK:
> + tcode = LT_EXPR;
> + is_scalar = false;
> + goto do_minmax;
> +
> +   case IX86_BUILTIN_MAXPS:
> +   case IX86_BUILTIN_MAXPD:
> +   case IX86_BUILTIN_MAXPS256:
> +   case IX86_BUILTIN_MAXPD256:
> +   case IX86_BUILTIN_MAXPS512:
> +   case IX86_BUILTIN_MAXPD512:
> +   case IX86_BUILTIN_MAXPS128_MASK:
> +   case IX86_BUILTIN_MAXPD128_MASK:
> +   case IX86_BUILTIN_MAXPS256_MASK:
> +   case IX86_BUILTIN_MAXPD256_MASK:
> +   case IX86_BUILTIN_MAXPH128_MASK:
> +   case IX86_BUILTIN_MAXPH256_MASK:
> +   case IX86_BUILTIN_MAXPH512_MASK:
> + tcode = GT_EXPR;
> + is_scalar = false;
> +   do_minmax:
> + gcc_assert (n_args >= 2);
> + if (TREE_CODE (args[0]) != VECTOR_CST
> + || TREE_CODE (args[1]) != VECTOR_CST)
> +   break;
> + mask = HOST_WIDE_INT_M1U;
> + if (n_args > 2)
> +   {
> + gcc_assert (n_args >= 4);
> + /* This is masked minmax.  */
> + if (TREE_CODE (args[3]) != INTEGER_CST
> + || TREE_SIDE_EFFECTS (args[2]))
> +   break;
> + mask = TREE_INT_CST_LOW (args[3]);
> + unsigned elems = TYPE_VECTOR_SUBPARTS (

Re: [PATCH] x86/{,V}AES: adjust when to force EVEX encoding

2024-09-25 Thread Hongtao Liu
On Wed, Sep 25, 2024 at 3:55 PM Jan Beulich  wrote:
>
> On 25.09.2024 09:38, Hongtao Liu wrote:
> > On Wed, Sep 25, 2024 at 2:56 PM Jan Beulich  wrote:
> >>
> >> Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly
> >> said "..., but we need to emit {evex} prefix in the assembly if AES ISA
> >> is not enabled". Yet it did so only for the TARGET_AES insns. Going from
> >> the alternative chosen in the TARGET_VAES insns is wrong for two
> >> reasons:
> >> - if, with AES disabled, the latter alternative was chosen despite no
> >>   "high" XMM register nor any eGPR in use, gas would still pick the AES
> > w/o EVEX SSE REG or EGPR, the first alternative will always be
> > matched (alternative 0).
> > That is how it works (match from left to right).
>
> Well, if that's guaranteed to always be the case, then ...
>
> >>   (VEX) encoding when no {evex} pseudo-prefix is in use (which is
> >>   against - as stated by the description of said commit - AES presently
> >>   not being considered a prereq of VAES in gcc);
> >
> >> - if AES is (also) enabled, EVEX encoding would needlessly be forced.
> > So it's more like an optimization that uses VEX encoding when AES is enabled?
>
> ... in a way it's an optimization, yes. I can adjust the description
> accordingly. However, it's not _just_ an optimization, it also is a
> fix for compilation (really: assembly) failing in ...
>
> >> ---
> >> As an aside - {evex} (and other) pseudo-prefixes would better be avoided
> >> anyway whenever possible, as those are getting in the way of code
> >> putting in place macro overrides for certain insns: gas 2.43 rejects
> >> such bogus placement of pseudo-prefixes.
So it sounds like a workaround in GCC to avoid the gas bug?

In general, I'm ok with the patch since we already do that in the
TARGET_AES patterns.

27060(define_insn "aesenc"
27061  [(set (match_operand:V2DI 0 "register_operand" "=x,x,v")
27062(unspec:V2DI [(match_operand:V2DI 1 "register_operand"
"0,x,v")
27063   (match_operand:V2DI 2 "vector_operand"
"xja,xjm,vm")]
27064  UNSPEC_AESENC))]
27065  "TARGET_AES || (TARGET_VAES && TARGET_AVX512VL)"
27066  "@
27067   aesenc\t{%2, %0|%0, %2}
27068   * return TARGET_AES ? \"vaesenc\t{%2, %1, %0|%0, %1, %2}\" :
\"%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}\";
27069   vaesenc\t{%2, %1, %0|%0, %1, %2}"
27070  [(set_attr "isa" "noavx,avx,vaes_avx512vl")


>
> ... cases like this (which is how I actually came to notice the issue).
>
> Jan



--
BR,
Hongtao


Re: [PATCH] x86/{,V}AES: adjust when to force EVEX encoding

2024-09-25 Thread Hongtao Liu
On Wed, Sep 25, 2024 at 2:56 PM Jan Beulich  wrote:
>
> Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly
> said "..., but we need to emit {evex} prefix in the assembly if AES ISA
> is not enabled". Yet it did so only for the TARGET_AES insns. Going from
> the alternative chosen in the TARGET_VAES insns is wrong for two
> reasons:
> - if, with AES disabled, the latter alternative was chosen despite no
>   "high" XMM register nor any eGPR in use, gas would still pick the AES
w/o EVEX SSE REG or EGPR, the first alternative will always be
matched (alternative 0).
That is how it works (match from left to right).
>   (VEX) encoding when no {evex} pseudo-prefix is in use (which is
>   against - as stated by the description of said commit - AES presently
>   not being considered a prereq of VAES in gcc);

> - if AES is (also) enabled, EVEX encoding would needlessly be forced.
So it's more like an optimization that uses VEX encoding when AES is enabled?
>
> gcc/
>
> * config/i386/sse.md (vaesdec_<mode>, vaesdeclast_<mode>,
> vaesenc_<mode>, vaesenclast_<mode>): Replace which_alternative
> check by TARGET_AES one.
> ---
> As an aside - {evex} (and other) pseudo-prefixes would better be avoided
> anyway whenever possible, as those are getting in the way of code
> putting in place macro overrides for certain insns: gas 2.43 rejects
> such bogus placement of pseudo-prefixes.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -30802,7 +30802,7 @@
>   UNSPEC_VAESDEC))]
>"TARGET_VAES"
>  {
> -  if (which_alternative == 0 && <MODE>mode == V16QImode)
> +  if (!TARGET_AES && <MODE>mode == V16QImode)
>  return "%{evex%} vaesdec\t{%2, %1, %0|%0, %1, %2}";
>else
>  return "vaesdec\t{%2, %1, %0|%0, %1, %2}";
> @@ -30816,7 +30816,7 @@
>   UNSPEC_VAESDECLAST))]
>"TARGET_VAES"
>  {
> -  if (which_alternative == 0 && <MODE>mode == V16QImode)
> +  if (!TARGET_AES && <MODE>mode == V16QImode)

>  return "%{evex%} vaesdeclast\t{%2, %1, %0|%0, %1, %2}";
>else
>  return "vaesdeclast\t{%2, %1, %0|%0, %1, %2}";
> @@ -30830,7 +30830,7 @@
>   UNSPEC_VAESENC))]
>"TARGET_VAES"
>  {
> -  if (which_alternative == 0 && <MODE>mode == V16QImode)
> +  if (!TARGET_AES && <MODE>mode == V16QImode)
>  return "%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}";
>else
>  return "vaesenc\t{%2, %1, %0|%0, %1, %2}";
> @@ -30844,7 +30844,7 @@
>   UNSPEC_VAESENCLAST))]
>"TARGET_VAES"
>  {
> -  if (which_alternative == 0 && <MODE>mode == V16QImode)
> +  if (!TARGET_AES && <MODE>mode == V16QImode)
>  return "%{evex%} vaesenclast\t{%2, %1, %0|%0, %1, %2}";
>else
>  return "vaesenclast\t{%2, %1, %0|%0, %1, %2}";



-- 
BR,
Hongtao


Re: [PATCH] i386: Add GENERIC and GIMPLE folders of __builtin_ia32_{min,max}* [PR116738]

2024-09-24 Thread Hongtao Liu
On Wed, Sep 25, 2024 at 1:07 AM Jakub Jelinek  wrote:
>
> Hi!
>
> The following patch adds GENERIC and GIMPLE folders for various
> x86 min/max builtins.
> As discussed, these builtins have effectively x < y ? x : y
> (or x > y ? x : y) behavior.
> The GENERIC folding is done if all the (relevant) arguments are
> constants (such as VECTOR_CST for vectors) and is done because
> the GIMPLE folding can't easily handle masking, rounding and the
> ss/sd cases (in a way that it would be pattern recognized back to the
> corresponding instructions).  The GIMPLE folding is also done just
> for TARGET_SSE4 or later when optimizing, otherwise it is apparently
> not matched back.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2024-09-24  Jakub Jelinek  
>
> PR target/116738
> * config/i386/i386.cc (ix86_fold_builtin): Handle
> IX86_BUILTIN_M{IN,AX}{S,P}{S,H,D}*.
> (ix86_gimple_fold_builtin): Handle IX86_BUILTIN_M{IN,AX}P{S,H,D}*.
>
> * gcc.target/i386/avx512f-pr116738-1.c: New test.
> * gcc.target/i386/avx512f-pr116738-2.c: New test.
>
> --- gcc/config/i386/i386.cc.jj  2024-09-12 10:56:57.344683959 +0200
> +++ gcc/config/i386/i386.cc 2024-09-23 15:15:40.154783766 +0200
> @@ -18507,6 +18507,8 @@ ix86_fold_builtin (tree fndecl, int n_ar
> = (enum ix86_builtins) DECL_MD_FUNCTION_CODE (fndecl);
>enum rtx_code rcode;
>bool is_vshift;
> +  enum tree_code tcode;
> +  bool is_scalar;
>unsigned HOST_WIDE_INT mask;
>
>switch (fn_code)
> @@ -18956,6 +18958,133 @@ ix86_fold_builtin (tree fndecl, int n_ar
> }
>   break;
>
> +   case IX86_BUILTIN_MINSS:
> +   case IX86_BUILTIN_MINSH_MASK:
> + tcode = LT_EXPR;
> + is_scalar = true;
> + goto do_minmax;
> +
> +   case IX86_BUILTIN_MAXSS:
> +   case IX86_BUILTIN_MAXSH_MASK:
> + tcode = GT_EXPR;
> + is_scalar = true;
> + goto do_minmax;
> +
> +   case IX86_BUILTIN_MINPS:
> +   case IX86_BUILTIN_MINPD:
> +   case IX86_BUILTIN_MINPS256:
> +   case IX86_BUILTIN_MINPD256:
> +   case IX86_BUILTIN_MINPS512:
> +   case IX86_BUILTIN_MINPD512:
> +   case IX86_BUILTIN_MINPS128_MASK:
> +   case IX86_BUILTIN_MINPD128_MASK:
> +   case IX86_BUILTIN_MINPS256_MASK:
> +   case IX86_BUILTIN_MINPD256_MASK:
> +   case IX86_BUILTIN_MINPH128_MASK:
> +   case IX86_BUILTIN_MINPH256_MASK:
> +   case IX86_BUILTIN_MINPH512_MASK:
> + tcode = LT_EXPR;
> + is_scalar = false;
> + goto do_minmax;
> +
> +   case IX86_BUILTIN_MAXPS:
> +   case IX86_BUILTIN_MAXPD:
> +   case IX86_BUILTIN_MAXPS256:
> +   case IX86_BUILTIN_MAXPD256:
> +   case IX86_BUILTIN_MAXPS512:
> +   case IX86_BUILTIN_MAXPD512:
> +   case IX86_BUILTIN_MAXPS128_MASK:
> +   case IX86_BUILTIN_MAXPD128_MASK:
> +   case IX86_BUILTIN_MAXPS256_MASK:
> +   case IX86_BUILTIN_MAXPD256_MASK:
> +   case IX86_BUILTIN_MAXPH128_MASK:
> +   case IX86_BUILTIN_MAXPH256_MASK:
> +   case IX86_BUILTIN_MAXPH512_MASK:
> + tcode = GT_EXPR;
> + is_scalar = false;
> +   do_minmax:
> + gcc_assert (n_args >= 2);
> + if (TREE_CODE (args[0]) != VECTOR_CST
> + || TREE_CODE (args[1]) != VECTOR_CST)
> +   break;
> + mask = HOST_WIDE_INT_M1U;
> + if (n_args > 2)
> +   {
> + gcc_assert (n_args >= 4);
> + /* This is masked minmax.  */
> + if (TREE_CODE (args[3]) != INTEGER_CST
> + || TREE_SIDE_EFFECTS (args[2]))
> +   break;
> + mask = TREE_INT_CST_LOW (args[3]);
> + unsigned elems = TYPE_VECTOR_SUBPARTS (TREE_TYPE (args[0]));
> + mask |= HOST_WIDE_INT_M1U << elems;
> + if (mask != HOST_WIDE_INT_M1U
> + && TREE_CODE (args[2]) != VECTOR_CST)
> +   break;
> + if (n_args >= 5)
> +   {
> + if (!tree_fits_uhwi_p (args[4]))
> +   break;
> + if (tree_to_uhwi (args[4]) != 4
> + && tree_to_uhwi (args[4]) != 8)
> +   break;
> +   }
> + if (mask == (HOST_WIDE_INT_M1U << elems))
> +   return args[2];
> +   }
> + /* Punt on NaNs, unless exceptions are disabled.  */
> + if (HONOR_NANS (args[0])
> + && (n_args < 5 || tree_to_uhwi (args[4]) != 8))
> +   for (int i = 0; i < 2; ++i)
> + {
> +   unsigned count = vector_cst_encoded_nelts (args[i]), j;
> +   for (j = 0; j < count; ++j)
> + if (!tree_expr_nan_p (VECTOR_CST_ENCODED_ELT (args[i], j)))
Is this a typo? I assume you want to check if the component is NAN, so
tree_expr_nan_p, not !tree_expr_nan_p?
> +   break;
> +   if (j < count)
> +

Re: [PATCH] [x86] Define VECTOR_STORE_FLAG_VALUE

2024-09-24 Thread Hongtao Liu
On Tue, Sep 24, 2024 at 5:46 PM Uros Bizjak  wrote:
>
> On Tue, Sep 24, 2024 at 11:23 AM liuhongt  wrote:
> >
> > Return constm1_rtx when GET_MODE_CLASS (MODE) == MODE_VECTOR_INT.
> > Otherwise NULL_RTX.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ready push to trunk.
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386.h (VECTOR_STORE_FLAG_VALUE): New macro.
> >
> > gcc/testsuite/ChangeLog:
> > * gcc.dg/rtl/x86_64/vector_eq.c: New test.
> > ---
> >  gcc/config/i386/i386.h  |  5 +++-
> >  gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c | 26 +
> >  2 files changed, 30 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index c1ec92ffb15..b12be41424f 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -899,7 +899,10 @@ extern const char *host_detect_local_cpu (int argc, 
> > const char **argv);
> > and give entire struct the alignment of an int.  */
> >  /* Required on the 386 since it doesn't have bit-field insns.  */
> >  #define PCC_BITFIELD_TYPE_MATTERS 1
> > -
> > +
> > +#define VECTOR_STORE_FLAG_VALUE(MODE) \
> > +  (GET_MODE_CLASS (MODE) == MODE_VECTOR_INT ? constm1_rtx : NULL_RTX)
> > +
> >  /* Standard register usage.  */
> >
> >  /* This processor has special stack-like registers.  See reg-stack.cc
> > diff --git a/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c 
> > b/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c
> > new file mode 100644
> > index 000..b82603d0b64
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c
> > @@ -0,0 +1,26 @@
> > +/* { dg-do compile { target x86_64-*-* } } */
>
> target { { i?86-*-* x86_64-*-* } && lp64 }
Thanks, changed.
>
> Uros.
>
> > +/* { dg-additional-options "-O2 -march=x86-64-v3" } */
> > +
> > +typedef int v4si __attribute__((vector_size(16)));
> > +
> > +v4si __RTL (startwith ("vregs")) foo (void)
> > +{
> > +(function "foo"
> > +  (insn-chain
> > +(block 2
> > +  (edge-from entry (flags "FALLTHRU"))
> > +  (cnote 1 [bb 2] NOTE_INSN_BASIC_BLOCK)
> > +  (cnote 2 NOTE_INSN_FUNCTION_BEG)
> > +  (cinsn 3 (set (reg:V4SI <0>) (const_vector:V4SI [(const_int 0) 
> > (const_int 0) (const_int 0) (const_int 0)])))
> > +  (cinsn 5 (set (reg:V4SI <2>)
> > +   (eq:V4SI (reg:V4SI <0>) (reg:V4SI <1>
> > +  (cinsn 6 (set (reg:V4SI <3>) (reg:V4SI <2>)))
> > +  (cinsn 7 (set (reg:V4SI xmm0) (reg:V4SI <3>)))
> > +  (edge-to exit (flags "FALLTHRU"))
> > +)
> > +  )
> > + (crtl (return_rtx (reg/i:V4SI xmm0)))
> > +)
> > +}
> > +
> > +/* { dg-final { scan-assembler-not "vpxor" } } */
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao


Re: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.

2024-09-23 Thread Hongtao Liu
On Thu, Sep 19, 2024 at 2:08 PM Richard Biener
 wrote:
>
> On Wed, Sep 18, 2024 at 7:55 PM Richard Sandiford
>  wrote:
> >
> > Richard Biener  writes:
> > > On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu  wrote:
> > >>
> > >> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu  wrote:
> > >> >
> > >> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> > >> >  wrote:
> > >> > >
> > >> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt  
> > >> > > wrote:
> > >> > > >
> > >> > > > GCC12 enables vectorization for O2 with very cheap cost model 
> > >> > > > which is restricted
> > >> > > > to constant tripcount. The vectorization capacity is very limited 
> > >> > > > w/ consideration
> > >> > > > of codesize impact.
> > >> > > >
> > >> > > > The patch extends the very cheap cost model a little bit to 
> > >> > > > support variable tripcount.
> > >> > > > But still disable peeling for gaps/alignment, runtime aliasing 
> > >> > > > checking and epilogue
> > >> > > > vectorization with the consideration of codesize.
> > >> > > >
> > >> > > > So there're at most 2 versions of loop for O2 vectorization, one 
> > >> > > > vectorized main loop
> > >> > > > , one scalar/remainder loop.
> > >> > > >
> > >> > > > .i.e.
> > >> > > >
> > >> > > > void
> > >> > > > foo1 (int* __restrict a, int* b, int* c, int n)
> > >> > > > {
> > >> > > >  for (int i = 0; i != n; i++)
> > >> > > >   a[i] = b[i] + c[i];
> > >> > > > }
> > >> > > >
> > >> > > > with -O2 -march=x86-64-v3, will be vectorized to
> > >> > > >
> > >> > > > .L10:
> > >> > > > vmovdqu (%r8,%rax), %ymm0
> > >> > > > vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > >> > > > vmovdqu %ymm0, (%rdi,%rax)
> > >> > > > addq$32, %rax
> > >> > > > cmpq%rdx, %rax
> > >> > > > jne .L10
> > >> > > > movl%ecx, %eax
> > >> > > > andl$-8, %eax
> > >> > > > cmpl%eax, %ecx
> > >> > > > je  .L21
> > >> > > > vzeroupper
> > >> > > > .L12:
> > >> > > > movl(%r8,%rax,4), %edx
> > >> > > > addl(%rsi,%rax,4), %edx
> > >> > > > movl%edx, (%rdi,%rax,4)
> > >> > > > addq$1, %rax
> > >> > > > cmpl%eax, %ecx
> > >> > > > jne .L12
> > >> > > >
> > >> > > > As measured with SPEC2017 on EMR, the patch(N-Iter) improves 
> > >> > > > performance by 4.11%
> > >> > > > with extra 2.8% codeisze, and cheap cost model improve performance 
> > >> > > > by 5.74% with
> > >> > > > extra 8.88% codesize. The details are as below
> > >> > >
> > >> > > I'm confused by this, are the N-Iter numbers on top of the cheap cost
> > >> > > model numbers?
> > >> > No, it's N-iter vs base (very cheap cost model), and cheap vs base.
> > >> > >
> > >> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > >> > > >
> > >> > > > N-Iter  cheap cost model
> > >> > > > 500.perlbench_r -0.12%  -0.12%
> > >> > > > 502.gcc_r   0.44%   -0.11%
> > >> > > > 505.mcf_r   0.17%   4.46%
> > >> > > > 520.omnetpp_r   0.28%   -0.27%
> > >> > > > 523.xalancbmk_r 0.00%   5.93%
> > >> > > > 525.x264_r  -0.09%  23.53%
> > >> > > > 531.deepsjeng_r 0.19%   0.00%
> > >> > > > 541.leela_r 0.22%   0.00%
> > >> > > > 548.exchange2_r -11.54% -22.34%
> > >> > > > 557

Re: [PATCH] doc: Add more alias option and reorder Intel CPU -march documentation

2024-09-18 Thread Hongtao Liu
On Wed, Sep 18, 2024 at 1:35 PM Haochen Jiang  wrote:
>
> Hi all,
>
> Since r15-3539, there are requests coming in to add other alias option
> documentation. This patch will add all of them, including corei7, corei7-avx,
> core-avx-i, core-avx2, atom, slm, gracemont and emeraldrapids.
>
> Also in the patch, I reordered that part of the documentation; currently the
> CPUs/products are scattered all over the place. I regrouped them into
> date-to-now products (from the very first CPU to the latest Panther Lake),
> P-core (since the client parts became hybrid cores, starting from Sapphire
> Rapids) and E-core (from Bonnell to the latest Clearwater Forest).
>
> And in the patch, I refined the product names in the documentation.
>
> Ok for trunk?
Ok, please backport to release branch.
>
> Thx,
> Haochen
>
> gcc/ChangeLog:
>
> * doc/invoke.texi: Add corei7, corei7-avx, core-avx-i,
> core-avx2, atom, slm, gracemont and emeraldrapids. Reorder
> the -march documentation by splitting them into date-to-now
> products, P-core and E-core. Refine the product names in
> documentation.
> ---
>  gcc/doc/invoke.texi | 234 +++-
>  1 file changed, 121 insertions(+), 113 deletions(-)
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index a6cd5111d47..23e1d8577e7 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -34598,6 +34598,7 @@ Intel Core 2 CPU with 64-bit extensions, MMX, SSE, 
> SSE2, SSE3, SSSE3, CX16,
>  SAHF and FXSR instruction set support.
>
>  @item nehalem
> +@itemx corei7
>  Intel Nehalem CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
>  SSE4.1, SSE4.2, POPCNT, CX16, SAHF and FXSR instruction set support.
>
> @@ -34606,16 +34607,19 @@ Intel Westmere CPU with 64-bit extensions, MMX, 
> SSE, SSE2, SSE3, SSSE3,
>  SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR and PCLMUL instruction set support.
>
>  @item sandybridge
> +@itemx corei7-avx
>  Intel Sandy Bridge CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
>  SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE and PCLMUL instruction 
> set
>  support.
>
>  @item ivybridge
> +@itemx core-avx-i
>  Intel Ivy Bridge CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
>  SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND
>  and F16C instruction set support.
>
>  @item haswell
> +@itemx core-avx2
>  Intel Haswell CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
>  SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, 
> RDRND,
>  F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE and HLE instruction set support.
> @@ -34632,61 +34636,6 @@ SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, 
> XSAVE, PCLMUL, FSGSBASE, RDRND,
>  F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX, PREFETCHW, AES,
>  CLFLUSHOPT, XSAVEC, XSAVES and SGX instruction set support.
>
> -@item bonnell
> -Intel Bonnell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3 and 
> SSSE3
> -instruction set support.
> -
> -@item silvermont
> -Intel Silvermont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, 
> SSSE3,
> -SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW and RDRND
> -instruction set support.
> -
> -@item goldmont
> -Intel Goldmont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, 
> SSSE3,
> -SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW, RDRND, AES, SHA,
> -RDSEED, XSAVE, XSAVEC, XSAVES, XSAVEOPT, CLFLUSHOPT and FSGSBASE instruction
> -set support.
> -
> -@item goldmont-plus
> -Intel Goldmont Plus CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
> -SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW, RDRND, 
> AES,
> -SHA, RDSEED, XSAVE, XSAVEC, XSAVES, XSAVEOPT, CLFLUSHOPT, FSGSBASE, PTWRITE,
> -RDPID and SGX instruction set support.
> -
> -@item tremont
> -Intel Tremont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
> -SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW, RDRND, AES, SHA,
> -RDSEED, XSAVE, XSAVEC, XSAVES, XSAVEOPT, CLFLUSHOPT, FSGSBASE, PTWRITE, 
> RDPID,
> -SGX, CLWB, GFNI-SSE, MOVDIRI, MOVDIR64B, CLDEMOTE and WAITPKG instruction set
> -support.
> -
> -@item sierraforest
> -Intel Sierra Forest CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
> -SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL, RDRND, XSAVE, XSAVEC,
> -XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,
> -MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,
> -PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI,
> -AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, ENQCMD and UINTR instruction 
> set
> -support.
> -
> -@item grandridge
> -Intel Grand Ridge CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
> -SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL, RDRND, XSAVE, XSAVEC,
> -XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,
> -MOVDIR64B, CLDEMOTE,

Re: [PATCH] i386: Add missing avx512f-mask-type.h include

2024-09-18 Thread Hongtao Liu
On Wed, Sep 18, 2024 at 1:40 PM Haochen Jiang  wrote:
>
> Hi all,
>
> Since commit r15-3594, we fixed the bugs in MASK_TYPE for AVX10.2
> testcases, but we missed the following four.
>
> The tests do not FAIL since the binutils part hasn't been merged
> yet, which leads to UNSUPPORTED tests. But avx512f-mask-type.h
> needs to be included; otherwise, there will be a compile error.
>
> Tested with an assembler having those insts and sde. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_2-512-vpdpbssd-2.c: Include
> avx512f-mask-type.h.
> * gcc.target/i386/avx10_2-vminmaxsd-2.c: Ditto.
> * gcc.target/i386/avx10_2-vminmaxsh-2.c: Ditto.
> * gcc.target/i386/avx10_2-vminmaxss-2.c: Ditto.
> ---
>  gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c | 2 ++
>  gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c| 1 +
>  gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c| 1 +
>  gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c| 1 +
>  4 files changed, 5 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c
> index add9de89351..624a1a8e50e 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c
> @@ -13,6 +13,8 @@
>  #define SRC_SIZE (AVX512F_LEN / 8)
>  #define SIZE (AVX512F_LEN / 32)
>
> +#include "avx512f-mask-type.h"
> +
>  static void
>  CALC (int *r, int *dst, char *s1, char *s2)
>  {
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c
> index 1e2d78c4068..f550e09be6c 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c
> @@ -8,6 +8,7 @@
>  #include "avx10-helper.h"
>  #include 
>  #include "avx10-minmax-helper.h"
> +#include "avx512f-mask-type.h"
>
>  void static
>  CALC (double *r, double *s1, double *s2, int R)
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c
> index e6a93c403b5..dbf1087d9c3 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c
> @@ -8,6 +8,7 @@
>  #include "avx10-helper.h"
>  #include 
>  #include "avx10-minmax-helper.h"
> +#include "avx512f-mask-type.h"
>
>  void static
>  CALC (_Float16 *r, _Float16 *s1, _Float16 *s2, int R)
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c
> index 47177e69640..7baa396a2d3 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c
> @@ -8,6 +8,7 @@
>  #include "avx10-helper.h"
>  #include 
>  #include "avx10-minmax-helper.h"
> +#include "avx512f-mask-type.h"
>
>  void static
>  CALC (float *r, float *s1, float *s2, int R)
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Enhance AVX10.2 convert tests

2024-09-18 Thread Hongtao Liu
On Wed, Sep 18, 2024 at 1:42 PM Haochen Jiang  wrote:
>
> Hi all,
>
> For AVX10.2 convert tests, all of them were missing mask tests
> previously; this patch adds them to the tests.
>
> Tested on sde with an assembler supporting these insts. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c: Enhance mask test.
> * gcc.target/i386/avx10_2-512-vcvthf82ph-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtne2ph2hf8s-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtneph2bf8-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtneph2bf8s-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtneph2hf8-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtneph2hf8s-2.c: Ditto.
> * gcc.target/i386/avx512f-helper.h: Fix a typo in macro define.
> ---
>  .../i386/avx10_2-512-vcvt2ps2phx-2.c  | 35 ---
>  .../i386/avx10_2-512-vcvthf82ph-2.c   | 27 ++
>  .../i386/avx10_2-512-vcvtne2ph2bf8-2.c| 25 ++---
>  .../i386/avx10_2-512-vcvtne2ph2bf8s-2.c   | 25 ++---
>  .../i386/avx10_2-512-vcvtne2ph2hf8-2.c| 25 ++---
>  .../i386/avx10_2-512-vcvtne2ph2hf8s-2.c   | 25 ++---
>  .../i386/avx10_2-512-vcvtneph2bf8-2.c | 29 ++-
>  .../i386/avx10_2-512-vcvtneph2bf8s-2.c| 27 ++
>  .../i386/avx10_2-512-vcvtneph2hf8-2.c | 27 ++
>  .../i386/avx10_2-512-vcvtneph2hf8s-2.c| 27 ++
>  .../gcc.target/i386/avx512f-helper.h  |  2 +-
>  11 files changed, 209 insertions(+), 65 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c
> index 40dbe18abbe..5e355ae53d4 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c
> @@ -10,24 +10,25 @@
>  #include "avx10-helper.h"
>  #include 
>
> -#define SIZE_RES (AVX512F_LEN / 16)
> +#define SIZE (AVX512F_LEN / 16)
> +#include "avx512f-mask-type.h"
>
>  static void
>  CALC (_Float16 *res_ref, float *src1, float *src2)
>  {
>float fp32;
>int i;
> -  for (i = 0; i < SIZE_RES / 2; i++)
> +  for (i = 0; i < SIZE / 2; i++)
>  {
>fp32 = (float) 2 * i + 7 + i * 0.5;
>res_ref[i] = fp32;
>src2[i] = fp32;
>  }
> -  for (i = SIZE_RES / 2; i < SIZE_RES; i++)
> +  for (i = SIZE / 2; i < SIZE; i++)
>  {
>fp32 = (float)2 * i + 7 + i * 0.5;
>res_ref[i] = fp32;
> -  src1[i - (SIZE_RES / 2)] = fp32;
> +  src1[i - (SIZE / 2)] = fp32;
>  }
>  }
>
> @@ -35,17 +36,27 @@ void
>  TEST (void)
>  {
>int i;
> -  UNION_TYPE (AVX512F_LEN, h) res1;
> +  UNION_TYPE (AVX512F_LEN, h) res1, res2, res3;
>UNION_TYPE (AVX512F_LEN, ) src1, src2;
> -  _Float16 res_ref[SIZE_RES];
> -  float fp32;
> -
> -  for (i = 0; i < SIZE_RES; i++)
> -res1.a[i] = 5;
> -
> +  MASK_TYPE mask = MASK_VALUE;
> +  _Float16 res_ref[SIZE];
> +
> +  for (i = 0; i < SIZE; i++)
> +res2.a[i] = DEFAULT_VALUE;
> +
>CALC (res_ref, src1.a, src2.a);
> -
> +
>res1.x = INTRINSIC (_cvtx2ps_ph) (src1.x, src2.x);
>if (UNION_CHECK (AVX512F_LEN, h) (res1, res_ref))
>  abort ();
> +
> +  res2.x = INTRINSIC (_mask_cvtx2ps_ph) (res2.x, mask, src1.x, src2.x);
> +  MASK_MERGE (h) (res_ref, mask, SIZE);
> +  if (UNION_CHECK (AVX512F_LEN, h) (res2, res_ref))
> +abort ();
> +
> +  res3.x = INTRINSIC (_maskz_cvtx2ps_ph) (mask, src1.x, src2.x);
> +  MASK_ZERO (h) (res_ref, mask, SIZE);
> +  if (UNION_CHECK (AVX512F_LEN, h) (res3, res_ref))
> +abort ();
>  }
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvthf82ph-2.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvthf82ph-2.c
> index 6b9f07ff86a..1aa5daa6c58 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvthf82ph-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvthf82ph-2.c
> @@ -12,13 +12,14 @@
>  #include "fp8-helper.h"
>
>  #define SIZE_SRC (AVX512F_LEN_HALF / 8)
> -#define SIZE_RES (AVX512F_LEN / 16)
> +#define SIZE (AVX512F_LEN / 16)
> +#include "avx512f-mask-type.h"
>
>  void
>  CALC (_Float16 *r, unsigned char *s)
>  {
>int i;
> -  for (i = 0; i < SIZE_RES; i++)
> +  for (i = 0; i < SIZE; i++)
>  r[i] = convert_hf8_to_fp16(s[i]);
>  }
>
> @@ -26,9 +27,10 @@ void
>  TEST (void)
>  {
>int i,sign;
> -  UNION_TYPE (AVX512F_LEN, h) res;
> +  UNION_TYPE (AVX512F_LEN, h) res1, res2, res3;
>UNION_TYPE (AVX512F_LEN_HALF, i_b) src;
> -  _Float16 res_ref[SIZE_RES];
> +  MASK_TYPE mask = MASK_VALUE;
> +  _Float16 res_ref[SIZE];
>
>sign = 1;
>for (i = 0; i < SIZE_SRC; i++)
> @@ -37,9 +39,22 @@ TEST (void)
>sign = -sign;
>   

Re: [PATCH] i386: Add ssemov2, sseicvt2 for some load instructions that use memory on operand2

2024-09-18 Thread Hongtao Liu
On Thu, Sep 19, 2024 at 9:34 AM Hu, Lin1  wrote:
>
> Hi, all
>
> The memory attr of some instructions should be 'load', but it is 'none'
> currently.
>
> This patch adds two new types, ssemov2 and sseicvt2, for some load instructions
> that use memory on operand 2, so their memory attr will be 'load'.
>
> Bootstrapped and Regtested on x86-64-pc-linux-gnu, OK for trunk?
Ok.
>
> BRs
> Lin
>
> gcc/ChangeLog:
>
> * config/i386/i386.md: Add ssemov2, sseicvt2.
> * config/i386/sse.md (sse2_cvtsi2sd): Apply sseicvt2.
> (sse2_cvtsi2sdq): Ditto.
> (vec_set<mode>_0): Apply ssemov2 for alternatives 4, 6.
> ---
>  gcc/config/i386/i386.md | 11 +++
>  gcc/config/i386/sse.md  |  6 --
>  2 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index c0441514949..9c2a0aa6112 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -539,10 +539,10 @@ (define_attr "type"
> str,bitmanip,
> fmov,fop,fsgn,fmul,fdiv,fpspc,fcmov,fcmp,
> fxch,fistp,fisttp,frndint,
> -   sse,ssemov,sseadd,sseadd1,sseiadd,sseiadd1,
> +   sse,ssemov,ssemov2,sseadd,sseadd1,sseiadd,sseiadd1,
> ssemul,sseimul,ssediv,sselog,sselog1,
> sseishft,sseishft1,ssecmp,ssecomi,
> -   ssecvt,ssecvt1,sseicvt,sseins,
> +   ssecvt,ssecvt1,sseicvt,sseicvt2,sseins,
> sseshuf,sseshuf1,ssemuladd,sse4arg,
> lwp,mskmov,msklog,
> mmx,mmxmov,mmxadd,mmxmul,mmxcmp,mmxcvt,mmxshft"
> @@ -560,10 +560,10 @@ (define_attr "unit" "integer,i387,sse,mmx,unknown"
>(cond [(eq_attr "type" "fmov,fop,fsgn,fmul,fdiv,fpspc,fcmov,fcmp,
>   fxch,fistp,fisttp,frndint")
>(const_string "i387")
> -(eq_attr "type" "sse,ssemov,sseadd,sseadd1,sseiadd,sseiadd1,
> +(eq_attr "type" "sse,ssemov,ssemov2,sseadd,sseadd1,sseiadd,sseiadd1,
>   ssemul,sseimul,ssediv,sselog,sselog1,
>   sseishft,sseishft1,ssecmp,ssecomi,
> - ssecvt,ssecvt1,sseicvt,sseins,
> + ssecvt,ssecvt1,sseicvt,sseicvt2,sseins,
>   sseshuf,sseshuf1,ssemuladd,sse4arg,mskmov")
>(const_string "sse")
>  (eq_attr "type" "mmx,mmxmov,mmxadd,mmxmul,mmxcmp,mmxcvt,mmxshft")
> @@ -858,6 +858,9 @@ (define_attr "memory" "none,load,store,both,unknown"
>mmx,mmxmov,mmxcmp,mmxcvt,mskmov,msklog")
>   (match_operand 2 "memory_operand"))
>(const_string "load")
> +(and (eq_attr "type" "ssemov2,sseicvt2")
> + (match_operand 2 "memory_operand"))
> +  (const_string "load")
>  (and (eq_attr "type" "icmov,ssemuladd,sse4arg")
>   (match_operand 3 "memory_operand"))
>(const_string "load")
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 1ae61182d0c..ff4f33b7b63 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -8876,7 +8876,7 @@ (define_insn "sse2_cvtsi2sd"
> cvtsi2sd{l}\t{%2, %0|%0, %2}
> vcvtsi2sd{l}\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> -   (set_attr "type" "sseicvt")
> +   (set_attr "type" "sseicvt2")
> (set_attr "athlon_decode" "double,direct,*")
> (set_attr "amdfam10_decode" "vector,double,*")
> (set_attr "bdver1_decode" "double,direct,*")
> @@ -8898,7 +8898,7 @@ (define_insn "sse2_cvtsi2sdq"
> cvtsi2sd{q}\t{%2, %0|%0, %2}
> vcvtsi2sd{q}\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> -   (set_attr "type" "sseicvt")
> +   (set_attr "type" "sseicvt2")
> (set_attr "athlon_decode" "double,direct,*")
> (set_attr "amdfam10_decode" "vector,double,*")
> (set_attr "bdver1_decode" "double,direct,*")
> @@ -11808,6 +11808,8 @@ (define_insn "vec_set<mode>_0"
>   (const_string "imov")
> (eq_attr "alternative" "14")
>   (const_string "fmov")
> +   (eq_attr "alternative" "4,6")
> + (const_string "ssemov2")
>]
>(const_string "ssemov")))
> (set (attr "addr")
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.

2024-09-12 Thread Hongtao Liu
On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu  wrote:
>
> On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
>  wrote:
> >
> > On Wed, Sep 11, 2024 at 4:17 AM liuhongt  wrote:
> > >
> > > GCC12 enables vectorization for O2 with very cheap cost model which is 
> > > restricted
> > > to constant tripcount. The vectorization capacity is very limited w/ 
> > > consideration
> > > of codesize impact.
> > >
> > > The patch extends the very cheap cost model a little bit to support 
> > > variable tripcount.
> > > But still disable peeling for gaps/alignment, runtime aliasing checking 
> > > and epilogue
> > > vectorization with the consideration of codesize.
> > >
> > > So there're at most 2 versions of loop for O2 vectorization, one 
> > > vectorized main loop
> > > , one scalar/remainder loop.
> > >
> > > i.e.
> > >
> > > void
> > > foo1 (int* __restrict a, int* b, int* c, int n)
> > > {
> > >  for (int i = 0; i != n; i++)
> > >   a[i] = b[i] + c[i];
> > > }
> > >
> > > with -O2 -march=x86-64-v3, will be vectorized to
> > >
> > > .L10:
> > > vmovdqu (%r8,%rax), %ymm0
> > > vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > > vmovdqu %ymm0, (%rdi,%rax)
> > > addq$32, %rax
> > > cmpq%rdx, %rax
> > > jne .L10
> > > movl%ecx, %eax
> > > andl$-8, %eax
> > > cmpl%eax, %ecx
> > > je  .L21
> > > vzeroupper
> > > .L12:
> > > movl(%r8,%rax,4), %edx
> > > addl(%rsi,%rax,4), %edx
> > > movl%edx, (%rdi,%rax,4)
> > > addq$1, %rax
> > > cmpl%eax, %ecx
> > > jne .L12
> > >
> > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance 
> > > by 4.11%
> > > with extra 2.8% codesize, and the cheap cost model improves performance by 
> > > 5.74% with
> > > extra 8.88% codesize. The details are as below
> >
> > I'm confused by this, are the N-Iter numbers on top of the cheap cost
> > model numbers?
> No, it's N-iter vs base (very cheap cost model), and cheap vs base.
> >
> > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > >
> > > N-Iter  cheap cost model
> > > 500.perlbench_r -0.12%  -0.12%
> > > 502.gcc_r   0.44%   -0.11%
> > > 505.mcf_r   0.17%   4.46%
> > > 520.omnetpp_r   0.28%   -0.27%
> > > 523.xalancbmk_r 0.00%   5.93%
> > > 525.x264_r  -0.09%  23.53%
> > > 531.deepsjeng_r 0.19%   0.00%
> > > 541.leela_r 0.22%   0.00%
> > > 548.exchange2_r -11.54% -22.34%
> > > 557.xz_r0.74%   0.49%
> > > GEOMEAN INT -1.04%  0.60%
> > >
> > > 503.bwaves_r3.13%   4.72%
> > > 507.cactuBSSN_r 1.17%   0.29%
> > > 508.namd_r  0.39%   6.87%
> > > 510.parest_r3.14%   8.52%
> > > 511.povray_r0.10%   -0.20%
> > > 519.lbm_r   -0.68%  10.14%
> > > 521.wrf_r   68.20%  76.73%
> >
> > So this seems to regress as well?
> N-iter increases performance less than the cheap cost model; that's
> expected, not a regression.
> >
> > > 526.blender_r   0.12%   0.12%
> > > 527.cam4_r  19.67%  23.21%
> > > 538.imagick_r   0.12%   0.24%
> > > 544.nab_r   0.63%   0.53%
> > > 549.fotonik3d_r 14.44%  9.43%
> > > 554.roms_r  12.39%  0.00%
> > > GEOMEAN FP  8.26%   9.41%
> > > GEOMEAN ALL 4.11%   5.74%

I've tested the patch on aarch64; it shows a similar improvement with
little codesize increase.
I haven't tested it on other backends, but I think it would show
similarly good improvements
> > >
> > > Code sise impact
> > > N-Iter  cheap cost model
> > > 500.perlbench_r 0.22%   1.03%
> > > 502.gcc_r   0.25%   0.60%
> > > 505.mcf_r   0.00%   32.07%
> > > 520.omnetpp_r   0.09%   0.31%
> > > 523.xalancbmk_r 0.08%   1.86%
> > > 525.x264_r  0.75%   7.96%
> > > 531.deepsjeng_

Re: [PATCH v2] Enable V2BF/V4BF vec_cmp with AVX10.2 vcmppbf16

2024-09-11 Thread Hongtao Liu
On Thu, Sep 12, 2024 at 9:55 AM Levy Hsu  wrote:
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (ix86_get_mask_mode):
> Enable BFmode for targetm.vectorize.get_mask_mode with AVX10.2.
> * config/i386/mmx.md (vec_cmp<mode>qi):
> Implement vec_cmpv2bfqi and vec_cmpv4bfqi.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/part-vect-vec_cmpbf.c: New test.
> ---
>  gcc/config/i386/i386.cc   |  3 ++-
>  gcc/config/i386/mmx.md| 17 
>  .../gcc.target/i386/part-vect-vec_cmpbf.c | 26 +++
>  3 files changed, 45 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-vec_cmpbf.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 45320124b91..7dbae1d72e3 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -24682,7 +24682,8 @@ ix86_get_mask_mode (machine_mode data_mode)
>/* AVX512FP16 only supports vector comparison
>  to kmask for _Float16.  */
>|| (TARGET_AVX512VL && TARGET_AVX512FP16
> - && GET_MODE_INNER (data_mode) == E_HFmode))
> + && GET_MODE_INNER (data_mode) == E_HFmode)
> +  || (TARGET_AVX10_2_256 && GET_MODE_INNER (data_mode) == E_BFmode))
>  {
>if (elem_size == 4
>   || elem_size == 8
> diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> index 4bc191b874b..95d9356694a 100644
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -2290,6 +2290,23 @@
>DONE;
>  })
>
> +;;This instruction does not generate floating point exceptions
> +(define_expand "vec_cmp<mode>qi"
> +  [(set (match_operand:QI 0 "register_operand")
> +   (match_operator:QI 1 ""
> + [(match_operand:VBF_32_64 2 "register_operand")
> +  (match_operand:VBF_32_64 3 "nonimmediate_operand")]))]
> +  "TARGET_AVX10_2_256"
> +{
> +  rtx op2 = lowpart_subreg (V8BFmode,
> +   force_reg (<MODE>mode, operands[2]), <MODE>mode);
> +  rtx op3 = lowpart_subreg (V8BFmode,
> +   force_reg (<MODE>mode, operands[3]), <MODE>mode);
> +
> +  emit_insn (gen_vec_cmpv8bfqi (operands[0], operands[1], op2, op3));
> +  DONE;
> +})
> +
>  ;
>  ;;
>  ;; Parallel half-precision floating point rounding operations.
> diff --git a/gcc/testsuite/gcc.target/i386/part-vect-vec_cmpbf.c 
> b/gcc/testsuite/gcc.target/i386/part-vect-vec_cmpbf.c
> new file mode 100644
> index 000..0bb720b6432
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/part-vect-vec_cmpbf.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mavx10.2" } */
> +/* { dg-final { scan-assembler-times "vcmppbf16" 10 } } */
> +
> +typedef __bf16 __attribute__((__vector_size__ (4))) v2bf;
> +typedef __bf16 __attribute__((__vector_size__ (8))) v4bf;
> +
> +
> +#define VCMPMN(type, op, name) \
> +type  \
> +__attribute__ ((noinline, noclone)) \
> +vec_cmp_##type##type##name (type a, type b) \
> +{ \
> +  return a op b;  \
> +}
> +
> +VCMPMN (v4bf, <, lt)
> +VCMPMN (v2bf, <, lt)
> +VCMPMN (v4bf, <=, le)
> +VCMPMN (v2bf, <=, le)
> +VCMPMN (v4bf, >, gt)
> +VCMPMN (v2bf, >, gt)
> +VCMPMN (v4bf, >=, ge)
> +VCMPMN (v2bf, >=, ge)
> +VCMPMN (v4bf, ==, eq)
> +VCMPMN (v2bf, ==, eq)
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.

2024-09-11 Thread Hongtao Liu
On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
 wrote:
>
> On Wed, Sep 11, 2024 at 4:17 AM liuhongt  wrote:
> >
> > GCC12 enables vectorization for O2 with very cheap cost model which is 
> > restricted
> > to constant tripcount. The vectorization capacity is very limited w/ 
> > consideration
> > of codesize impact.
> >
> > The patch extends the very cheap cost model a little bit to support 
> > variable tripcount.
> > But still disable peeling for gaps/alignment, runtime aliasing checking and 
> > epilogue
> > vectorization with the consideration of codesize.
> >
> > So there're at most 2 versions of loop for O2 vectorization, one vectorized 
> > main loop
> > , one scalar/remainder loop.
> >
> > i.e.
> >
> > void
> > foo1 (int* __restrict a, int* b, int* c, int n)
> > {
> >  for (int i = 0; i != n; i++)
> >   a[i] = b[i] + c[i];
> > }
> >
> > with -O2 -march=x86-64-v3, will be vectorized to
> >
> > .L10:
> > vmovdqu (%r8,%rax), %ymm0
> > vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > vmovdqu %ymm0, (%rdi,%rax)
> > addq$32, %rax
> > cmpq%rdx, %rax
> > jne .L10
> > movl%ecx, %eax
> > andl$-8, %eax
> > cmpl%eax, %ecx
> > je  .L21
> > vzeroupper
> > .L12:
> > movl(%r8,%rax,4), %edx
> > addl(%rsi,%rax,4), %edx
> > movl%edx, (%rdi,%rax,4)
> > addq$1, %rax
> > cmpl%eax, %ecx
> > jne .L12
> >
> > As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance by 
> > 4.11%
> > with extra 2.8% codesize, and the cheap cost model improves performance by 5.74% 
> > with
> > extra 8.88% codesize. The details are as below
>
> I'm confused by this, are the N-Iter numbers on top of the cheap cost
> model numbers?
No, it's N-iter vs base (very cheap cost model), and cheap vs base.
>
> > Performance measured with -march=x86-64-v3 -O2 on EMR
> >
> > N-Iter  cheap cost model
> > 500.perlbench_r -0.12%  -0.12%
> > 502.gcc_r   0.44%   -0.11%
> > 505.mcf_r   0.17%   4.46%
> > 520.omnetpp_r   0.28%   -0.27%
> > 523.xalancbmk_r 0.00%   5.93%
> > 525.x264_r  -0.09%  23.53%
> > 531.deepsjeng_r 0.19%   0.00%
> > 541.leela_r 0.22%   0.00%
> > 548.exchange2_r -11.54% -22.34%
> > 557.xz_r0.74%   0.49%
> > GEOMEAN INT -1.04%  0.60%
> >
> > 503.bwaves_r3.13%   4.72%
> > 507.cactuBSSN_r 1.17%   0.29%
> > 508.namd_r  0.39%   6.87%
> > 510.parest_r3.14%   8.52%
> > 511.povray_r0.10%   -0.20%
> > 519.lbm_r   -0.68%  10.14%
> > 521.wrf_r   68.20%  76.73%
>
> So this seems to regress as well?
N-iter increases performance less than the cheap cost model; that's
expected, not a regression.
>
> > 526.blender_r   0.12%   0.12%
> > 527.cam4_r  19.67%  23.21%
> > 538.imagick_r   0.12%   0.24%
> > 544.nab_r   0.63%   0.53%
> > 549.fotonik3d_r 14.44%  9.43%
> > 554.roms_r  12.39%  0.00%
> > GEOMEAN FP  8.26%   9.41%
> > GEOMEAN ALL 4.11%   5.74%
> >
> > Code sise impact
> > N-Iter  cheap cost model
> > 500.perlbench_r 0.22%   1.03%
> > 502.gcc_r   0.25%   0.60%
> > 505.mcf_r   0.00%   32.07%
> > 520.omnetpp_r   0.09%   0.31%
> > 523.xalancbmk_r 0.08%   1.86%
> > 525.x264_r  0.75%   7.96%
> > 531.deepsjeng_r 0.72%   3.28%
> > 541.leela_r 0.18%   0.75%
> > 548.exchange2_r 8.29%   12.19%
> > 557.xz_r0.40%   0.60%
> > GEOMEAN INT 1.07%%  5.71%
> >
> > 503.bwaves_r12.89%  21.59%
> > 507.cactuBSSN_r 0.90%   20.19%
> > 508.namd_r  0.77%   14.75%
> > 510.parest_r0.91%   3.91%
> > 511.povray_r0.45%   4.08%
> > 519.lbm_r   0.00%   0.00%
> > 521.wrf_r   5.97%   12.79%
> > 526.blender_r   0.49%   3.84%
> > 527.cam4_r  1.39%   3.28%
> > 538.imagick_r   1.86%   7.78%
> > 544.nab_r   0.41%   3.00%
> > 549.fotonik3d_r 25.50%  47.47%
> > 554.roms_r  5.17%   13.01%
> > GEOMEAN FP  4.14%   11.38%
> > GEOMEAN ALL 2.80%   8.88%
> >
> >
> > The only regression is from 548.exchange2_r: the vectorization of the inner
> > loop in each layer
> > of the 9-layer loop nest increases register pressure and causes more spills.
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> >   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > .
> > - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> > ...
> > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> >
> > Looks like aarch64 doesn't have the issue because aarch64 has 

Re: [PATCH] i386: Fix incorrect avx512f-mask-type.h include

2024-09-10 Thread Hongtao Liu
On Thu, Sep 5, 2024 at 10:05 AM Haochen Jiang  wrote:
>
> Hi all,
>
> In avx512f-mask-type.h, we need SIZE to be defined to get
> MASK_TYPE defined correctly. Fix those testcases where
> SIZE is not defined before the include of avx512f-mask-type.h.
>
> Note that the convert intrins in AVX10.2 will need more
> modifications because the current tests did not include mask ones.
> They will be in a separate patch.
>
> Tested on x86-64-pc-linux-gnu. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10-helper.h: Do not include
> avx512f-mask-type.h.
> * gcc.target/i386/avx10_2-512-vaddnepbf16-2.c:
> Define SIZE and include avx512f-mask-type.h.
> * gcc.target/i386/avx10_2-512-vcmppbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtnebf162ibs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtnebf162iubs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtph2ibs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtph2iubs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtps2ibs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvtps2iubs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttnebf162ibs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttnebf162iubs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttpd2dqs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttpd2qqs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttpd2udqs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttpd2uqqs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttph2ibs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttph2iubs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttps2dqs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttps2ibs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttps2iubs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttps2qqs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttps2udqs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vcvttps2uqqs-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vdivnepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vdpphps-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vfmaddXXXnepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vfmsubXXXnepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vfnmaddXXXnepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vfnmsubXXXnepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vfpclasspbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vgetexppbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vgetmantpbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vmaxpbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vminmaxnepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vminmaxpd-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vminmaxph-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vminmaxps-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vminpbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vmpsadbw-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vmulnepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpbssd-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpbssds-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpbsud-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpbsuds-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpbuud-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpbuuds-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpwsud-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpwsuds-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpwusd-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpwusds-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpwuud-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vpdpwuuds-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vrcppbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vreducenepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vrndscalenepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vrsqrtpbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vscalefpbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vsqrtnepbf16-2.c: Ditto.
> * gcc.target/i386/avx10_2-512-vsubnepbf16-2.c: Ditto.
> * gcc.target/i386/avx512fp16-vfpclassph-1b.c: Ditto.
> ---
>  gcc/testsuite/gcc.target/i386/avx10-helper.h  |  1 -
>  .../i386/avx10_2-512-vaddnepbf16-2.c  | 11 +-
>  .../gcc.target/i386/avx10_2-512-vcmppbf16-2.c |  5 +++--
>  .../i386/avx10_2-512-vcvtnebf162ibs-2.c   | 16 +++---
>  .../i386/avx10_2-512-vcvtnebf162iubs-2.c  | 16 +++---
>  .../i386/avx10_2-512-vcvtph2ibs-2.c   | 16 +++---
>  .../i386/avx10_2-512-vcvtph2iubs-2.c  | 16 +++---
>  .../i386/avx10_2-512-vcvtps2ibs-2.c   | 16 +++---
>  .../i386/avx10_2-

Re: [PATCH] x86: Refine V4BF/V2BF FMA Testcase

2024-09-10 Thread Hongtao Liu
On Tue, Sep 10, 2024 at 3:35 PM Levy Hsu  wrote:
>
> Simple testcase fix, ok for trunk?
Ok.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c: Separated 32-bit 
> scan
> and removed register checks in spill situations.
> ---
>  .../i386/avx10_2-partial-bf-vector-fma-1.c   | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c
> index 72e17e99603..8a9096a300a 100644
> --- a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c
> @@ -1,9 +1,13 @@
>  /* { dg-do compile } */
>  /* { dg-options "-mavx10.2 -O2" } */
> -/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[
>  \\t\]+#)" 2 } } */
> -/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[
>  \\t\]+#)" 2 } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[
>  \\t\]+#)" 2 } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[
>  \\t\]+#)" 2 } } */
> +/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" 
> 3 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" 
> 3 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" 
> 3 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" 
> 3 { target ia32 } } } */
> +/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" 
> 2 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" 
> 2 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" 
> 2 { target { ! ia32 } } } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" 
> 2 { target { ! ia32 } } } } */
>
>  typedef __bf16 v4bf __attribute__ ((__vector_size__ (8)));
>  typedef __bf16 v2bf __attribute__ ((__vector_size__ (4)));
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] x86: Refine V4BF/V2BF FMA testcase

2024-09-05 Thread Hongtao Liu
On Fri, Sep 6, 2024 at 10:34 AM Jiang, Haochen  wrote:
>
> > From: Levy Hsu 
> > Sent: Thursday, September 5, 2024 4:55 PM
> > To: gcc-patches@gcc.gnu.org
> >
> > Simple testcase fix, ok for trunk?
> >
> > This patch removes specific register checks to account for possible
> > register spills and disables tests in 32-bit mode. This adjustment
> > is necessary because V4BF operations in 32-bit mode require duplicating
> > instructions, which leads to unintended test failures. It fixed the
> > case when testing with --target_board='unix{-m32\ -march=cascadelake}'
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c: Remove specific
> > register checks to account for potential register spills. Exclude 
> > tests
> > in 32-bit mode to prevent incorrect failure reports due to the need 
> > for
> > multiple instruction executions in handling V4BF operations.
> > ---
> >  .../gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c
> > b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c
> > index 72e17e99603..17c32c1d36b 100644
> > --- a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c
> > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c
> > @@ -1,9 +1,9 @@
> >  /* { dg-do compile } */
>
> You could simply add { target { ! ia32 } } here, but not each line of
> scan-assembler-times.
It can be compiled for target ia32; I guess for ia32 the fma instructions
can be scanned 3 times (1 for the original 32-bit vector fma, 2 from
splitting the 64-bit vector fma into two 32-bit vector fmas).
So better to scan for 2 fmas on ! ia32 and 3 fmas on ia32?

>
> I don't think we need this test to be run for -m32 due to V4BF. Actually
> the better choice is to split the testcase into two parts; for V2BF, I suppose
> it could be run under -m32.
>
> Thx,
> Haochen
>
> >  /* { dg-options "-mavx10.2 -O2" } */
> > -/* { dg-final { scan-assembler-times
> > "vfmadd132nepbf16\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-
> > 9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */
> > -/* { dg-final { scan-assembler-times
> > "vfmsub132nepbf16\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-
> > 9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */
> > -/* { dg-final { scan-assembler-times
> > "vfnmadd132nepbf16\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-
> > 9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */
> > -/* { dg-final { scan-assembler-times
> > "vfnmsub132nepbf16\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-
> > 9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */
> > +/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[^\n\r\]*xmm\[0-
> > 9\]" 2 { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[^\n\r\]*xmm\[0-
> > 9\]" 2 { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler-times
> > "vfnmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" 2 { target { ! ia32 } } } } */
> > +/* { dg-final { scan-assembler-times
> > "vfnmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" 2 { target { ! ia32 } } } } */
> >
> >  typedef __bf16 v4bf __attribute__ ((__vector_size__ (8)));
> >  typedef __bf16 v2bf __attribute__ ((__vector_size__ (4)));
> > --
> > 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Integrate BFmode for Enhanced Vectorization in ix86_preferred_simd_mode

2024-09-04 Thread Hongtao Liu
On Wed, Sep 4, 2024 at 9:32 AM Levy Hsu  wrote:
>
> Hi
>
> This change adds BFmode support to the ix86_preferred_simd_mode function,
> enhancing SIMD vectorization for BF16 operations. The update ensures
> optimized usage of SIMD capabilities, improving performance and aligning
> vector sizes with processor capabilities.
>
> Bootstrapped and tested on x86-64-pc-linux-gnu.
> Ok for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (ix86_preferred_simd_mode): Add BFmode Support.
> ---
>  gcc/config/i386/i386.cc | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 7af9ceca429..aea138c85ad 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -24570,6 +24570,14 @@ ix86_preferred_simd_mode (scalar_mode mode)
> }
>return word_mode;
>
> +case E_BFmode:
> +  if (TARGET_AVX512F && TARGET_EVEX512 && !TARGET_PREFER_AVX256)
> +   return V32BFmode;
> +  else if (TARGET_AVX && !TARGET_PREFER_AVX128)
> +   return V16BFmode;
> +  else
> +   return V8BFmode;
> +
>  case E_SFmode:
>if (TARGET_AVX512F && TARGET_EVEX512 && !TARGET_PREFER_AVX256)
> return V16SFmode;
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Support partial signbit/xorsign/copysign/abs/neg/and/xor/ior/andn for V2BF/V4BF

2024-09-04 Thread Hongtao Liu
On Wed, Sep 4, 2024 at 10:53 AM Levy Hsu  wrote:
>
> Hi
>
> This patch adds support for bf16 operations in V2BF and V4BF modes on i386,
> handling signbit, xorsign, copysign, abs, neg, and various logical operations.
>
> Bootstrapped and tested on x86-64-pc-linux-gnu.
> Ok for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (ix86_build_const_vector): Add V2BF/V4BF.
> (ix86_build_signbit_mask): Add V2BF/V4BF.
> * config/i386/mmx.md: Modified supported logic op to use VHBF_32_64.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/part-vect-absnegbf.c: New test.
> ---
>  gcc/config/i386/i386.cc   |  4 +
>  gcc/config/i386/mmx.md| 74 +
>  .../gcc.target/i386/part-vect-absnegbf.c  | 81 +++
>  3 files changed, 124 insertions(+), 35 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-absnegbf.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 78bf890f14b..2bbfb1bf5fc 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -16176,6 +16176,8 @@ ix86_build_const_vector (machine_mode mode, bool 
> vect, rtx value)
>  case E_V32BFmode:
>  case E_V16BFmode:
>  case E_V8BFmode:
> +case E_V4BFmode:
> +case E_V2BFmode:
>n_elt = GET_MODE_NUNITS (mode);
>v = rtvec_alloc (n_elt);
>scalar_mode = GET_MODE_INNER (mode);
> @@ -16215,6 +16217,8 @@ ix86_build_signbit_mask (machine_mode mode, bool 
> vect, bool invert)
>  case E_V32BFmode:
>  case E_V16BFmode:
>  case E_V8BFmode:
> +case E_V4BFmode:
> +case E_V2BFmode:
>vec_mode = mode;
>imode = HImode;
>break;
> diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> index cb2697537a8..44adcd8d8e0 100644
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -121,7 +121,7 @@
>  ;; Mapping of vector float modes to an integer mode of the same size
>  (define_mode_attr mmxintvecmode
>[(V2SF "V2SI") (V2SI "V2SI") (V4HI "V4HI") (V8QI "V8QI")
> -   (V4HF "V4HI") (V2HF "V2HI")])
> +   (V4HF "V4HI") (V2HF "V2HI") (V4BF "V4HI") (V2BF "V2HI")])
>
>  (define_mode_attr mmxintvecmodelower
>[(V2SF "v2si") (V2SI "v2si") (V4HI "v4hi") (V8QI "v8qi")
> @@ -2091,18 +2091,22 @@
>DONE;
>  })
>
> +(define_mode_iterator VHBF_32_64
> + [V2BF (V4BF "TARGET_MMX_WITH_SSE")
> +  V2HF (V4HF "TARGET_MMX_WITH_SSE")])
> +
>  (define_expand "2"
> -  [(set (match_operand:VHF_32_64 0 "register_operand")
> -   (absneg:VHF_32_64
> - (match_operand:VHF_32_64 1 "register_operand")))]
> +  [(set (match_operand:VHBF_32_64 0 "register_operand")
> +   (absneg:VHBF_32_64
> + (match_operand:VHBF_32_64 1 "register_operand")))]
>"TARGET_SSE"
>"ix86_expand_fp_absneg_operator (, mode, operands); DONE;")
>
>  (define_insn_and_split "*mmx_"
> -  [(set (match_operand:VHF_32_64 0 "register_operand" "=x,x,x")
> -   (absneg:VHF_32_64
> - (match_operand:VHF_32_64 1 "register_operand" "0,x,x")))
> -   (use (match_operand:VHF_32_64 2 "register_operand" "x,0,x"))]
> +  [(set (match_operand:VHBF_32_64 0 "register_operand" "=x,x,x")
> +   (absneg:VHBF_32_64
> + (match_operand:VHBF_32_64 1 "register_operand" "0,x,x")))
> +   (use (match_operand:VHBF_32_64 2 "register_operand" "x,0,x"))]
>"TARGET_SSE"
>"#"
>"&& reload_completed"
> @@ -2115,11 +2119,11 @@
>[(set_attr "isa" "noavx,noavx,avx")])
>
>  (define_insn_and_split "*mmx_nabs2"
> -  [(set (match_operand:VHF_32_64 0 "register_operand" "=x,x,x")
> -   (neg:VHF_32_64
> - (abs:VHF_32_64
> -   (match_operand:VHF_32_64 1 "register_operand" "0,x,x"
> -   (use (match_operand:VHF_32_64 2 "register_operand" "x,0,x"))]
> +  [(set (match_operand:VHBF_32_64 0 "register_operand" "=x,x,x")
> +   (neg:VHBF_32_64
> + (abs:VHBF_32_64
> +   (match_operand:VHBF_32_64 1 "register_operand" "0,x,x"
> +   (use (match_operand:VHBF_32_64 2 "register_operand" "x,0,x"))]
>"TARGET_SSE"
>"#"
>"&& reload_completed"
> @@ -2410,11 +2414,11 @@
>  ;
>
>  (define_insn "*mmx_andnot3"
> -  [(set (match_operand:VHF_32_64 0 "register_operand""=x,x")
> -   (and:VHF_32_64
> - (not:VHF_32_64
> -   (match_operand:VHF_32_64 1 "register_operand" "0,x"))
> - (match_operand:VHF_32_64 2 "register_operand"   "x,x")))]
> +  [(set (match_operand:VHBF_32_64 0 "register_operand""=x,x")
> +   (and:VHBF_32_64
> + (not:VHBF_32_64
> +   (match_operand:VHBF_32_64 1 "register_operand" "0,x"))
> + (match_operand:VHBF_32_64 2 "register_operand"   "x,x")))]
>"TARGET_SSE"
>"@
> andnps\t{%2, %0|%0, %2}
> @@ -2425,10 +2429,10 @@
> (set_attr "mode" "V4SF")])
>
>  (define_insn "3"
> -  [(set (match_operand:VHF_32_64 0 "register_operand"   "=x,x")
> -   (any_logic:VHF_32_64

Re: [PATCH] i386: Support partial vectorized FMA for V2BF/V4BF

2024-09-04 Thread Hongtao Liu
On Wed, Sep 4, 2024 at 11:31 AM Levy Hsu  wrote:
>
> Hi
>
> Bootstrapped and tested on x86-64-pc-linux-gnu.
> Ok for trunk?
Ok.
>
> This patch introduces support for vectorized FMA operations for bf16 types in
> V2BF and V4BF modes on the i386 architecture. New mode iterators and
> define_expand entries for fma, fnma, fms, and fnms operations are added in
> mmx.md, enhancing the i386 backend to handle these complex arithmetic 
> operations.
>
> gcc/ChangeLog:
>
> * config/i386/mmx.md (TARGET_MMX_WITH_SSE): New mode iterator 
> VBF_32_64
> (fma4): define_expand for V2BF/V4BF fma4.
> (fnma4): define_expand for V2BF/V4BF fnma4.
> (fms4): define_expand for V2BF/V4BF fms4.
> (fnms4): define_expand for V2BF/V4BF fnms4.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c: New test.
> ---
>  gcc/config/i386/mmx.md| 84 ++-
>  .../i386/avx10_2-partial-bf-vector-fma-1.c| 57 +
>  2 files changed, 139 insertions(+), 2 deletions(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c
>
> diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> index 10fcd2beda6..22aeb43f436 100644
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -2636,6 +2636,88 @@
>DONE;
>  })
>
> +(define_mode_iterator VBF_32_64 [V2BF (V4BF "TARGET_MMX_WITH_SSE")])
> +
> +(define_expand "fma4"
> +  [(set (match_operand:VBF_32_64 0 "register_operand")
> +   (fma:VBF_32_64
> + (match_operand:VBF_32_64 1 "nonimmediate_operand")
> + (match_operand:VBF_32_64 2 "nonimmediate_operand")
> + (match_operand:VBF_32_64 3 "nonimmediate_operand")))]
> +  "TARGET_AVX10_2_256"
> +{
> +  rtx op0 = gen_reg_rtx (V8BFmode);
> +  rtx op1 = lowpart_subreg (V8BFmode, force_reg (mode, operands[1]), 
> mode);
> +  rtx op2 = lowpart_subreg (V8BFmode, force_reg (mode, operands[2]), 
> mode);
> +  rtx op3 = lowpart_subreg (V8BFmode, force_reg (mode, operands[3]), 
> mode);
> +
> +  emit_insn (gen_fmav8bf4 (op0, op1, op2, op3));
> +
> +  emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode));
> +  DONE;
> +})
> +
> +(define_expand "fms4"
> +  [(set (match_operand:VBF_32_64 0 "register_operand")
> +   (fma:VBF_32_64
> + (match_operand:VBF_32_64   1 "nonimmediate_operand")
> + (match_operand:VBF_32_64   2 "nonimmediate_operand")
> + (neg:VBF_32_64
> +   (match_operand:VBF_32_64 3 "nonimmediate_operand"]
> +  "TARGET_AVX10_2_256"
> +{
> +  rtx op0 = gen_reg_rtx (V8BFmode);
> +  rtx op1 = lowpart_subreg (V8BFmode, force_reg (mode, operands[1]), 
> mode);
> +  rtx op2 = lowpart_subreg (V8BFmode, force_reg (mode, operands[2]), 
> mode);
> +  rtx op3 = lowpart_subreg (V8BFmode, force_reg (mode, operands[3]), 
> mode);
> +
> +  emit_insn (gen_fmsv8bf4 (op0, op1, op2, op3));
> +
> +  emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode));
> +  DONE;
> +})
> +
> +(define_expand "fnma4"
> +  [(set (match_operand:VBF_32_64 0 "register_operand")
> +   (fma:VBF_32_64
> + (neg:VBF_32_64
> +   (match_operand:VBF_32_64 1 "nonimmediate_operand"))
> + (match_operand:VBF_32_64   2 "nonimmediate_operand")
> + (match_operand:VBF_32_64   3 "nonimmediate_operand")))]
> +  "TARGET_AVX10_2_256"
> +{
> +  rtx op0 = gen_reg_rtx (V8BFmode);
> +  rtx op1 = lowpart_subreg (V8BFmode, force_reg (mode, operands[1]), 
> mode);
> +  rtx op2 = lowpart_subreg (V8BFmode, force_reg (mode, operands[2]), 
> mode);
> +  rtx op3 = lowpart_subreg (V8BFmode, force_reg (mode, operands[3]), 
> mode);
> +
> +  emit_insn (gen_fnmav8bf4 (op0, op1, op2, op3));
> +
> +  emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode));
> +  DONE;
> +})
> +
> +(define_expand "fnms4"
> +  [(set (match_operand:VBF_32_64 0 "register_operand")
> +   (fma:VBF_32_64
> + (neg:VBF_32_64
> +   (match_operand:VBF_32_64 1 "nonimmediate_operand"))
> + (match_operand:VBF_32_64   2 "nonimmediate_operand")
> + (neg:VBF_32_64
> +   (match_operand:VBF_32_64 3 "nonimmediate_operand"]
> +  "TARGET_AVX10_2_256"
> +{
> +  rtx op0 = gen_reg_rtx (V8BFmode);
> +  rtx op1 = lowpart_subreg (V8BFmode, force_reg (mode, operands[1]), 
> mode);
> +  rtx op2 = lowpart_subreg (V8BFmode, force_reg (mode, operands[2]), 
> mode);
> +  rtx op3 = lowpart_subreg (V8BFmode, force_reg (mode, operands[3]), 
> mode);
> +
> +  emit_insn (gen_fnmsv8bf4 (op0, op1, op2, op3));
> +
> +  emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode));
> +  DONE;
> +})
> +
>  
>  ;;
>  ;; Parallel half-precision floating point complex type operations
> @@ -6670,8 +6752,6 @@
> (set_attr "modrm" "0")
> (set_attr "memory" "none")])
>
> -(define_mode_iterator VBF_32_64 [V2BF (V4BF "TARGET_MMX_WITH_SSE")])
> -
>  ;; VDIVNEPBF16 does not 

Re: [PATCH] i386: Fix vfpclassph non-optimizied intrin

2024-09-03 Thread Hongtao Liu
On Tue, Sep 3, 2024 at 2:24 PM Haochen Jiang  wrote:
>
> Hi all,
>
> The intrinsic for the non-optimized path has a typo in the mask type, which
> will cause the high bits of __mmask32 to be unexpectedly zeroed.
>
> The test does not fail under -O0 with the current 1b testcase since the
> testcase is wrong. We need to include avx512-mask-type.h after SIZE is
> defined, or it will always be __mmask8. That problem also exists in the
> AVX10.2 testcases. I will write a separate patch to fix that.
>
> Bootstrapped and tested on x86-64-pc-linux-gnu. Ok for trunk?
Ok, please backport.
>
> Thx,
> Haochen
>
> gcc/ChangeLog:
>
> * config/i386/avx512fp16intrin.h
> (_mm512_mask_fpclass_ph_mask): Correct mask type to __mmask32.
> (_mm512_fpclass_ph_mask): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx512fp16-vfpclassph-1c.c: New test.
> ---
>  gcc/config/i386/avx512fp16intrin.h|  4 +-
>  .../i386/avx512fp16-vfpclassph-1c.c   | 77 +++
>  2 files changed, 79 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vfpclassph-1c.c
>
> diff --git a/gcc/config/i386/avx512fp16intrin.h 
> b/gcc/config/i386/avx512fp16intrin.h
> index 1869a920dd3..c3096b74ad2 100644
> --- a/gcc/config/i386/avx512fp16intrin.h
> +++ b/gcc/config/i386/avx512fp16intrin.h
> @@ -3961,11 +3961,11 @@ _mm512_fpclass_ph_mask (__m512h __A, const int __imm)
>  #else
>  #define _mm512_mask_fpclass_ph_mask(u, x, c)   \
>((__mmask32) __builtin_ia32_fpclassph512_mask ((__v32hf) (__m512h) (x), \
> -(int) (c),(__mmask8)(u)))
> +(int) (c),(__mmask32)(u)))
>
>  #define _mm512_fpclass_ph_mask(x, c)\
>((__mmask32) __builtin_ia32_fpclassph512_mask ((__v32hf) (__m512h) (x), \
> -(int) (c),(__mmask8)-1))
> +(int) (c),(__mmask32)-1))
>  #endif /* __OPIMTIZE__ */
>
>  /* Intrinsics vgetexpph.  */
> diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-vfpclassph-1c.c 
> b/gcc/testsuite/gcc.target/i386/avx512fp16-vfpclassph-1c.c
> new file mode 100644
> index 000..4739f1228e3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512fp16-vfpclassph-1c.c
> @@ -0,0 +1,77 @@
> +/* { dg-do run } */
> +/* { dg-options "-O0 -mavx512fp16" } */
> +/* { dg-require-effective-target avx512fp16 } */
> +
> +#define AVX512FP16
> +#include "avx512f-helper.h"
> +
> +#include 
> +#include 
> +#include 
> +#define SIZE (AVX512F_LEN / 16)
> +#include "avx512f-mask-type.h"
> +
> +#ifndef __FPCLASSPH__
> +#define __FPCLASSPH__
> +int check_fp_class_hp (_Float16 src, int imm)
> +{
> +  int qNaN_res = isnan (src);
> +  int sNaN_res = isnan (src);
> +  int Pzero_res = (src == 0.0);
> +  int Nzero_res = (src == -0.0);
> +  int PInf_res = (isinf (src) == 1);
> +  int NInf_res = (isinf (src) == -1);
> +  int Denorm_res = (fpclassify (src) == FP_SUBNORMAL);
> +  int FinNeg_res = __builtin_finite (src) && (src < 0);
> +
> +  int result = (((imm & 1) && qNaN_res)
> +   || (((imm >> 1) & 1) && Pzero_res)
> +   || (((imm >> 2) & 1) && Nzero_res)
> +   || (((imm >> 3) & 1) && PInf_res)
> +   || (((imm >> 4) & 1) && NInf_res)
> +   || (((imm >> 5) & 1) && Denorm_res)
> +   || (((imm >> 6) & 1) && FinNeg_res)
> +   || (((imm >> 7) & 1) && sNaN_res));
> +  return result;
> +}
> +#endif
> +
> +MASK_TYPE
> +CALC (_Float16 *s1, int imm)
> +{
> +  int i;
> +  MASK_TYPE res = 0;
> +
> +  for (i = 0; i < SIZE; i++)
> +if (check_fp_class_hp(s1[i], imm))
> +  res = res | (1 << i);
> +
> +  return res;
> +}
> +
> +void
> +TEST (void)
> +{
> +  int i;
> +  UNION_TYPE (AVX512F_LEN, h) src;
> +  MASK_TYPE res1, res2, res_ref = 0;
> +  MASK_TYPE mask = MASK_VALUE;
> +
> +  src.a[SIZE - 1] = NAN;
> +  src.a[SIZE - 2] = 1.0 / 0.0;
> +  for (i = 0; i < SIZE - 2; i++)
> +{
> +  src.a[i] = -24.43 + 0.6 * i;
> +}
> +
> +  res1 = INTRINSIC (_fpclass_ph_mask) (src.x, 0xFF);
> +  res2 = INTRINSIC (_mask_fpclass_ph_mask) (mask, src.x, 0xFF);
> +
> +  res_ref = CALC (src.a, 0xFF);
> +
> +  if (res_ref != res1)
> +abort ();
> +
> +  if ((mask & res_ref) != res2)
> +abort ();
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [r15-3359 Regression] FAIL: gcc.target/i386/avx10_2-bf-vector-cmpp-1.c (test for excess errors) on Linux/x86_64

2024-09-02 Thread Hongtao Liu
On Tue, Sep 3, 2024 at 9:45 AM Jiang, Haochen via Gcc-regression
 wrote:
>
> As with each AVX10.2 testcase previously, this is caused by the option
> combination warning, which is expected.
>
Can we put the warning for mixed usage of -mavx10 and -mavx512f under -Wpsabi,
and add -Wno-psabi in addition to -march=cascadelake to avoid the
false positive?

-- 
BR,
Hongtao


Re: [PATCH] i386: Support partial vectorized V2BF/V4BF smaxmin

2024-09-02 Thread Hongtao Liu
On Mon, Sep 2, 2024 at 4:42 PM Levy Hsu  wrote:
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
Ok.
>
> This patch supports sminmax for partial vectorized V2BF/V4BF.
>
> gcc/ChangeLog:
>
> * config/i386/mmx.md (3): New define_expand for 
> V2BF/V4BF smaxmin.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c: New test.
> ---
>  gcc/config/i386/mmx.md| 19 ++
>  .../avx10_2-partial-bf-vector-smaxmin-1.c | 36 +++
>  2 files changed, 55 insertions(+)
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c
>
> diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> index 9116ddb5321..3f12a1349ab 100644
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -2098,6 +2098,25 @@
>DONE;
>  })
>
> +(define_expand "3"
> +  [(set (match_operand:VBF_32_64 0 "register_operand")
> +(smaxmin:VBF_32_64
> +  (match_operand:VBF_32_64 1 "nonimmediate_operand")
> +  (match_operand:VBF_32_64 2 "nonimmediate_operand")))]
> +  "TARGET_AVX10_2_256"
> +{
> +  rtx op0 = gen_reg_rtx (V8BFmode);
> +  rtx op1 = lowpart_subreg (V8BFmode,
> +   force_reg (mode, operands[1]), mode);
> +  rtx op2 = lowpart_subreg (V8BFmode,
> +   force_reg (mode, operands[2]), mode);
> +
> +  emit_insn (gen_v8bf3 (op0, op1, op2));
> +
> +  emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode));
> +  DONE;
> +})
> +
>  (define_expand "sqrt2"
>[(set (match_operand:VHF_32_64 0 "register_operand")
> (sqrt:VHF_32_64
> diff --git 
> a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c
> new file mode 100644
> index 000..0a7cc58e29d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c
> @@ -0,0 +1,36 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mavx10.2 -Ofast" } */
> +/* { dg-final { scan-assembler-times "vmaxpbf16" 2 } } */
> +/* { dg-final { scan-assembler-times "vminpbf16" 2 } } */
> +
> +void
> +maxpbf16_64 (__bf16* restrict dest, __bf16* restrict src1, __bf16* restrict 
> src2)
> +{
> +  int i;
> +  for (i = 0; i < 4; i++)
> +dest[i] = src1[i] > src2[i] ? src1[i] : src2[i];
> +}
> +
> +void
> +maxpbf16_32 (__bf16* restrict dest, __bf16* restrict src1, __bf16* restrict 
> src2)
> +{
> +  int i;
> +  for (i = 0; i < 2; i++)
> +dest[i] = src1[i] > src2[i] ? src1[i] : src2[i];
> +}
> +
> +void
> +minpbf16_64 (__bf16* restrict dest, __bf16* restrict src1, __bf16* restrict 
> src2)
> +{
> +  int i;
> +  for (i = 0; i < 4; i++)
> +dest[i] = src1[i] < src2[i] ? src1[i] : src2[i];
> +}
> +
> +void
> +minpbf16_32 (__bf16* restrict dest, __bf16* restrict src1, __bf16* restrict 
> src2)
> +{
> +  int i;
> +  for (i = 0; i < 2; i++)
> +dest[i] = src1[i] < src2[i] ? src1[i] : src2[i];
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Support partial vectorized V2BF/V4BF plus/minus/mult/div/sqrt

2024-09-02 Thread Hongtao Liu
On Mon, Sep 2, 2024 at 4:33 PM Levy Hsu  wrote:
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> This patch introduces new mode iterators and expands for the i386 
> architecture to support partial vectorization of bf16 operations using 
> AVX10.2 instructions. These operations include addition, subtraction, 
> multiplication, division, and square root calculations for V2BF and V4BF data 
> types.
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/mmx.md (VBF_32_64): New mode iterator for partial 
> vectorized V2BF/V4BF.
> (3): New define_expand for plusminusmultdiv.
> (sqrt2): New define_expand for sqrt.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c: New test.
> * gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c: New test.
> ---
>  gcc/config/i386/mmx.md| 37 
>  .../avx10_2-partial-bf-vector-fast-math-1.c   | 22 +++
>  .../avx10_2-partial-bf-vector-operations-1.c  | 57 +++
>  3 files changed, 116 insertions(+)
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c
>
> diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> index e0065ed4d48..9116ddb5321 100644
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -94,6 +94,8 @@
>
>  (define_mode_iterator VHF_32_64 [V2HF (V4HF "TARGET_MMX_WITH_SSE")])
>
> +(define_mode_iterator VBF_32_64 [V2BF (V4BF "TARGET_MMX_WITH_SSE")])
> +
>  ;; Mapping from integer vector mode to mnemonic suffix
>  (define_mode_attr mmxvecsize
>[(V8QI "b") (V4QI "b") (V2QI "b")
> @@ -2036,6 +2038,26 @@
>DONE;
>  })
>
> +;; VDIVNEPBF16 does not generate floating point exceptions.
> +(define_expand "3"
> +  [(set (match_operand:VBF_32_64 0 "register_operand")
> +(plusminusmultdiv:VBF_32_64
> +  (match_operand:VBF_32_64 1 "nonimmediate_operand")
> +  (match_operand:VBF_32_64 2 "nonimmediate_operand")))]
> +  "TARGET_AVX10_2_256"
> +{
> +  rtx op0 = gen_reg_rtx (V8BFmode);
> +  rtx op1 = lowpart_subreg (V8BFmode,
> +   force_reg (mode, operands[1]), mode);
> +  rtx op2 = lowpart_subreg (V8BFmode,
> +   force_reg (mode, operands[2]), mode);
> +
> +  emit_insn (gen_v8bf3 (op0, op1, op2));
> +
> +  emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode));
> +  DONE;
> +})
> +
>  (define_expand "divv2hf3"
>[(set (match_operand:V2HF 0 "register_operand")
> (div:V2HF
> @@ -2091,6 +2113,21 @@
>DONE;
>  })
>
> +(define_expand "sqrt2"
> +  [(set (match_operand:VBF_32_64 0 "register_operand")
> +   (sqrt:VBF_32_64 (match_operand:VBF_32_64 1 "vector_operand")))]
> +  "TARGET_AVX10_2_256"
> +{
> +  rtx op0 = gen_reg_rtx (V8BFmode);
> +  rtx op1 = lowpart_subreg (V8BFmode,
> +   force_reg (mode, operands[1]), mode);
> +
> +  emit_insn (gen_sqrtv8bf2 (op0, op1));
> +
> +  emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode));
> +  DONE;
> +})
> +
>  (define_expand "2"
>[(set (match_operand:VHF_32_64 0 "register_operand")
> (absneg:VHF_32_64
> diff --git 
> a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c
> new file mode 100644
> index 000..fd064f17445
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mavx10.2 -O2" } */
> +/* { dg-final { scan-assembler-times "vmulnepbf16\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[
>  \\t\]+#)" 2 } } */
> +/* { dg-final { scan-assembler-times "vrcppbf16\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */
> +
> +typedef __bf16 v4bf __attribute__ ((__vector_size__ (8)));
> +typedef __bf16 v2bf __attribute__ ((__vector_size__ (4)));
> +
> +
> +__attribute__((optimize("fast-math")))
> +v4bf
> +foo_div_fast_math_4 (v4bf a, v4bf b)
> +{
> +  return a / b;
> +}
> +
> +__attribute__((optimize("fast-math")))
> +v2bf
> +foo_div_fast_math_2 (v2bf a, v2bf b)
> +{
> +  return a / b;
> +}
> diff --git 
> a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c
> new file mode 100644
> index 000..e7ee08a20a9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c
> @@ -0,0 +1,57 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mavx10.2 -O2" } */
> +/* { dg-final { scan-assembler-times "vmulnepbf16\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[
>  \\t\]+#)" 2 } } */
> +/* { dg-final { scan-assembler-times "vaddnepbf16\[

Re: [PATCH 0/8] i386: Optimize code with AVX10.2 new instructions

2024-09-01 Thread Hongtao Liu
On Mon, Aug 26, 2024 at 2:43 PM Haochen Jiang  wrote:
>
> Hi all,
>
> I have just committed the AVX10.2 new instruction patches into trunk hours
> ago. The next and final part for AVX10.2 upstream is to optimize code
> with AVX10.2 new instructions.
>
> In this patch series, it will contain the following optimizations:
>
>   - VNNI instruction auto vectorize (PATCH 1).
>   - Codegen optimization with new scalar comparison instructions to
> eliminate redundant code (PATCH 2-3).
>   - BF16 instruction auto vectorize (PATCH 4-8).
>
> This will finish the upstream for AVX10.2 series.
>
> Afterwards, we may add V2BF/V4BF in another thread just like what we
> have done for V2HF/V4HF when AVX512FP16 was upstreamed.
>
> Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk?
Ok for all 8 patches.
>
> Thx,
> Haochen
>
>


-- 
BR,
Hongtao


Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

2024-08-25 Thread Hongtao Liu
On Fri, Aug 23, 2024 at 5:46 PM HAO CHEN GUI  wrote:
>
> Hi Hongtao,
>
> > On 2024/8/23 11:47, Hongtao Liu wrote:
> > On Fri, Aug 23, 2024 at 11:03 AM HAO CHEN GUI  wrote:
> >>
> >> Hi Hongtao,
> >>
> >> On 2024/8/23 9:47, Hongtao Liu wrote:
> >>> On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI  
> >>> wrote:
> >>>>
> >>>> Hi Hongtao,
> >>>>
> >>>> On 2024/8/21 11:21, Hongtao Liu wrote:
> >>>>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
> >>>>> rebase your patch and see if there's still the regressions.
> >>>>
> >>>> There's still regressions. The patch enables V16QI const0 store, but
> >>>> it also enables V8QI const0 store. The vector mode is preferable than
> >>>> scalar mode so that V8QI is used for 8-byte memory clear instead of
> >>>> DI. It's sub-optimal.
> >>> Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
> >> Not sure if all targets prefer it. Richard & Jeff, what's your opinion?
> >>
> >> IMHO, could we disable it from predicate or convert it to DI mode store
> >> if V8QI const0 store is sub-optimal on i386?
> >>
> >>
> >>>>
> >>>> Another issue is it sometimes takes lots of subregs to generate an
> >>>> all-zero V16QI register. As PR92080 has been fixed, it can't reuse
> >>>> existing all-zero V16QI register.
> > Backend rtx_cost needs to be adjusted to prevent const0 propagation.
> > The current rtx_cost for const0 for i386 is 0, which will enable
> > propagation of const0.
> >
> >/* If MODE2 is appropriate for an MMX register, then tie
> > @@ -21588,10 +21590,12 @@ ix86_rtx_costs (rtx x, machine_mode mode,
> > int outer_code_i, int opno,
> > case 0:
> >   break;
> > case 1:  /* 0: xor eliminates false dependency */
> > - *total = 0;
> > + /* Add extra cost 1 to prevent propagation of CONST_VECTOR
> > +for SET, which will enable more CSE optimization.  */
> > + *total = 0 + (outer_code == SET);
> >   return true;
> > default: /* -1: cmp contains false dependency */
> > - *total = 1;
> > + *total = 1 + (outer_code == SET);
> >   return true;
> > }
> >
> > the upper hunk should help for that.
> Sorry, I didn't get your point. Which problem it will fix? I tested
> upper code. Nothing changed. Which kind of const0 propagation you want
> to prevent?
The patch itself doesn't enable CSE for const0_rtx, but it's needed
after cse_insn recognizes CONST0_RTX with a different mode and
replaces them with subreg.
I thought you had changed the cse_insn part.
 On the other hand, pxor is cheap, what matters more is the CSE of
broadcasting the same value to different modes. i.e.

__m512i sinkz;
__m256i sinky;
void foo(char c) {
sinkz = _mm512_set1_epi8(c);
sinky = _mm256_set1_epi8(c);
}

>
> Thanks
> Gui Haochen
>
> >>>>
> >>>> (insn 16 15 17 (set (reg:V4SI 118)
> >>>> (const_vector:V4SI [
> >>>> (const_int 0 [0]) repeated x4
> >>>> ])) "auto-init-7.c":25:12 -1
> >>>>  (nil))
> >>>>
> >>>> (insn 17 16 18 (set (reg:V8HI 117)
> >>>> (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
> >>>>  (nil))
> >>>>
> >>>> (insn 18 17 19 (set (reg:V16QI 116)
> >>>> (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
> >>>>  (nil))
> >>>>
> >>>> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
> >>>> (const_int 12 [0xc])) [0 MEM  [(void 
> >>>> *)&temp3]+12 S16 A32])
> >>>> (reg:V16QI 116)) "auto-init-7.c":25:12 -1
> >>>>  (nil))
> >>> I think those subregs can be simplified by later rtl passes?
> >>
> >> Here is the final dump. There are two all-zero 16-byte vector
> >> registers. It can't figure out V4SI could be a subreg of V16QI.
> >>
> >> (insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115])
> >> (const_vector:V16QI [
> >> (const_int 0 [0]) repeated x16
> >> ])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
> >>  (nil))
> >> (insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM  
> >> [(void *)&temp3]+0 S16 A128])
> >> (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 
> >> {movv16qi_internal}
> >>  (nil))
> >> (insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118])
> >> (const_vector:V4SI [
> >> (const_int 0 [0]) repeated x4
> >> ])) "auto-init-7.c":25:12 2160 {movv4si_internal}
> >>  (nil))
> >> (insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114])
> >> (const_int 12 [0xc])) [0 MEM  [(void 
> >> *)&temp3]+12 S16 A32])
> >> (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 
> >> {movv16qi_internal}
> >>
> >> Thanks
> >> Gui Haochen
> >>
> >>>>
> >>>> Thanks
> >>>> Gui Haochen
> >>>
> >>>
> >>>
> >
> >
> >



-- 
BR,
Hongtao


Re: [PATCH 00/12] AVX10.2: Support new instructions

2024-08-25 Thread Hongtao Liu
On Mon, Aug 19, 2024 at 4:57 PM Haochen Jiang  wrote:
>
> Hi all,
>
> The AVX10.2 ymm rounding patches have been merged to trunk around
> 6 hours ago. As mentioned before, next step will be AVX10.2 new
> instruction support.
>
> This patch series could be divided into three part.
>
> The first patch will refactor m512-check.h under testsuite to reuse
> AVX-512 helper functions and unions and avoid ABI warnings when using
> AVX10.
>
> The following ten patches will support all AVX10.2 new instructions,
> including:
>
>   - AI Datatypes, Conversions, and post-Convolution Instructions.
>   - Media Acceleration.
>   - IEEE-754-2019 Minimum and Maximum Support.
>   - Saturating Conversions.
>   - Zero-extending Partial Vector Copies.
>   - FP Scalar Comparison.
>
> For FP Scalar Comparison part (a.k.a comx instructions), we will only
> provide pattern support but not intrin support since it is redundant
> with comi ones for common usage. We will also add some optimizations
> afterwards for common usage with comx instructions. If there are some
> strong requests, we will add intrin support in the future.
>
> The final patch will add bf8 -> fp16 intrin for convenience. Since the
> conversion from bf8 to fp16 is only casting for fraction part due to
> same bits for exponent part, we will use a sequence of instructions
> instead of new instructions. It is just like the scenario for bf16 ->
> fp32 conversion.
>
> After all these patches are merged, the next step would be optimizations based
> on AVX10.2 new instructions, including vnni vectorization, bf16
> vectorization, comx optimization, etc.
>
> Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk?
Ok for all 12 patches.
>
> Thx,
> Haochen
>


-- 
BR,
Hongtao


Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

2024-08-22 Thread Hongtao Liu
On Fri, Aug 23, 2024 at 11:03 AM HAO CHEN GUI  wrote:
>
> Hi Hongtao,
>
> 在 2024/8/23 9:47, Hongtao Liu 写道:
> > On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI  wrote:
> >>
> >> Hi Hongtao,
> >>
> >> 在 2024/8/21 11:21, Hongtao Liu 写道:
> >>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
> >>> rebase your patch and see if there's still the regressions.
> >>
> >> There are still regressions. The patch enables V16QI const0 store, but
> >> it also enables V8QI const0 store. The vector mode is preferable to the
> >> scalar mode so that V8QI is used for 8-byte memory clear instead of
> >> DI. It's sub-optimal.
> > Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
> Not sure if all targets prefer it. Richard & Jeff, what's your opinion?
>
> IMHO, could we disable it from predicate or convert it to DI mode store
> if V8QI const0 store is sub-optimal on i386?
>
>
> >>
> >> Another issue is it sometimes takes lots of subregs to generate an
> >> all-zero V16QI register. As PR92080 has been fixed, it can't reuse
> >> existing all-zero V16QI register.
Backend rtx_cost needs to be adjusted to prevent const0 propagation.
The current rtx_cost for const0 for i386 is 0, which will enable
propagation of const0.

   /* If MODE2 is appropriate for an MMX register, then tie
@@ -21588,10 +21590,12 @@ ix86_rtx_costs (rtx x, machine_mode mode,
int outer_code_i, int opno,
case 0:
  break;
case 1:  /* 0: xor eliminates false dependency */
- *total = 0;
+ /* Add extra cost 1 to prevent propagation of CONST_VECTOR
+for SET, which will enable more CSE optimization.  */
+ *total = 0 + (outer_code == SET);
  return true;
default: /* -1: cmp contains false dependency */
- *total = 1;
+ *total = 1 + (outer_code == SET);
  return true;
}

the upper hunk should help for that.
> >>
> >> (insn 16 15 17 (set (reg:V4SI 118)
> >> (const_vector:V4SI [
> >> (const_int 0 [0]) repeated x4
> >> ])) "auto-init-7.c":25:12 -1
> >>  (nil))
> >>
> >> (insn 17 16 18 (set (reg:V8HI 117)
> >> (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
> >>  (nil))
> >>
> >> (insn 18 17 19 (set (reg:V16QI 116)
> >> (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
> >>  (nil))
> >>
> >> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
> >> (const_int 12 [0xc])) [0 MEM  [(void 
> >> *)&temp3]+12 S16 A32])
> >> (reg:V16QI 116)) "auto-init-7.c":25:12 -1
> >>  (nil))
> > I think those subregs can be simplified by later rtl passes?
>
> Here is the final dump. There are two all-zero 16-byte vector
> registers. It can't figure out V4SI could be a subreg of V16QI.
>
> (insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115])
> (const_vector:V16QI [
> (const_int 0 [0]) repeated x16
> ])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
>  (nil))
> (insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM  
> [(void *)&temp3]+0 S16 A128])
> (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 
> {movv16qi_internal}
>  (nil))
> (insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118])
> (const_vector:V4SI [
> (const_int 0 [0]) repeated x4
> ])) "auto-init-7.c":25:12 2160 {movv4si_internal}
>  (nil))
> (insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114])
> (const_int 12 [0xc])) [0 MEM  [(void *)&temp3]+12 
> S16 A32])
> (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 
> {movv16qi_internal}
>
> Thanks
> Gui Haochen
>
> >>
> >> Thanks
> >> Gui Haochen
> >
> >
> >



-- 
BR,
Hongtao


Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

2024-08-22 Thread Hongtao Liu
On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI  wrote:
>
> Hi Hongtao,
>
> 在 2024/8/21 11:21, Hongtao Liu 写道:
> > r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
> > rebase your patch and see if there's still the regressions.
>
> There are still regressions. The patch enables V16QI const0 store, but
> it also enables V8QI const0 store. The vector mode is preferable to the
> scalar mode so that V8QI is used for 8-byte memory clear instead of
> DI. It's sub-optimal.
Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
>
> Another issue is it sometimes takes lots of subregs to generate an
> all-zero V16QI register. As PR92080 has been fixed, it can't reuse
> existing all-zero V16QI register.
>
> (insn 16 15 17 (set (reg:V4SI 118)
> (const_vector:V4SI [
> (const_int 0 [0]) repeated x4
> ])) "auto-init-7.c":25:12 -1
>  (nil))
>
> (insn 17 16 18 (set (reg:V8HI 117)
> (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
>  (nil))
>
> (insn 18 17 19 (set (reg:V16QI 116)
> (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
>  (nil))
>
> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
> (const_int 12 [0xc])) [0 MEM  [(void *)&temp3]+12 
> S16 A32])
> (reg:V16QI 116)) "auto-init-7.c":25:12 -1
>  (nil))
I think those subregs can be simplified by later rtl passes?
>
> Thanks
> Gui Haochen



-- 
BR,
Hongtao


Re: [PATCH] Align ix86_{move_max,store_max} with vectorizer.

2024-08-21 Thread Hongtao Liu
On Wed, Aug 21, 2024 at 4:49 PM Richard Biener
 wrote:
>
> On Wed, Aug 21, 2024 at 7:40 AM liuhongt  wrote:
> >
> > When none of mprefer-vector-width, avx256_optimal/avx128_optimal,
> > avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC will
> > set ix86_{move_max,store_max} as max available vector length except
> > for AVX part.
> >
> >   if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
> >   && TARGET_EVEX512_P (opts->x_ix86_isa_flags2))
> > opts->x_ix86_move_max = PVW_AVX512;
> >   else
> > opts->x_ix86_move_max = PVW_AVX128;
> >
> > So for -mavx2, the vectorizer will choose 256-bit for vectorization, but
> > 128-bit is used for struct copy, so there could be a potential STLF issue
> > due to this misalignment.
> >
> > The patch fixes that and improved 538.imagick_r by ~30% for 
> > -march=x86-64-v3 -O2.
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Any comments?
>
> Should we look at the avx128_optimal tune and/or avx256_split_regs and
> avx256_optimal
> also for 512?  Because IIRC the vectorizers default looks at that as
> well (OTOH larger
> stores should be fine for STLF).
For Double Pumped processors, i.e. SRF, there's no STLF issue for
128-bit store and 256-bit load since the 256-bit load is torn down
into two 128-bit loads.
I guess it should be similar for Znver1/Znver4, so it should be fine
with the mismatch between struct copy and vectorizer size.
One exception is that we use 256-bit for vectorization and 512-bit for
struct copy on SPR, it could be an issue when the struct copy is after
the vectorization.
But I didn't observe any such cases yet, and for the non-STLF-stall case,
a 512-bit copy should be better than a 256-bit copy on SPR, so I'll leave
it there. (There's a plan to enable 512-bit vectorization for SPR by
default; it's ongoing.)
>
> > gcc/ChangeLog:
> >
> > * config/i386/i386-options.cc (ix86_option_override_internal):
> > set ix86_{move_max,store_max} to PVW_AVX256 when TARGET_AVX
> > instead of PVW_AVX128.
> >
> > gcc/testsuite/ChangeLog:
> > * gcc.target/i386/pieces-memcpy-10.c: Add -mprefer-vector-width=128.
> > * gcc.target/i386/pieces-memcpy-6.c: Ditto.
> > * gcc.target/i386/pieces-memset-38.c: Ditto.
> > * gcc.target/i386/pieces-memset-40.c: Ditto.
> > * gcc.target/i386/pieces-memset-41.c: Ditto.
> > * gcc.target/i386/pieces-memset-42.c: Ditto.
> > * gcc.target/i386/pieces-memset-43.c: Ditto.
> > * gcc.target/i386/pieces-strcpy-2.c: Ditto.
> > * gcc.target/i386/pieces-memcpy-22.c: New test.
> > * gcc.target/i386/pieces-memset-51.c: New test.
> > * gcc.target/i386/pieces-strcpy-3.c: New test.
> > ---
> >  gcc/config/i386/i386-options.cc  |  6 ++
> >  gcc/testsuite/gcc.target/i386/pieces-memcpy-10.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pieces-memcpy-22.c | 12 
> >  gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c  |  2 +-
> >  gcc/testsuite/gcc.target/i386/pieces-memset-38.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pieces-memset-40.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pieces-memset-41.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pieces-memset-42.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pieces-memset-43.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pieces-memset-51.c | 12 
> >  gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c  |  2 +-
> >  gcc/testsuite/gcc.target/i386/pieces-strcpy-3.c  | 15 +++
> >  12 files changed, 53 insertions(+), 8 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-22.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-51.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pieces-strcpy-3.c
> >
> > diff --git a/gcc/config/i386/i386-options.cc 
> > b/gcc/config/i386/i386-options.cc
> > index f423455b363..f79257cc764 100644
> > --- a/gcc/config/i386/i386-options.cc
> > +++ b/gcc/config/i386/i386-options.cc
> > @@ -3023,6 +3023,9 @@ ix86_option_override_internal (bool main_args_p,
> >   if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
> >   && TARGET_EVEX512_P (opts->x_ix86_isa_flags2))
> > opts->x_ix86_move_max = PVW_AVX512;
> > + /* Align with vectorizer to avoid potential STLF issue.  */
> > + else if (TARGET_AVX_P (opts->x_ix86_isa_flags))
> > +   opts->x_ix86_move_max = PVW_AVX256;
> >   else
> > opts->x_ix86_move_max = PVW_AVX128;
> > }
> > @@ -3047,6 +3050,9 @@ ix86_option_override_internal (bool main_args_p,
> >   if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
> >   && TARGET_EVEX512_P (opts->x_ix86_isa_flags2))
> > opts->x_ix86_store_max = PVW_AVX512;
> > + /* Align with vectorizer to avoid potential STLF issue.  */
> > + else if (TARGET_AVX_P (opts->x_ix86_isa_

Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

2024-08-20 Thread Hongtao Liu
On Tue, Aug 20, 2024 at 2:50 PM Hongtao Liu  wrote:
>
> On Tue, Aug 20, 2024 at 2:12 PM HAO CHEN GUI  wrote:
> >
> > Hi,
> >   Add Hongtao Liu as the patch affects x86.
> >
> > 在 2024/8/20 6:32, Richard Sandiford 写道:
> > > HAO CHEN GUI  writes:
> > >> Hi,
> > >>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
> > >> vec_duplicate handles duplicates of non-constant inputs. But 0 is a
> > >> constant. So even a platform doesn't support vec_duplicate, it could
> > >> still do clear by pieces if it supports const0 move by that mode.
> > >>
> > >>   Compared to the previous version, the main change is to set up a
> > >> new function to generate const0 for certain modes and use the function
> > >> as by_pieces_constfn for CLEAR_BY_PIECES.
> > >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
> > >>
> > >>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> > >> regressions.
> > >>
> > >>   On i386, it got several regressions. One issue is the predicate of
> > >> V16QI move expand doesn't include const0. Thus V16QI mode can't be used
> > >> for clear by pieces with the patch. The second issue is the const0 is
> > >> passed directly to the move expand with the patch. Originally it is
> > >> forced to a pseudo and i386 can leverage the previous data to do
> > >> optimization.
> > >
> > > The patch looks good to me, but I suppose we'll need to decide what
> > > to do about x86.
> > >
> > > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > > source, especially since ix86_expand_vector_mode does have code to
> > > cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> > > predicates anyway, so the predicate will have little effect.
> > >
> > > A workaround would be to check legitimate_constant_p instead of the
> > > predicate, but I'm not sure that that should be necessary.
> > >
> > > Has this already been discussed?  If not, we should loop in the x86
> > > maintainers (but I didn't do that here in case it would be a repeat).
> >
> > I also noticed it. Not sure why movv16qi requires a
> > nonimmediate_operand, while ix86_expand_vector_mode could deal with
> > constant op. Looking forward to Hongtao's comments.
> The code has been there since 2005, before I was involved.
>  It looks to me at the beginning both mov<mode> and
> *mov<mode>_internal only supported nonimmediate_operand for
> operands[1].
> And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to
> nonimmediate_or_sse_const_operand for *mov<mode>_internal, but not for
> mov<mode>.
> I think we can align the predicate between mov<mode> and *mov<mode>_internal.
> I'll do some tests and reach back to you.
r15-3058-gbb42c551905024 supports const0 operand for movv16qi, please
rebase your patch and see if there's still the regressions.
> >
> > >
> > > As far as the second issue goes, I suppose there are at least three
> > > ways of handling shared constants:
> > >
> > > (1) Force the zero into a register and leave later optimisations to
> > > propagate the zero where profitable.
> > The zero can be propagated into the store, but the address adjustment
> > may not be combined into insn properly. For instance, if zero is
> > forced to a register, "movv2x8qi" insn is generated. The address
> > adjustment becomes a separate insn as "movv2x8qi" insn doesn't support
> > d-form address. When zero is propagated, it converts "movv2x8qi" to
> > "movti". "movti" supports d-form as well as post/inc address. Probably,
> > the auto_inc_dec pass combines address adjustment insn into previous
> > "movti" to generate a post inc "movti". The expected optimization might
> > be to combine address adjustment insn into second "movti" and generate a
> > d-form "movti". It's a regression issue I found in aarch64.
> >
> > Also we check if const0 is supported for mov optab. But finally we
> > force the const0 to a register and generate a store with the register.
> > Seems it's not reasonable.
> >
> > >
> > > (2) Emit stores of zero and expect a later pass to share constants
> > > where beneficial.
> > Not sure which pass can optimize it.
> >
> > >
> > > (3) Generate stores of zero and leave t

Re: [PATCH] Align predicates for operands[1] between mov<mode> and *mov<mode>_internal.

2024-08-20 Thread Hongtao Liu
On Tue, Aug 20, 2024 at 6:25 PM liuhongt  wrote:
>
> From [1]
[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660575.html

> > > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > > source, especially since ix86_expand_vector_mode does have code to
> > > cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> > > predicates anyway, so the predicate will have little effect.
> > >
> > > A workaround would be to check legitimate_constant_p instead of the
> > > predicate, but I'm not sure that that should be necessary.
> > >
> > > Has this already been discussed?  If not, we should loop in the x86
> > > maintainers (but I didn't do that here in case it would be a repeat).
> >
> > I also noticed it. Not sure why movv16qi requires a
> > nonimmediate_operand, while ix86_expand_vector_mode could deal with
> > constant op. Looking forward to Hongtao's comments.
> The code has been there since 2005, before I was involved.
>  It looks to me at the beginning both mov<mode> and
> *mov<mode>_internal only supported nonimmediate_operand for
> operands[1].
> And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to
> nonimmediate_or_sse_const_operand for *mov<mode>_internal, but not for
> mov<mode>. I think we can align the predicate between mov<mode>
> and *mov<mode>_internal.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (mov<mode>): Align predicates for
> operands[1] between mov<mode> and *mov<mode>_internal.
> ---
>  gcc/config/i386/sse.md | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index d1010bc5682..7ecfbd55809 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1387,7 +1387,7 @@ (define_mode_attr DOUBLEMASKMODE
>
>  (define_expand "mov"
>[(set (match_operand:VMOVE 0 "nonimmediate_operand")
> -   (match_operand:VMOVE 1 "nonimmediate_operand"))]
> +   (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"))]
>"TARGET_SSE"
>  {
>ix86_expand_vector_move (mode, operands);
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

2024-08-19 Thread Hongtao Liu
On Tue, Aug 20, 2024 at 2:12 PM HAO CHEN GUI  wrote:
>
> Hi,
>   Add Hongtao Liu as the patch affects x86.
>
> 在 2024/8/20 6:32, Richard Sandiford 写道:
> > HAO CHEN GUI  writes:
> >> Hi,
> >>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
> >> vec_duplicate handles duplicates of non-constant inputs. But 0 is a
> >> constant. So even if a platform doesn't support vec_duplicate, it could
> >> still do clear by pieces if it supports const0 move by that mode.
> >>
> >>   Compared to the previous version, the main change is to set up a
> >> new function to generate const0 for certain modes and use the function
> >> as by_pieces_constfn for CLEAR_BY_PIECES.
> >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
> >>
> >>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> >> regressions.
> >>
> >>   On i386, it got several regressions. One issue is the predicate of
> >> V16QI move expand doesn't include const0. Thus V16QI mode can't be used
> >> for clear by pieces with the patch. The second issue is the const0 is
> >> passed directly to the move expand with the patch. Originally it is
> >> forced to a pseudo and i386 can leverage the previous data to do
> >> optimization.
> >
> > The patch looks good to me, but I suppose we'll need to decide what
> > to do about x86.
> >
> > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > source, especially since ix86_expand_vector_mode does have code to
> > cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> > predicates anyway, so the predicate will have little effect.
> >
> > A workaround would be to check legitimate_constant_p instead of the
> > predicate, but I'm not sure that that should be necessary.
> >
> > Has this already been discussed?  If not, we should loop in the x86
> > maintainers (but I didn't do that here in case it would be a repeat).
>
> I also noticed it. Not sure why movv16qi requires a
> nonimmediate_operand, while ix86_expand_vector_mode could deal with
> constant op. Looking forward to Hongtao's comments.
The code has been there since 2005, before I was involved.
 It looks to me at the beginning both mov<mode> and
*mov<mode>_internal only supported nonimmediate_operand for
operands[1].
And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to
nonimmediate_or_sse_const_operand for *mov<mode>_internal, but not for
mov<mode>.
I think we can align the predicate between mov<mode> and *mov<mode>_internal.
I'll do some tests and reach back to you.
>
> >
> > As far as the second issue goes, I suppose there are at least three
> > ways of handling shared constants:
> >
> > (1) Force the zero into a register and leave later optimisations to
> > propagate the zero where profitable.
> The zero can be propagated into the store, but the address adjustment
> may not be combined into insn properly. For instance, if zero is
> forced to a register, "movv2x8qi" insn is generated. The address
> adjustment becomes a separate insn as "movv2x8qi" insn doesn't support
> d-form address. When zero is propagated, it converts "movv2x8qi" to
> "movti". "movti" supports d-form as well as post/inc address. Probably,
> the auto_inc_dec pass combines address adjustment insn into previous
> "movti" to generate a post inc "movti". The expected optimization might
> be to combine address adjustment insn into second "movti" and generate a
> d-form "movti". It's a regression issue I found in aarch64.
>
> Also we check if const0 is supported for mov optab. But finally we
> force the const0 to a register and generate a store with the register.
> Seems it's not reasonable.
>
> >
> > (2) Emit stores of zero and expect a later pass to share constants
> > where beneficial.
> Not sure which pass can optimize it.
>
> >
> > (3) Generate stores of zero and leave the target expanders to force
> > constants into registers on the fly if reuse seems plausibly
> > beneficial.
> >
> The zero constants with different modes are not related. Not sure
> which pass can optimize it. The compiler should be taught that
> reg 102 can be expressed as a subreg of reg 100.
>
> (insn 6 5 7 2 (set (reg:V32QI 100)
> (const_vector:V32QI [
> (const_int 0 [0]) repeated x32
> ]))
>
> (insn 8 7 0 2 (set (reg:V16QI 102)
> (const_vector:V16QI [
> (const_int 0 [0]) r

Re: [PATCH 00/22] Support AVX10.2 ymm rounding

2024-08-18 Thread Hongtao Liu
On Wed, Aug 14, 2024 at 5:07 PM Haochen Jiang  wrote:
>
> Hi all,
>
> The initial patch for AVX10.2 has been merged this week.
>
> For the upcoming patches, we will first upstream ymm rounding control part.
>
> In ymm rounding part, ALL the instructions in AVX512 with 512-bit rounding
> control will also have 256-bit rounding control in AVX10.2.
>
> For clearness, the patch order is based on alphabetical order. Each patch
> will include its intrin definition and related tests. Sometimes pattern is
> not changed in the patch because the previous change in the patch series
> has already enabled the 256 bit rounding in the pattern.
>
> Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk?
Ok for all 22 patches in the thread.
>
> Thx,
> Haochen
>
> Ref: Intel Advanced Vector Extensions 10.2 Architecture Specification
> https://cdrdv2.intel.com/v1/dl/getContent/828965
>
>


-- 
BR,
Hongtao


Re: [PATCH v2] [x86] Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need secondary reload.

2024-08-15 Thread Hongtao Liu
On Thu, Aug 15, 2024 at 3:27 PM liuhongt  wrote:
>
> It results in 2 failures for x86_64-pc-linux-gnu{-march=cascadelake};
>
> gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o
> gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1
>
> For pr113560.c, now GCC generates mulx instead of mulq with
> -march=cascadelake, which should be optimal, so adjust testcase for
> that.
> For gcc.target/i386/extendditi2-1.c, RA happens to choose another
> register instead of rax and result in
>
> movq%rdi, %rbp
> movq%rdi, %rax
> sarq$63, %rbp
> movq%rbp, %rdx
>
> The patch adds a new define_peephole2 for that.
>
> gcc/ChangeLog:
>
> PR target/116274
> * config/i386/i386-expand.cc (ix86_expand_vector_move):
> Restrict special case TImode to 128-bit vector conversions via
> V2DI under ix86_pre_reload_split ().
> * config/i386/i386.cc (inline_secondary_memory_needed):
> Movement between GENERAL_REGS and SSE_REGS for TImode doesn't
> need secondary reload.
> * config/i386/i386.md (*extendsidi2_rex64): Add a
> define_peephole2 after it.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116274.c: New test.
> * gcc.target/i386/pr113560.c: Scan either mulq or mulx.
> ---
>  gcc/config/i386/i386-expand.cc   |  2 +-
>  gcc/config/i386/i386.cc  | 18 --
>  gcc/config/i386/i386.md  | 19 +++
>  gcc/testsuite/gcc.target/i386/pr113560.c |  2 +-
>  gcc/testsuite/gcc.target/i386/pr116274.c | 12 
>  5 files changed, 45 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index bdbc1423267..ed546eeed6b 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -751,7 +751,7 @@ ix86_expand_vector_move (machine_mode mode, rtx 
> operands[])
>&& SUBREG_P (op1)
>&& GET_MODE (SUBREG_REG (op1)) == TImode
>&& TARGET_64BIT && TARGET_SSE
> -  && can_create_pseudo_p ())
> +  && ix86_pre_reload_split ())
>  {
>rtx tmp = gen_reg_rtx (V2DImode);
>rtx lo = gen_reg_rtx (DImode);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index f044826269c..4821892d1e0 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -20292,6 +20292,18 @@ inline_secondary_memory_needed (machine_mode mode, 
> reg_class_t class1,
>if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2)))
> return true;
>
> +  /* If the target says that inter-unit moves are more expensive
> +than moving through memory, then don't generate them.  */
> +  if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
> + || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
> +   return true;
> +
> +  /* Under SSE4.1, *movti_internal supports movement between
> +SSE_REGS and GENERAL_REGS with pinsrq and pextrq.  */
> +  if (TARGET_SSE4_1
> + && (TARGET_64BIT ? mode == TImode : mode == DImode))
> +   return false;
> +
>int msize = GET_MODE_SIZE (mode);
>
>/* Between SSE and general, we have moves no larger than word size.  */
> @@ -20304,12 +20316,6 @@ inline_secondary_memory_needed (machine_mode mode, 
> reg_class_t class1,
>
>if (msize < minsize)
> return true;
> -
> -  /* If the target says that inter-unit moves are more expensive
> -than moving through memory, then don't generate them.  */
> -  if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC)
> - || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC))
> -   return true;
>  }
>
>return false;
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index db7789c17d2..1962a7ba5c9 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -5041,6 +5041,25 @@ (define_split
>DONE;
>  })
>
> +(define_peephole2
> +  [(set (match_operand:DI 0 "general_reg_operand")
> +   (match_operand:DI 1 "general_reg_operand"))
> +   (parallel [(set (match_dup 0)
> +  (ashiftrt:DI (match_dup 0)
> +   (const_int 63)))
> +  (clobber (reg:CC FLAGS_REG))])
> +   (set (match_operand:DI 2 "general_reg_operand") (match_dup 1))
> +   (set (match_operand:DI 3 "general_reg_operand") (match_dup 0))]
> +  "(optimize_function_for_size_p (cfun) || TARGET_USE_CLTD)
> +   && REGNO (operands[2]) == AX_REG
> +   && REGNO (operands[3]) == DX_REG
> +   && peep2_reg_dead_p (4, operands[0])
> +   && !reg_mentioned_p (operands[0], operands[1])
> +   && !reg_mentioned_p (operands[2], operands[0])"
> +  [(set (match_dup 2) (match_dup 1))
> +   (parallel [(set (match_dup 3) (ashiftrt:DI (match_dup 2) (const_int 63)))
> + (clobber (reg:CC FLAGS

Re: [PATCH v2] i386: Fix some vex insns that prohibit egpr

2024-08-14 Thread Hongtao Liu
On Wed, Aug 14, 2024 at 4:23 PM Kong, Lingling  wrote:
>
>
>
> -Original Message-
> From: Kong, Lingling 
> Sent: Wednesday, August 14, 2024 4:20 PM
> To: Kong, Lingling 
> Subject: [PATCH v2] i386: Fix some vex insns that prohibit egpr
>
> Although these vex insns have evex counterparts, when the explicit
> vex prefix is used they should not support APX EGPR.
> Like TARGET_AVXVNNI, TARGET_AVXIFMA and TARGET_AVXNECONVERT.
> TARGET_AVXVNNIINT8 and TARGET_AVXVNNIINT16 are also vex-only insns that
> should not support egpr.
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (vpmadd52):
> Prohibit egpr for vex version.
> (vpdpbusd_): Ditto.
> (vpdpbusds_): Ditto.
> (vpdpwssd_): Ditto.
> (vpdpwssds_): Ditto.
> (*vcvtneps2bf16_v4sf): Ditto.
> (vcvtneps2bf16_v8sf): Ditto.
> (vpdp_): Ditto.
> (vbcstnebf162ps_): Ditto.
> (vbcstnesh2ps_): Ditto.
> (vcvtnee2ps_): Ditto.
> (vcvtneo2ps_): Ditto.
> (vpdp_): Ditto.
> ---
>  gcc/config/i386/sse.md | 49 +++---
>  1 file changed, 32 insertions(+), 17 deletions(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 
> d1010bc5682..f0d94bba4e7 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -29886,7 +29886,7 @@
> (unspec:VI8_AVX2
>   [(match_operand:VI8_AVX2 1 "register_operand" "0,0")
>(match_operand:VI8_AVX2 2 "register_operand" "x,v")
> -  (match_operand:VI8_AVX2 3 "nonimmediate_operand" "xm,vm")]
> +  (match_operand:VI8_AVX2 3 "nonimmediate_operand" "xjm,vm")]
>   VPMADD52))]
>"TARGET_AVXIFMA || (TARGET_AVX512IFMA && TARGET_AVX512VL)"
>"@
> @@ -29894,6 +29894,7 @@
>vpmadd52\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr "isa" "avxifma,avx512ifmavl")
> (set_attr "type" "ssemuladd")
> +   (set_attr "addr" "gpr16,*")
> (set_attr "prefix" "vex,evex")
> (set_attr "mode" "")])
>
> @@ -30253,13 +30254,14 @@
> (unspec:VI4_AVX2
>   [(match_operand:VI4_AVX2 1 "register_operand" "0,0")
>(match_operand:VI4_AVX2 2 "register_operand" "x,v")
> -  (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xm,vm")]
> +  (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xjm,vm")]
>   UNSPEC_VPDPBUSD))]
>"TARGET_AVXVNNI || (TARGET_AVX512VNNI && TARGET_AVX512VL)"
>"@
>%{vex%} vpdpbusd\t{%3, %2, %0|%0, %2, %3}
>vpdpbusd\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr ("prefix") ("vex,evex"))
> +   (set_attr "addr" "gpr16,*")
> (set_attr ("isa") ("avxvnni,avx512vnnivl"))])
>
>  (define_insn "vpdpbusd__mask"
> @@ -30321,13 +30323,14 @@
> (unspec:VI4_AVX2
>   [(match_operand:VI4_AVX2 1 "register_operand" "0,0")
>(match_operand:VI4_AVX2 2 "register_operand" "x,v")
> -  (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xm,vm")]
> +  (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xjm,vm")]
>   UNSPEC_VPDPBUSDS))]
>"TARGET_AVXVNNI || (TARGET_AVX512VNNI && TARGET_AVX512VL)"
>"@
> %{vex%} vpdpbusds\t{%3, %2, %0|%0, %2, %3}
> vpdpbusds\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr ("prefix") ("vex,evex"))
> +   (set_attr "addr" "gpr16,*")
> (set_attr ("isa") ("avxvnni,avx512vnnivl"))])
>
>  (define_insn "vpdpbusds__mask"
> @@ -30389,13 +30392,14 @@
> (unspec:VI4_AVX2
>   [(match_operand:VI4_AVX2 1 "register_operand" "0,0")
>(match_operand:VI4_AVX2 2 "register_operand" "x,v")
> -  (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xm,vm")]
> +  (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xjm,vm")]
>   UNSPEC_VPDPWSSD))]
>"TARGET_AVXVNNI || (TARGET_AVX512VNNI && TARGET_AVX512VL)"
>"@
>%{vex%} vpdpwssd\t{%3, %2, %0|%0, %2, %3}
>vpdpwssd\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr ("prefix") ("vex,evex"))
> +   (set_attr "addr" "gpr16,*")
> (set_attr ("isa") ("avxvnni,avx512vnnivl"))])
>
>  (define_insn "vpdpwssd__mask"
> @@ -30457,13 +30461,14 @@
> (unspec:VI4_AVX2
>   [(match_operand:VI4_AVX2 1 "register_operand" "0,0")
>(match_operand:VI4_AVX2 2 "register_operand" "x,v")
> -  (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xm,vm")]
> +  (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xjm,vm")]
>   UNSPEC_VPDPWSSDS))]
>"TARGET_AVXVNNI || (TARGET_AVX512VNNI && TARGET_AVX512VL)"
>"@
>%{vex%} vpdpwssds\t{%3, %2, %0|%0, %2, %3}
>vpdpwssds\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr ("prefix") ("vex,evex"))
> +   (set_attr "addr" "gpr16,*")
> (set_attr ("isa") ("avxvnni,avx512vnnivl"))])
>
>  (define_insn "vpdpwssds__mask"
> @@ -30681,13 +30686,14 @@
>[(set (match_operand:V8BF 0 "register_operand" "=x,v")
> (vec_concat:V8BF
>   (float_truncate:V4BF
> -   (match_operand:V4SF 1 "nonimmediate_operand" "xm,vm"))
> +   (match_operand:V4SF 1 "nonimmediate_o

Re: [PATCH 4/4] i386: Optimization for APX NDD is always zero-uppered for shift

2024-08-13 Thread Hongtao Liu
On Mon, Aug 12, 2024 at 3:12 PM kong lingling  wrote:
>
> gcc/ChangeLog:
>
>         PR target/113729
>         * config/i386/i386.md (*ashlqi3_1_zext): New define_insn.
>         (*ashlhi3_1_zext): Ditto.
>         (*qi3_1_zext): Ditto.
>         (*hi3_1_zext): Ditto.
>         (*qi3_1_zext): Ditto.
>         (*hi3_1_zext): Ditto.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr113729.c: Add testcase for shift and rotate.

Ok.



-- 
BR,
Hongtao


Re: [PATCH 3/4] i386: Optimization for APX NDD is always zero-uppered for logic

2024-08-13 Thread Hongtao Liu
On Mon, Aug 12, 2024 at 3:12 PM kong lingling  wrote:
>
> gcc/ChangeLog:
>
>         PR target/113729
>         * config/i386/i386.md (*andqi_1_zext): New define_insn.
>         (*andhi_1_zext): Ditto.
>         (*qi_1_zext): Ditto.
>         (*hi_1_zext): Ditto.
>         (*negqi_1_zext): Ditto.
>         (*neghi_1_zext): Ditto.
>         (*one_cmplqi2_1_zext): Ditto.
>         (*one_cmplhi2_1_zext): Ditto.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr113729.c: Add new test for logic.

Ok.


-- 
BR,
Hongtao


Re: [PATCH 2/4] i386: Optimization for APX NDD is always zero-uppered for sub/adc/sbb

2024-08-13 Thread Hongtao Liu
On Mon, Aug 12, 2024 at 3:12 PM kong lingling  wrote:
>
> gcc/ChangeLog:
>
>         PR target/113729
>         * config/i386/i386.md (*subqi_1_zext): New define_insn.
>         (*subhi_1_zext): Ditto.
>         (*addqi3_carry_zext): Ditto.
>         (*addhi3_carry_zext): Ditto.
>         (*addqi3_carry_zext_0): Ditto.
>         (*addhi3_carry_zext_0): Ditto.
>         (*addqi3_carry_zext_0r): Ditto.
>         (*addhi3_carry_zext_0r): Ditto.
>         (*subqi3_carry_zext): Ditto.
>         (*subhi3_carry_zext): Ditto.
>         (*subqi3_carry_zext_0): Ditto.
>         (*subhi3_carry_zext_0): Ditto.
>         (*subqi3_carry_zext_0r): Ditto.
>         (*subhi3_carry_zext_0r): Ditto.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/pr113729.c: Add test for sub.
>         * gcc.target/i386/pr113729-adc-sbb.c: New test.
>
Ok.


-- 
BR,
Hongtao


Re: [PATCH 1/4] i386: Optimization for APX NDD is always zero-uppered for ADD

2024-08-13 Thread Hongtao Liu
On Mon, Aug 12, 2024 at 3:10 PM kong lingling  wrote:
>
> For an APX instruction with an NDD, the destination GPR will get the
> instruction's result in bits [OSIZE-1:0] and, if OSIZE < 64b, have its upper
> bits [63:OSIZE] zeroed. This patch adds support for the remaining NDD
> instructions.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>
> Ok for trunk?

Ok.


-- 
BR,
Hongtao


Re: [PATCH] Move ix86_align_loops into a separate pass and insert the pass after pass_endbr_and_patchable_area.

2024-08-13 Thread Hongtao Liu
On Mon, Aug 12, 2024 at 10:10 PM liuhongt  wrote:
>
> > Are there any assumptions that BB_HEAD must be a note or label?
> > Maybe we should move ix86_align_loops into a separate pass and insert
> > the pass just before pass_final.
> The patch inserts .p2align after the endbr pass; it also fixes the issue.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Any comments?
Committed
>
> gcc/ChangeLog:
>
> PR target/116174
> * config/i386/i386.cc (ix86_align_loops): Move this to ..
> * config/i386/i386-features.cc (ix86_align_loops): .. here.
> (class pass_align_tight_loops): New class.
> (make_pass_align_tight_loops): New function.
> * config/i386/i386-passes.def: Insert pass_align_tight_loops
> after pass_insert_endbr_and_patchable_area.
> * config/i386/i386-protos.h (make_pass_align_tight_loops): New
> declare.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116174.c: New test.
> ---
>  gcc/config/i386/i386-features.cc | 190 +++
>  gcc/config/i386/i386-passes.def  |   3 +
>  gcc/config/i386/i386-protos.h|   1 +
>  gcc/config/i386/i386.cc  | 146 -
>  gcc/testsuite/gcc.target/i386/pr116174.c |  12 ++
>  5 files changed, 206 insertions(+), 146 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116174.c
>
> diff --git a/gcc/config/i386/i386-features.cc 
> b/gcc/config/i386/i386-features.cc
> index c36d181f2d6..7e80e7b0103 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -3417,6 +3417,196 @@ make_pass_apx_nf_convert (gcc::context *ctxt)
>return new pass_apx_nf_convert (ctxt);
>  }
>
> +/* When a hot loop can be fit into one cacheline,
> +   force align the loop without considering the max skip.  */
> +static void
> +ix86_align_loops ()
> +{
> +  basic_block bb;
> +
> +  /* Don't do this when we don't know cache line size.  */
> +  if (ix86_cost->prefetch_block == 0)
> +return;
> +
> +  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
> +  profile_count count_threshold = cfun->cfg->count_max / 
> param_align_threshold;
> +  FOR_EACH_BB_FN (bb, cfun)
> +{
> +  rtx_insn *label = BB_HEAD (bb);
> +  bool has_fallthru = 0;
> +  edge e;
> +  edge_iterator ei;
> +
> +  if (!LABEL_P (label))
> +   continue;
> +
> +  profile_count fallthru_count = profile_count::zero ();
> +  profile_count branch_count = profile_count::zero ();
> +
> +  FOR_EACH_EDGE (e, ei, bb->preds)
> +   {
> + if (e->flags & EDGE_FALLTHRU)
> +   has_fallthru = 1, fallthru_count += e->count ();
> + else
> +   branch_count += e->count ();
> +   }
> +
> +  if (!fallthru_count.initialized_p () || !branch_count.initialized_p ())
> +   continue;
> +
> +  if (bb->loop_father
> + && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun)
> + && (has_fallthru
> + ? (!(single_succ_p (bb)
> +  && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun))
> +&& optimize_bb_for_speed_p (bb)
> +&& branch_count + fallthru_count > count_threshold
> +&& (branch_count > fallthru_count * 
> param_align_loop_iterations))
> + /* In case there's no fallthru for the loop.
> +Nops inserted won't be executed.  */
> + : (branch_count > count_threshold
> +|| (bb->count > bb->prev_bb->count * 10
> +&& (bb->prev_bb->count
> +<= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2)
> +   {
> + rtx_insn* insn, *end_insn;
> + HOST_WIDE_INT size = 0;
> + bool padding_p = true;
> + basic_block tbb = bb;
> + unsigned cond_branch_num = 0;
> + bool detect_tight_loop_p = false;
> +
> + for (unsigned int i = 0; i != bb->loop_father->num_nodes;
> +  i++, tbb = tbb->next_bb)
> +   {
> + /* Only handle continuous cfg layout. */
> + if (bb->loop_father != tbb->loop_father)
> +   {
> + padding_p = false;
> + break;
> +   }
> +
> + FOR_BB_INSNS (tbb, insn)
> +   {
> + if (!NONDEBUG_INSN_P (insn))
> +   continue;
> + size += ix86_min_insn_size (insn);
> +
> + /* We don't know size of inline asm.
> +Don't align loop for call.  */
> + if (asm_noperands (PATTERN (insn)) >= 0
> + || CALL_P (insn))
> +   {
> + size = -1;
> + break;
> +   }
> +   }
> +
> + if (size == -1 || size > ix86_cost->prefetch_block)
> +   {
> + padding_p = false;
> + break;
> +   }
> 

Re: [PATCH 0/1] Initial support for AVX10.2

2024-08-12 Thread Hongtao Liu
On Thu, Aug 1, 2024 at 3:50 PM Haochen Jiang  wrote:
>
> Hi all,
>
> AVX10.2 tech details have just been published on July 31st in the
> following link:
>
> https://cdrdv2.intel.com/v1/dl/getContent/828965
>
> For new features and instructions, we could divide them into two parts.
> One is ymm rounding control, the other is the new instructions.
>
> In the following weeks, we plan to upstream the ymm rounding part first,
> followed by the new instructions. After all of them are upstreamed, we will
> also upstream several patches optimizing codegen with the new AVX10.2
> instructions.
>
> The patch coming next is the initial support for AVX10.2. This patch
> will be the foundation of all our patches. It adds the support for
> cpuid, option, target attribute, etc.
>
> Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
>


-- 
BR,
Hongtao


Re: PING: [PATCH] x86: Update BB_HEAD when aligning BB_HEAD

2024-08-11 Thread Hongtao Liu
On Mon, Aug 12, 2024 at 6:59 AM H.J. Lu  wrote:
>
> On Thu, Aug 8, 2024 at 6:53 PM H.J. Lu  wrote:
> >
> > When we emit .p2align to align BB_HEAD, we must update BB_HEAD.  Otherwise
> > ENDBR will be inserted at the wrong place.
> >
> > gcc/
> >
> > PR target/116174
> > * config/i386/i386.cc (ix86_align_loops): Update BB_HEAD when
> > aligning BB_HEAD
> >
> > gcc/testsuite/
> >
> > PR target/116174
> > * gcc.target/i386/pr116174.c: New test.
> >
> > Signed-off-by: H.J. Lu 
> > ---
> >  gcc/config/i386/i386.cc  |  7 +--
> >  gcc/testsuite/gcc.target/i386/pr116174.c | 12 
> >  2 files changed, 17 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr116174.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 77c441893b4..ec6cc5e3548 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23528,8 +23528,11 @@ ix86_align_loops ()
> >
> >   if (padding_p && detect_tight_loop_p)
> > {
> > - emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 
> > (size)),
> > -   GEN_INT (0)), label);
> > + rtx_insn *align =
> > +   emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 
> > (size)),
> > + GEN_INT (0)), label);
> > + if (BB_HEAD (bb) == label)
> > +   BB_HEAD (bb) = align;
Are there any assumptions that BB_HEAD must be a note or label?
Maybe we should move ix86_align_loops into a separate pass and insert
the pass just before pass_final.

> >   /* End of function.  */
> >   if (!tbb || tbb == EXIT_BLOCK_PTR_FOR_FN (cfun))
> > break;
> > diff --git a/gcc/testsuite/gcc.target/i386/pr116174.c 
> > b/gcc/testsuite/gcc.target/i386/pr116174.c
> > new file mode 100644
> > index 000..8877d0b51af
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr116174.c
> > @@ -0,0 +1,12 @@
> > +/* { dg-do compile { target *-*-linux* } } */
> > +/* { dg-options "-O2 -fcf-protection=branch" } */
> > +
> > +char *
> > +foo (char *dest, const char *src)
> > +{
> > +  while ((*dest++ = *src++) != '\0')
> > +/* nothing */;
> > +  return --dest;
> > +}
> > +
> > +/* { dg-final { scan-assembler "\t\.cfi_startproc\n\tendbr(32|64)\n" } } */
> > --
> > 2.45.2
> >
>
> PING.
>
> --
> H.J.



-- 
BR,
Hongtao


Re: [PATCH] Fix mismatch between constraint and predicate for ashl3_doubleword.

2024-07-31 Thread Hongtao Liu
On Tue, Jul 30, 2024 at 11:04 AM liuhongt  wrote:
>
> (insn 98 94 387 2 (parallel [
> (set (reg:TI 337 [ _32 ])
> (ashift:TI (reg:TI 329)
> (reg:QI 521)))
> (clobber (reg:CC 17 flags))
> ]) "test.c":11:13 953 {ashlti3_doubleword}
>
> is reloaded into
>
> (insn 98 452 387 2 (parallel [
> (set (reg:TI 0 ax [orig:337 _32 ] [337])
> (ashift:TI (const_int 1671291085 [0x639de0cd])
> (reg:QI 2 cx [521])))
> (clobber (reg:CC 17 flags))
>
> since constraint n in the pattern accepts that.
> (Not sure why reload doesn't check predicate)
>
> (define_insn "ashl3_doubleword"
>   [(set (match_operand:DWI 0 "register_operand" "=&r,&r")
> (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
> (match_operand:QI 2 "nonmemory_operand" "c,c")))
>
> The patch fixes the mismatch between constraint and predicate.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/116096
> * config/i386/constraints.md (Wc): New constraint for integer
> 1 or -1.
> * config/i386/i386.md (ashl3_doubleword): Refine
> constraint with Wc.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116096.c: New test.
> ---
>  gcc/config/i386/constraints.md   |  6 ++
>  gcc/config/i386/i386.md  |  2 +-
>  gcc/testsuite/gcc.target/i386/pr116096.c | 26 
>  3 files changed, 33 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116096.c
>
> diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md
> index 7508d7a58bd..154cbccd09e 100644
> --- a/gcc/config/i386/constraints.md
> +++ b/gcc/config/i386/constraints.md
> @@ -254,6 +254,12 @@ (define_constraint "Wb"
>(and (match_code "const_int")
> (match_test "IN_RANGE (ival, 0, 7)")))
>
> +(define_constraint "Wc"
> +  "Integer constant -1 or 1."
> +  (and (match_code "const_int")
> +   (ior (match_test "op == constm1_rtx")
> +   (match_test "op == const1_rtx"
> +
>  (define_constraint "Ww"
>"Integer constant in the range 0 @dots{} 15, for 16-bit shifts."
>(and (match_code "const_int")
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 6207036a2a0..79d5de5b46a 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -14774,7 +14774,7 @@ (define_insn_and_split "*ashl3_doubleword_mask_1"
>
>  (define_insn "ashl3_doubleword"
>[(set (match_operand:DWI 0 "register_operand" "=&r,&r")
> -   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
> +   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0Wc,r")
> (match_operand:QI 2 "nonmemory_operand" "c,c")))
> (clobber (reg:CC FLAGS_REG))]
>""
> diff --git a/gcc/testsuite/gcc.target/i386/pr116096.c 
> b/gcc/testsuite/gcc.target/i386/pr116096.c
> new file mode 100644
> index 000..5ef39805f58
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116096.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O2 -flive-range-shrinkage -fno-peephole2 -mstackrealign 
> -Wno-psabi" } */
> +
> +typedef char U __attribute__((vector_size (32)));
> +typedef unsigned V __attribute__((vector_size (32)));
> +typedef __int128 W __attribute__((vector_size (32)));
> +U g;
> +
> +W baz ();
> +
> +static inline U
> +bar (V x, W y)
> +{
> +  y = y | y << (W) x;
> +  return (U)y;
> +}
> +
> +void
> +foo (W w)
> +{
> +  g = g <<
> +bar ((V){baz ()[1], 3, 3, 5, 7},
> +(W){w[0], ~(int) 2623676210}) >>
> +bar ((V){baz ()[1]},
> +(W){-w[0], ~(int) 2623676210});
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Fix memory constraint for APX NF

2024-07-31 Thread Hongtao Liu
On Thu, Aug 1, 2024 at 10:03 AM Kong, Lingling  wrote:
>
>
>
> > -Original Message-
> > From: Liu, Hongtao 
> > Sent: Thursday, August 1, 2024 9:35 AM
> > To: Kong, Lingling ; gcc-patches@gcc.gnu.org
> > Cc: Wang, Hongyu 
> > Subject: RE: [PATCH] i386: Fix memory constraint for APX NF
> >
> >
> >
> > > -Original Message-
> > > From: Kong, Lingling 
> > > Sent: Thursday, August 1, 2024 9:30 AM
> > > To: gcc-patches@gcc.gnu.org
> > > Cc: Liu, Hongtao ; Wang, Hongyu
> > > 
> > > Subject: [PATCH] i386: Fix memory constraint for APX NF
> > >
> > > The je constraint should be used for APX NDD ADD with register source
> > > operand. The jM is for APX NDD patterns with immediate operand.
> > But these 2 alternatives are for non-NDD.
> The jM constraint is for the 15-byte size limit when a non-default address
> space is used, and it also applies to APX NF. The je constraint is for TLS
> code with an EVEX prefix for ADD, and APX NF also has the EVEX prefix.
I see. Could you also rename apx_ndd_add_memory_operand and
apx_ndd_memory_operand to apx_evex_add_memory_operand and
apx_evex_memory_operand, and update the comments? That can be a
separate patch.
The patch LGTM.
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > * config/i386/i386.md (nf_mem_constraint): Fixed the constraint
> > > for the define_subst_attr.
> > > (nf_mem_constraint): Added new define_subst_attr.
> > > (*add_1): Fixed the constraint.
> > > ---
> > >  gcc/config/i386/i386.md | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index
> > > fb10fdc9f96..aa7220ee17c 100644
> > > --- a/gcc/config/i386/i386.md
> > > +++ b/gcc/config/i386/i386.md
> > > @@ -6500,7 +6500,8 @@
> > >  (define_subst_attr "nf_name" "nf_subst" "_nf" "")  (define_subst_attr
> > > "nf_prefix" "nf_subst" "%{nf%} " "")  (define_subst_attr "nf_condition"
> > > "nf_subst" "TARGET_APX_NF" "true") -(define_subst_attr
> > > "nf_mem_constraint" "nf_subst" "je" "m")
> > > +(define_subst_attr "nf_add_mem_constraint" "nf_subst" "je" "m")
> > > +(define_subst_attr "nf_mem_constraint" "nf_subst" "jM" "m")
> > >  (define_subst_attr "nf_applied" "nf_subst" "true" "false")
> > > (define_subst_attr "nf_nonf_attr" "nf_subst"  "noapx_nf" "*")
> > > (define_subst_attr "nf_nonf_x64_attr" "nf_subst" "noapx_nf" "x64") @@ -
> > 6514,7 +6515,7 @@
> > > (clobber (reg:CC FLAGS_REG))])
> > >
> > >  (define_insn "*add_1"
> > > -  [(set (match_operand:SWI48 0 "nonimmediate_operand"
> > > "=rm,r,r,r,r,r,r,r")
> > > +  [(set (match_operand:SWI48 0 "nonimmediate_operand"
> > > + "=r,r,r,r,r,r,r,r")
> > > (plus:SWI48
> > >   (match_operand:SWI48 1 "nonimmediate_operand"
> > > "%0,0,0,r,r,rje,jM,r")
> > >   (match_operand:SWI48 2 "x86_64_general_operand"
> > > "r,e,BM,0,le,r,e,BM")))]
> > > --
> > > 2.31.1



-- 
BR,
Hongtao


Re: [PATCH] i386: Mark target option with optimization when enabled with opt level [PR116065]

2024-07-31 Thread Hongtao Liu
On Tue, Jul 30, 2024 at 1:05 PM Hongyu Wang  wrote:
>
> Richard Biener  于2024年7月26日周五 19:45写道:
> >
> > On Fri, Jul 26, 2024 at 10:50 AM Hongyu Wang  wrote:
> > >
> > > Hi,
> > >
> > > When introducing munroll-only-small-loops, the option was marked as
> > > Target Save and added to the -O2 defaults, which makes attribute(optimize)
> > > reset the target options, causing an error when the command line has -O1
> > > and a function attribute has -O2 plus other target options. Mark this
> > > option as Optimization to fix this.
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu.
> > >
> > > Ok for trunk and backport down to gcc-13?
> >
> > Note this requires bumping LTO_minor_version on branches.
> >
>
> Yes, as the aarch64 fix was not backported I'd like to just fix it for trunk.
Ok for trunk only.
>
> > > gcc/ChangeLog
> > >
> > > PR target/116065
> > > * config/i386/i386.opt (munroll-only-small-loops): Mark as
> > > Optimization instead of Save.
> > >
> > > gcc/testsuite/ChangeLog
> > >
> > > PR target/116065
> > > * gcc.target/i386/pr116065.c: New test.
> > > ---
> > >  gcc/config/i386/i386.opt |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr116065.c | 24 
> > >  2 files changed, 25 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr116065.c
> > >
> > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > > index 353fffb2343..52054bc018a 100644
> > > --- a/gcc/config/i386/i386.opt
> > > +++ b/gcc/config/i386/i386.opt
> > > @@ -1259,7 +1259,7 @@ Target Mask(ISA2_RAOINT) Var(ix86_isa_flags2) Save
> > >  Support RAOINT built-in functions and code generation.
> > >
> > >  munroll-only-small-loops
> > > -Target Var(ix86_unroll_only_small_loops) Init(0) Save
> > > +Target Var(ix86_unroll_only_small_loops) Init(0) Optimization
> > >  Enable conservative small loop unrolling.
> > >
> > >  mlam=
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr116065.c 
> > > b/gcc/testsuite/gcc.target/i386/pr116065.c
> > > new file mode 100644
> > > index 000..083e70f2413
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr116065.c
> > > @@ -0,0 +1,24 @@
> > > +/* PR target/116065  */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O1 -mno-avx" } */
> > > +
> > > +#ifndef __AVX__
> > > +#pragma GCC push_options
> > > +#pragma GCC target("avx")
> > > +#define __DISABLE_AVX__
> > > +#endif /* __AVX__ */
> > > +
> > > +extern inline double __attribute__((__gnu_inline__,__always_inline__))
> > > + foo (double x) { return x; }
> > > +
> > > +#ifdef __DISABLE_AVX__
> > > +#undef __DISABLE_AVX__
> > > +#pragma GCC pop_options
> > > +#endif /* __DISABLE_AVX__ */
> > > +
> > > +void __attribute__((target ("avx"), optimize(3)))
> > > +bar (double *p)
> > > +{
> > > +  *p = foo (*p);
> > > +}
> > > +
> > > --
> > > 2.31.1
> > >



-- 
BR,
Hongtao


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Hongtao Liu
On Wed, Jul 31, 2024 at 3:17 PM Uros Bizjak  wrote:
>
> On Wed, Jul 31, 2024 at 9:11 AM Hongtao Liu  wrote:
> >
> > On Wed, Jul 31, 2024 at 1:06 AM Uros Bizjak  wrote:
> > >
> > > On Tue, Jul 30, 2024 at 3:00 PM Richard Biener  wrote:
> > > >
> > > > On Tue, 30 Jul 2024, Alexander Monakov wrote:
> > > >
> > > > >
> > > > > On Tue, 30 Jul 2024, Richard Biener wrote:
> > > > >
> > > > > > > Oh, and please add a small comment why we don't use XFmode here.
> > > > > >
> > > > > > Will do.
> > > > > >
> > > > > > /* Do not enable XFmode, there is padding in it and it 
> > > > > > suffers
> > > > > >from normalization upon load like SFmode and DFmode when
> > > > > >not using SSE.  */
> > > > >
> > > > > Is it really true? I have no evidence of FLDT performing normalization
> > > > > (as mentioned in PR 114659, if it did, there would be no way to 
> > > > > spill/reload
> > > > > x87 registers).
> > > >
> > > > What mangling fld performs depends on the contents of the FP control
> > > > word which is awkward.  IIRC there's at least a bugreport that it
> > > > turns sNaN into a qNaN, it seems I was wrong about denormals
> > > > (when DM is not masked).  And yes, IIRC x87 instability is also
> > > > related to spills (IIRC we spill in the actual mode of the reg, not in
> > > > XFmode), but -fexcess-precision=standard should hopefully avoid that.
> > > > It's also not clear whether all implementations conformed to the
> > > > specs wrt extended-precision format loads.
> > >
> > > FYI, FLDT does not mangle long-double values and does not generate
> > > exceptions. Please see [1], but ignore shadowed text and instead read
> > > the "Floating-Point Exceptions" section. So, as far as hardware is
> > > concerned, it *can* be used to transfer 10-byte values, but I don't
> > > want to judge from the compiler PoV if this is the way to go. We can
> > > enable it, perhaps temporarily to experiment a bit - it is easy to
> > > disable if it causes problems.
> > >
> > > Let's CC Intel folks for their opinion, if it is worth using an aging
> > > x87 to transfer 80-bit data.
> > I prefer not, in another hook ix86_can_change_mode_class, we have
> >
> > 20372  /* x87 registers can't do subreg at all, as all values are
> > reformatted
> > 20373 to extended precision.  */
> > 20374  if (MAYBE_FLOAT_CLASS_P (regclass))
> > 20375return false;
>
> No, the above applies to SFmode subreg of XFmode value, which is a
> no-go. My question refers to the plain XFmode (80-bit) moves, where
> x87 is used simply to:
>
> fldt mem1
> ...
> fstp mem2
>
> where x87 is used to perform a move from one 80-bit location to the other.
>
> > I guess it eventually needs reload for XFmode.
>
> There are no reloads, as we would like to perform bit-exact 80-bit
> move, e.g. array of 10 chars.
Oh, it's a memory copy.
I suspect that the hardware doesn't enable memory renaming for x87 instructions.
So I prefer not.
>
> Uros.



-- 
BR,
Hongtao


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-31 Thread Hongtao Liu
On Wed, Jul 31, 2024 at 1:06 AM Uros Bizjak  wrote:
>
> On Tue, Jul 30, 2024 at 3:00 PM Richard Biener  wrote:
> >
> > On Tue, 30 Jul 2024, Alexander Monakov wrote:
> >
> > >
> > > On Tue, 30 Jul 2024, Richard Biener wrote:
> > >
> > > > > Oh, and please add a small comment why we don't use XFmode here.
> > > >
> > > > Will do.
> > > >
> > > > /* Do not enable XFmode, there is padding in it and it suffers
> > > >from normalization upon load like SFmode and DFmode when
> > > >not using SSE.  */
> > >
> > > Is it really true? I have no evidence of FLDT performing normalization
> > > (as mentioned in PR 114659, if it did, there would be no way to 
> > > spill/reload
> > > x87 registers).
> >
> > What mangling fld performs depends on the contents of the FP control
> > word which is awkward.  IIRC there's at least a bugreport that it
> > turns sNaN into a qNaN, it seems I was wrong about denormals
> > (when DM is not masked).  And yes, IIRC x87 instability is also
> > related to spills (IIRC we spill in the actual mode of the reg, not in
> > XFmode), but -fexcess-precision=standard should hopefully avoid that.
> > It's also not clear whether all implementations conformed to the
> > specs wrt extended-precision format loads.
>
> FYI, FLDT does not mangle long-double values and does not generate
> exceptions. Please see [1], but ignore shadowed text and instead read
> the "Floating-Point Exceptions" section. So, as far as hardware is
> concerned, it *can* be used to transfer 10-byte values, but I don't
> want to judge from the compiler PoV if this is the way to go. We can
> enable it, perhaps temporarily to experiment a bit - it is easy to
> disable if it causes problems.
>
> Let's CC Intel folks for their opinion, if it is worth using an aging
> x87 to transfer 80-bit data.
I prefer not, in another hook ix86_can_change_mode_class, we have

20372  /* x87 registers can't do subreg at all, as all values are
reformatted
20373 to extended precision.  */
20374  if (MAYBE_FLOAT_CLASS_P (regclass))
20375return false;

I guess it eventually needs reload for XFmode.
>
> [1] https://www.felixcloutier.com/x86/fld
>
> Uros.



-- 
BR,
Hongtao


Re: [PATCH] i386: Remove ndd support for *add_4 [PR113744]

2024-07-30 Thread Hongtao Liu
On Wed, Jul 31, 2024 at 2:08 PM Kong, Lingling  wrote:
>
> *add_4 and *adddi_4 are for a shorter opcode, from cmp to inc/dec or
> add $128.
>
> But NDD code is longer than the cmp code, so there is no need to support NDD.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>
> Ok for trunk?
Ok.
>
> gcc/ChangeLog:
>
>         PR target/113744
>         * config/i386/i386.md (*add_4): Remove NDD support.
>         (*adddi_4): Ditto.
>
> Co-Authored-By: Hu, Lin1 lin1...@intel.com
>
> ---
> gcc/config/i386/i386.md | 40 +++-
> 1 file changed, 15 insertions(+), 25 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index fb10fdc9f96..3c293c14656 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -7146,35 +7146,31 @@
> (define_insn "*adddi_4"
>   [(set (reg FLAGS_REG)
>   (compare
> -(match_operand:DI 1 "nonimmediate_operand" "0,rm")
> -(match_operand:DI 2 "x86_64_immediate_operand" "e,e")))
> -   (clobber (match_scratch:DI 0 "=r,r"))]
> +   (match_operand:DI 1 "nonimmediate_operand" "0")
> +   (match_operand:DI 2 "x86_64_immediate_operand" "e")))
> +   (clobber (match_scratch:DI 0 "=r"))]
>   "TARGET_64BIT
> && ix86_match_ccmode (insn, CCGCmode)"
> {
> -  bool use_ndd = get_attr_isa (insn) == ISA_APX_NDD;
>   switch (get_attr_type (insn))
>     {
>     case TYPE_INCDEC:
>       if (operands[2] == constm1_rtx)
> -    return use_ndd ? "inc{q}\t{%1, %0|%0, %1}" : "inc{q}\t%0";
> + return "inc{q}\t%0";
>       else
>     {
> gcc_assert (operands[2] == const1_rtx);
> -    return use_ndd ? "dec{q}\t{%1, %0|%0, %1}" : "dec{q}\t%0";
> +   return "dec{q}\t%0";
>   }
>     default:
>       if (x86_maybe_negate_const_int (&operands[2], DImode))
> -  return use_ndd ? "add{q}\t{%2, %1, %0|%0, %1, %2}"
> -: "add{q}\t{%2, %0|%0, %2}";
> + return "add{q}\t{%2, %0|%0, %2}";
> -  return use_ndd ? "sub{q}\t{%2, %1, %0|%0, %1, %2}"
> -  : "sub{q}\t{%2, %0|%0, %2}";
> +  return "sub{q}\t{%2, %0|%0, %2}";
>     }
> }
> -  [(set_attr "isa" "*,apx_ndd")
> -   (set (attr "type")
> +  [(set (attr "type")
>   (if_then_else (match_operand:DI 2 "incdec_operand")
>   (const_string "incdec")
>   (const_string "alu")))
> @@ -7195,36 +7191,30 @@
> (define_insn "*add_4"
>   [(set (reg FLAGS_REG)
>   (compare
> -(match_operand:SWI124 1 "nonimmediate_operand" "0,rm")
> +   (match_operand:SWI124 1 "nonimmediate_operand" "0")
> (match_operand:SWI124 2 "const_int_operand")))
> -   (clobber (match_scratch:SWI124 0 "=,r"))]
> +   (clobber (match_scratch:SWI124 0 "="))]
>   "ix86_match_ccmode (insn, CCGCmode)"
> {
> -  bool use_ndd = get_attr_isa (insn) == ISA_APX_NDD;
>   switch (get_attr_type (insn))
>     {
>     case TYPE_INCDEC:
>       if (operands[2] == constm1_rtx)
> -    return use_ndd ? "inc{}\t{%1, %0|%0, %1}"
> -: "inc{}\t%0";
> +    return "inc{}\t%0";
>       else
>     {
> gcc_assert (operands[2] == const1_rtx);
> -    return use_ndd ? "dec{}\t{%1, %0|%0, %1}"
> -: "dec{}\t%0";
> +   return "dec{}\t%0";
>   }
>     default:
>       if (x86_maybe_negate_const_int (&operands[2], mode))
> -  return use_ndd ? "add{}\t{%2, %1, %0|%0, %1, %2}"
> -: "add{}\t{%2, %0|%0, %2}";
> + return "add{}\t{%2, %0|%0, %2}";
> -  return use_ndd ? "sub{}\t{%2, %1, %0|%0, %1, %2}"
> -  : "sub{}\t{%2, %0|%0, %2}";
> +  return "sub{}\t{%2, %0|%0, %2}";
>     }
> }
> -  [(set_attr "isa" "*,apx_ndd")
> -   (set (attr "type")
> +  [(set (attr "type")
>   (if_then_else (match_operand: 2 "incdec_operand")
>   (const_string "incdec")
>   (const_string "alu")))
> --
> 2.31.1



-- 
BR,
Hongtao


Re: [PATCH v2] i386: Add non-optimize prefetchi intrins

2024-07-29 Thread Hongtao Liu
On Tue, Jul 30, 2024 at 9:27 AM Hongtao Liu  wrote:
>
> On Fri, Jul 26, 2024 at 4:55 PM Haochen Jiang  wrote:
> >
> > Hi all,
> >
> > I added related O0 testcase in this patch.
> >
> > Ok for trunk and backport to GCC 14 and GCC 13?
> Ok.
I mean for trunk, and it needs Jakub's approval to backport to GCC 14.2.
> >
> > Thx,
> > Haochen
> >
> > ---
> >
> > Changes in v2: Add testcases.
> >
> > ---
> >
> > Under -O0, with the "newly" introduced intrins, the variable will be
> > transformed into a mem instead of the original symbol_ref. The compiler
> > will then treat the operand as invalid and turn the operation into a nop,
> > which is not expected. Use a macro for the non-optimized case to keep the
> > variable a symbol_ref, just as the prefetch intrinsic does.
> >
> > gcc/ChangeLog:
> >
> > * config/i386/prfchiintrin.h
> > (_m_prefetchit0): Add macro for non-optimized option.
> > (_m_prefetchit1): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/prefetchi-1b.c: New test.
> > ---
> >  gcc/config/i386/prfchiintrin.h   |  9 +++
> >  gcc/testsuite/gcc.target/i386/prefetchi-1b.c | 26 
> >  2 files changed, 35 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/prefetchi-1b.c
> >
> > diff --git a/gcc/config/i386/prfchiintrin.h b/gcc/config/i386/prfchiintrin.h
> > index dfca89c7d16..d6580e504c0 100644
> > --- a/gcc/config/i386/prfchiintrin.h
> > +++ b/gcc/config/i386/prfchiintrin.h
> > @@ -37,6 +37,7 @@
> >  #define __DISABLE_PREFETCHI__
> >  #endif /* __PREFETCHI__ */
> >
> > +#ifdef __OPTIMIZE__
> >  extern __inline void
> >  __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> >  _m_prefetchit0 (void* __P)
> > @@ -50,6 +51,14 @@ _m_prefetchit1 (void* __P)
> >  {
> >__builtin_ia32_prefetchi (__P, 2);
> >  }
> > +#else
> > +#define _m_prefetchit0(P)  \
> > +  __builtin_ia32_prefetchi(P, 3);
> > +
> > +#define _m_prefetchit1(P)  \
> > +  __builtin_ia32_prefetchi(P, 2);
> > +
> > +#endif
> >
> >  #ifdef __DISABLE_PREFETCHI__
> >  #undef __DISABLE_PREFETCHI__
> > diff --git a/gcc/testsuite/gcc.target/i386/prefetchi-1b.c 
> > b/gcc/testsuite/gcc.target/i386/prefetchi-1b.c
> > new file mode 100644
> > index 000..93139554d3c
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/prefetchi-1b.c
> > @@ -0,0 +1,26 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-mprefetchi -O0" } */
> > +/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit0\[ 
> > \\t\]+bar\\(%rip\\)" 1 } } */
> > +/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit1\[ 
> > \\t\]+bar\\(%rip\\)" 1 } } */
> > +
> > +#include 
> > +
> > +int
> > +bar (int a)
> > +{
> > +  return a + 1;
> > +}
> > +
> > +int
> > +foo1 (int b)
> > +{
> > +  _m_prefetchit0 (bar);
> > +  return bar (b) + 1;
> > +}
> > +
> > +int
> > +foo2 (int b)
> > +{
> > +  _m_prefetchit1 (bar);
> > +  return bar (b) + 1;
> > +}
> > --
> > 2.31.1
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] [x86]Refine constraint "Bk" to define_special_memory_constraint.

2024-07-28 Thread Hongtao Liu
On Thu, Jul 25, 2024 at 3:23 PM Hongtao Liu  wrote:
>
> On Wed, Jul 24, 2024 at 3:57 PM liuhongt  wrote:
> >
> > For the pattern below, RA may still allocate r162 as a v/k register and
> > try to reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi,
> > which results in a linker error.
> >
> > (set (reg:DI 162)
> >  (mem/u/c:DI
> >(const:DI (unspec:DI
> >  [(symbol_ref:DI ("a") [flags 0x60] <var_decl 0x7f621f6e1c60 a>)]
> >  UNSPEC_GOTNTPOFF))
> >
> > Quote from H.J. on why the linker issues an error.
> > >What do these do:
> > >
> > >leaq__libc_tsd_CTYPE_B@gottpoff(%rip), %rax
> > >vmovq   (%rax), %xmm0
> > >
> > >From x86-64 TLS psABI:
> > >
> > >The assembler generates for the x@gottpoff(%rip) expressions an
> > >R_X86_64_GOTTPOFF relocation for the symbol x which requests the linker
> > >to generate a GOT entry with an R_X86_64_TPOFF64 relocation. The offset
> > >of the GOT entry relative to the end of the instruction is then used in
> > >the instruction. The R_X86_64_TPOFF64 relocation is processed at
> > >program startup time by the dynamic linker by looking up the symbol x
> > >in the modules loaded at that point. The offset is written in the GOT
> > >entry and later loaded by the addq instruction.
> > >
> > >The above code sequence looks wrong to me.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk and backport?
Committed and will backport after gcc14.2 is released.
> >
> > gcc/ChangeLog:
> >
> > PR target/116043
> > * config/i386/constraints.md (Bk): Refine to
> > define_special_memory_constraint.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr116043.c: New test.
> > ---
> >  gcc/config/i386/constraints.md   |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr116043.c | 33 
> >  2 files changed, 34 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr116043.c
> >
> > diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md
> > index 7508d7a58bd..b760e7c221a 100644
> > --- a/gcc/config/i386/constraints.md
> > +++ b/gcc/config/i386/constraints.md
> > @@ -187,7 +187,7 @@ (define_special_memory_constraint "Bm"
> >"@internal Vector memory operand."
> >(match_operand 0 "vector_memory_operand"))
> >
> > -(define_memory_constraint "Bk"
> > +(define_special_memory_constraint "Bk"
> >"@internal TLS address that allows insn using non-integer registers."
> >(and (match_operand 0 "memory_operand")
> > (not (match_test "ix86_gpr_tls_address_pattern_p (op)"
> > diff --git a/gcc/testsuite/gcc.target/i386/pr116043.c 
> > b/gcc/testsuite/gcc.target/i386/pr116043.c
> > new file mode 100644
> > index 000..76553496c10
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr116043.c
> > @@ -0,0 +1,33 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx512bf16 -O3" } */
> > +/* { dg-final { scan-assembler-not {(?n)lea.*@gottpoff} } } */
> > +
> > +extern __thread int a, c, i, j, k, l;
> > +int *b;
> > +struct d {
> > +  int e;
> > +} f, g;
> > +char *h;
> > +
> > +void m(struct d *n) {
> > +  b = &k;
> > +  for (; n->e; b++, n--) {
> > +i = b && a;
> > +if (i)
> > +  j = c;
> > +  }
> > +}
> > +
> > +char *o(struct d *n) {
> > +  for (; n->e;)
> > +return h;
> > +}
> > +
> > +int q() {
> > +  if (l)
> > +return 1;
> > +  int p = *o(&g);
> > +  m(&f);
> > +  m(&g);
> > +  l = p;
> > +}
> > --
> > 2.31.1
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] Fix mismatch between constraint and predicate for ashl<mode>3_doubleword.

2024-07-26 Thread Hongtao Liu
On Fri, Jul 26, 2024 at 2:59 PM liuhongt  wrote:
>
> (insn 98 94 387 2 (parallel [
> (set (reg:TI 337 [ _32 ])
> (ashift:TI (reg:TI 329)
> (reg:QI 521)))
> (clobber (reg:CC 17 flags))
> ]) "test.c":11:13 953 {ashlti3_doubleword}
>
> is reloaded into
>
> (insn 98 452 387 2 (parallel [
> (set (reg:TI 0 ax [orig:337 _32 ] [337])
> (ashift:TI (const_int 1671291085 [0x639de0cd])
> (reg:QI 2 cx [521])))
> (clobber (reg:CC 17 flags))
>
> since constraint n in the pattern accepts that.
> (Not sure why reload doesn't check predicate)
>
> (define_insn "ashl<mode>3_doubleword"
>   [(set (match_operand:DWI 0 "register_operand" "=&r,&r")
> (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
> (match_operand:QI 2 "nonmemory_operand" "c,c")))
>
> The patch fixes the mismatch between constraint and predicate.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
Please ignore this, I need to support 1 in the constraint.
>
> gcc/ChangeLog:
>
> PR target/116096
> * config/i386/constraints.md (BC): Move TARGET_SSE to
> vector_all_ones_operand.
> * config/i386/i386.md (ashl<mode>3_doubleword): Refine
> constraint with BC.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr116096.c: New test.
> ---
>  gcc/config/i386/constraints.md   |  4 ++--
>  gcc/config/i386/i386.md  |  2 +-
>  gcc/testsuite/gcc.target/i386/pr116096.c | 26 
>  3 files changed, 29 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr116096.c
>
> diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md
> index 7508d7a58bd..fd032c2b9f0 100644
> --- a/gcc/config/i386/constraints.md
> +++ b/gcc/config/i386/constraints.md
> @@ -225,8 +225,8 @@ (define_constraint "Bz"
>
>  (define_constraint "BC"
>"@internal integer SSE constant with all bits set operand."
> -  (and (match_test "TARGET_SSE")
> -   (ior (match_test "op == constm1_rtx")
> +  (ior (match_test "op == constm1_rtx")
> +   (and (match_test "TARGET_SSE")
> (match_operand 0 "vector_all_ones_operand"
>
>  (define_constraint "BF"
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 6207036a2a0..9c4e847fba1 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -14774,7 +14774,7 @@ (define_insn_and_split "*ashl<mode>3_doubleword_mask_1"
>
>  (define_insn "ashl<mode>3_doubleword"
>[(set (match_operand:DWI 0 "register_operand" "=&r,&r")
> -   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
> +   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0BC,r")
> (match_operand:QI 2 "nonmemory_operand" "c,c")))
> (clobber (reg:CC FLAGS_REG))]
>""
> diff --git a/gcc/testsuite/gcc.target/i386/pr116096.c 
> b/gcc/testsuite/gcc.target/i386/pr116096.c
> new file mode 100644
> index 000..5ef39805f58
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116096.c
> @@ -0,0 +1,26 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O2 -flive-range-shrinkage -fno-peephole2 -mstackrealign 
> -Wno-psabi" } */
> +
> +typedef char U __attribute__((vector_size (32)));
> +typedef unsigned V __attribute__((vector_size (32)));
> +typedef __int128 W __attribute__((vector_size (32)));
> +U g;
> +
> +W baz ();
> +
> +static inline U
> +bar (V x, W y)
> +{
> +  y = y | y << (W) x;
> +  return (U)y;
> +}
> +
> +void
> +foo (W w)
> +{
> +  g = g <<
> +bar ((V){baz ()[1], 3, 3, 5, 7},
> +(W){w[0], ~(int) 2623676210}) >>
> +bar ((V){baz ()[1]},
> +(W){-w[0], ~(int) 2623676210});
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH Ping] i386: Use BLKmode for {ld,st}tilecfg

2024-07-25 Thread Hongtao Liu
On Fri, Jul 26, 2024 at 2:28 PM Jiang, Haochen  wrote:
>
> Ping for this patch
>
> Thx,
> Haochen
>
> > -Original Message-
> > From: Haochen Jiang 
> > Sent: Thursday, July 18, 2024 9:45 AM
> > To: gcc-patches@gcc.gnu.org
> > Cc: Liu, Hongtao ; hjl.to...@gmail.com;
> > ubiz...@gmail.com
> > Subject: [PATCH] i386: Use BLKmode for {ld,st}tilecfg
> >
> > Hi all,
> >
> > For AMX instructions related to memory, we will treat the memory
> > size as not specified, since there won't be different sizes causing
> > confusion for memory.
> >
> > This will change the output under Intel mode, which is currently broken
> > when used with the assembler, and aligns with current binutils behavior.
> >
> > Bootstrapped and regtested on x86-64-pc-linux-gnu. Ok for trunk?
Ok.
> >
> > Thx,
> > Haochen
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386-expand.cc (ix86_expand_builtin): Change
> >   from XImode to BLKmode.
> >   * config/i386/i386.md (ldtilecfg): Change XI to BLK.
> >   (sttilecfg): Ditto.
> > ---
> >  gcc/config/i386/i386-expand.cc |  2 +-
> >  gcc/config/i386/i386.md| 12 +---
> >  2 files changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index 9a31e6df2aa..d9ad06264aa 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -14198,7 +14198,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx
> > subtarget,
> > op0 = convert_memory_address (Pmode, op0);
> > op0 = copy_addr_to_reg (op0);
> >   }
> > -  op0 = gen_rtx_MEM (XImode, op0);
> > +  op0 = gen_rtx_MEM (BLKmode, op0);
> >if (fcode == IX86_BUILTIN_LDTILECFG)
> >   icode = CODE_FOR_ldtilecfg;
> >else
> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index de9f4ba0496..86989d4875a 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -28975,24 +28975,22 @@
> > (set_attr "type" "other")])
> >
> >  (define_insn "ldtilecfg"
> > -  [(unspec_volatile [(match_operand:XI 0 "memory_operand" "m")]
> > +  [(unspec_volatile [(match_operand:BLK 0 "memory_operand" "m")]
> >  UNSPECV_LDTILECFG)]
> >"TARGET_AMX_TILE"
> >"ldtilecfg\t%0"
> >[(set_attr "type" "other")
> > (set_attr "prefix" "maybe_evex")
> > -   (set_attr "memory" "load")
> > -   (set_attr "mode" "XI")])
> > +   (set_attr "memory" "load")])
> >
> >  (define_insn "sttilecfg"
> > -  [(set (match_operand:XI 0 "memory_operand" "=m")
> > -(unspec_volatile:XI [(const_int 0)] UNSPECV_STTILECFG))]
> > +  [(set (match_operand:BLK 0 "memory_operand" "=m")
> > +(unspec_volatile:BLK [(const_int 0)] UNSPECV_STTILECFG))]
> >"TARGET_AMX_TILE"
> >"sttilecfg\t%0"
> >[(set_attr "type" "other")
> > (set_attr "prefix" "maybe_evex")
> > -   (set_attr "memory" "store")
> > -   (set_attr "mode" "XI")])
> > +   (set_attr "memory" "store")])
> >
> >  (include "mmx.md")
> >  (include "sse.md")
> > --
> > 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Adjust rtx cost for imulq and imulw [PR115749]

2024-07-24 Thread Hongtao Liu
On Wed, Jul 24, 2024 at 3:11 PM Kong, Lingling  wrote:
>
> Tested spec2017 performance on Sierra Forest, Ice Lake and Cascade Lake; at
> least there is no obvious regression.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
>
> OK for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/x86-tune-costs.h (struct processor_costs):
> Adjust rtx_cost of imulq and imulw from COSTS_N_INSNS (4)
> to COSTS_N_INSNS (3).
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr115749.c: New test.
> ---
>  gcc/config/i386/x86-tune-costs.h | 16 
>  gcc/testsuite/gcc.target/i386/pr115749.c | 16 
>  2 files changed, 24 insertions(+), 8 deletions(-)  create mode 100644 
> gcc/testsuite/gcc.target/i386/pr115749.c
>
> diff --git a/gcc/config/i386/x86-tune-costs.h 
> b/gcc/config/i386/x86-tune-costs.h
> index 769f334e531..2bfaee554d5 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -2182,7 +2182,7 @@ struct processor_costs skylake_cost = {
>COSTS_N_INSNS (1),   /* variable shift costs */
>COSTS_N_INSNS (1),   /* constant shift costs */
>{COSTS_N_INSNS (3),  /* cost of starting multiply for QI */
> -   COSTS_N_INSNS (4),  /*   HI */
> +   COSTS_N_INSNS (3),  /*   HI */
> COSTS_N_INSNS (3),  /*   SI */
> COSTS_N_INSNS (3),  /*   DI */
> COSTS_N_INSNS (3)}, /*other */
> @@ -2310,7 +2310,7 @@ struct processor_costs icelake_cost = {
>COSTS_N_INSNS (1),   /* variable shift costs */
>COSTS_N_INSNS (1),   /* constant shift costs */
>{COSTS_N_INSNS (3),  /* cost of starting multiply for QI */
> -   COSTS_N_INSNS (4),  /*   HI */
> +   COSTS_N_INSNS (3),  /*   HI */
> COSTS_N_INSNS (3),  /*   SI */
> COSTS_N_INSNS (3),  /*   DI */
> COSTS_N_INSNS (3)}, /*other */
> @@ -2434,9 +2434,9 @@ struct processor_costs alderlake_cost = {
>COSTS_N_INSNS (1),   /* variable shift costs */
>COSTS_N_INSNS (1),   /* constant shift costs */
>{COSTS_N_INSNS (3),  /* cost of starting multiply for QI */
> -   COSTS_N_INSNS (4),  /*   HI */
> +   COSTS_N_INSNS (3),  /*   HI */
> COSTS_N_INSNS (3),  /*   SI */
> -   COSTS_N_INSNS (4),  /*   DI */
> +   COSTS_N_INSNS (3),  /*   DI */
> COSTS_N_INSNS (4)}, /*other */
>0,   /* cost of multiply per each bit set 
> */
>{COSTS_N_INSNS (16), /* cost of a divide/mod for QI */
> @@ -3234,9 +3234,9 @@ struct processor_costs tremont_cost = {
>COSTS_N_INSNS (1),   /* variable shift costs */
>COSTS_N_INSNS (1),   /* constant shift costs */
>{COSTS_N_INSNS (3),  /* cost of starting multiply for QI */
> -   COSTS_N_INSNS (4),  /*   HI */
> +   COSTS_N_INSNS (3),  /*   HI */
> COSTS_N_INSNS (3),  /*   SI */
> -   COSTS_N_INSNS (4),  /*   DI */
> +   COSTS_N_INSNS (3),  /*   DI */
> COSTS_N_INSNS (4)}, /*other */
>0,   /* cost of multiply per each bit set 
> */
>{COSTS_N_INSNS (16), /* cost of a divide/mod for QI */
> @@ -3816,9 +3816,9 @@ struct processor_costs generic_cost = {
>COSTS_N_INSNS (1),   /* variable shift costs */
>COSTS_N_INSNS (1),   /* constant shift costs */
>{COSTS_N_INSNS (3),  /* cost of starting multiply for QI */
> -   COSTS_N_INSNS (4),  /*   HI */
> +   COSTS_N_INSNS (3),  /*   HI */
> COSTS_N_INSNS (3),  /*   SI */
> -   COSTS_N_INSNS (4),  /*   DI */
> +   COSTS_N_INSNS (3),  /*   DI */
> COSTS_N_INSNS (4)}, /*other */
>0

Re: [PATCH] x86: Don't enable APX_F in 32-bit mode.

2024-07-22 Thread Hongtao Liu
On Thu, Jul 18, 2024 at 5:29 PM Kong, Lingling  wrote:
>
> I adjusted my patch based on the comments by H.J.
> And I will add a testcase like gcc.target/i386/pr101395-1.c when the -march
> for APX is determined.
>
> Ok for trunk?
Synced with LLVM folks, they agreed to this solution.
Ok.
>
> Thanks,
> Lingling
>
> gcc/ChangeLog:
>
> PR target/115978
> * config/i386/driver-i386.cc (host_detect_local_cpu): Enable
> APX_F only for 64-bit codegen.
> * config/i386/i386-options.cc (DEF_PTA): Skip PTA_APX_F if
> not in 64-bit mode.
>
> gcc/testsuite/ChangeLog:
>
> PR target/115978
> * gcc.target/i386/pr115978-1.c: New test.
> * gcc.target/i386/pr115978-2.c: Ditto.
> ---
>  gcc/config/i386/driver-i386.cc |  3 ++-
>  gcc/config/i386/i386-options.cc|  3 ++-
>  gcc/testsuite/gcc.target/i386/pr115978-1.c | 22 ++
>  gcc/testsuite/gcc.target/i386/pr115978-2.c |  6 ++
>  4 files changed, 32 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr115978-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr115978-2.c
>
> diff --git a/gcc/config/i386/driver-i386.cc b/gcc/config/i386/driver-i386.cc 
> index 11470eaea12..445f5640155 100644
> --- a/gcc/config/i386/driver-i386.cc
> +++ b/gcc/config/i386/driver-i386.cc
> @@ -900,7 +900,8 @@ const char *host_detect_local_cpu (int argc, const char 
> **argv)
> if (has_feature (isa_names_table[i].feature))
>   {
> if (codegen_x86_64
> -   || isa_names_table[i].feature != FEATURE_UINTR)
> +   || (isa_names_table[i].feature != FEATURE_UINTR
> +   && isa_names_table[i].feature != FEATURE_APX_F))
>   options = concat (options, " ",
> isa_names_table[i].option, NULL);
>   }
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 059ef3ae6ad..1c8f7835af2 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -2351,7 +2351,8 @@ ix86_option_override_internal (bool main_args_p,
>  #define DEF_PTA(NAME) \
> #define DEF_PTA(NAME) \
> if (((processor_alias_table[i].flags & PTA_ ## NAME) != 0) \
> && PTA_ ## NAME != PTA_64BIT \
> -   && (TARGET_64BIT || PTA_ ## NAME != PTA_UINTR) \
> +   && (TARGET_64BIT || (PTA_ ## NAME != PTA_UINTR \
> +&& PTA_ ## NAME != PTA_APX_F))\
> && !TARGET_EXPLICIT_ ## NAME ## _P (opts)) \
>   SET_TARGET_ ## NAME (opts);
>  #include "i386-isa.def"
> diff --git a/gcc/testsuite/gcc.target/i386/pr115978-1.c 
> b/gcc/testsuite/gcc.target/i386/pr115978-1.c
> new file mode 100644
> index 000..18a1c5f153a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr115978-1.c
> @@ -0,0 +1,22 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -march=native" } */
> +
> +int
> +main ()
> +{
> +  if (__builtin_cpu_supports ("apxf"))
> +{
> +#ifdef __x86_64__
> +# ifndef __APX_F__
> +  __builtin_abort ();
> +# endif
> +#else
> +# ifdef __APX_F__
> +  __builtin_abort ();
> +# endif
> +#endif
> +  return 0;
> +}
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr115978-2.c 
> b/gcc/testsuite/gcc.target/i386/pr115978-2.c
> new file mode 100644
> index 000..900d6eb096a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr115978-2.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=native -mno-apxf" } */
> +
> +#ifdef __APX_F__
> +# error APX_F should be disabled
> +#endif
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386, testsuite: Fix non-Unicode character

2024-07-16 Thread Hongtao Liu
On Mon, Jul 15, 2024 at 7:24 PM Paul-Antoine Arras  wrote:
>
> This trivially fixes an incorrectly encoded character in the DejaGnu
> scan pattern.
>
> OK for trunk?
Ok.
> --
> PA



-- 
BR,
Hongtao


Re: [PATCH] i386: extend trunc{128}2{16,32,64}'s scope.

2024-07-14 Thread Hongtao Liu
On Mon, Jul 15, 2024 at 1:39 PM Hu, Lin1  wrote:
>
> Hi, all
>
> Based on actual usage, trunc{128}2{16,32,64} use some instructions from
> sse/ssse3, so relax their conditions to widen the scope of the optimization.
>
> Bootstrapped and regtested on x86-64-linux-gnu. OK for trunk?
Ok.
>
> BRs,
> Lin
>
> gcc/ChangeLog:
>
> PR target/107432
> * config/i386/sse.md
> (PMOV_SRC_MODE_3_AVX2): Add TARGET_AVX2 for V4DI and V8SI.
> (PMOV_SRC_MODE_4): Add TARGET_AVX2 for V4DI.
> (trunc<mode><pmov_dst_3_lower>2): Change constraint from TARGET_AVX2
> to TARGET_SSSE3.
> (trunc<mode><pmov_dst_4_lower>2): Ditto.
> (truncv2div2si2): Change constraint from TARGET_AVX2 to TARGET_SSE.
>
> gcc/testsuite/ChangeLog:
>
> PR target/107432
> * gcc.target/i386/pr107432-10.c: New test.
> ---
>  gcc/config/i386/sse.md  | 11 +++---
>  gcc/testsuite/gcc.target/i386/pr107432-10.c | 41 +
>  2 files changed, 47 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-10.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index b3b4697924b..72f3c7df297 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -15000,7 +15000,8 @@ (define_expand 
> "_2_mask_store"
>"TARGET_AVX512VL")
>
>  (define_mode_iterator PMOV_SRC_MODE_3 [V4DI V2DI V8SI V4SI (V8HI 
> "TARGET_AVX512BW")])
> -(define_mode_iterator PMOV_SRC_MODE_3_AVX2 [V4DI V2DI V8SI V4SI V8HI])
> +(define_mode_iterator PMOV_SRC_MODE_3_AVX2
> + [(V4DI "TARGET_AVX2") V2DI (V8SI "TARGET_AVX2") V4SI V8HI])
>  (define_mode_attr pmov_dst_3_lower
>[(V4DI "v4qi") (V2DI "v2qi") (V8SI "v8qi") (V4SI "v4qi") (V8HI "v8qi")])
>  (define_mode_attr pmov_dst_3
> @@ -15014,7 +15015,7 @@ (define_expand "trunc<mode><pmov_dst_3_lower>2"
>   [(set (match_operand:<pmov_dst_3> 0 "register_operand")
> (truncate:<pmov_dst_3>
>   (match_operand:PMOV_SRC_MODE_3_AVX2 1 "register_operand")))]
> -  "TARGET_AVX2"
> +  "TARGET_SSSE3"
>  {
>if (TARGET_AVX512VL
>&& (<MODE>mode != V8HImode || TARGET_AVX512BW))
> @@ -15390,7 +15391,7 @@ (define_insn_and_split 
> "avx512vl_v8qi2_mask_store_2"
>   (match_dup 2)))]
>"operands[0] = adjust_address_nv (operands[0], V8QImode, 0);")
>
> -(define_mode_iterator PMOV_SRC_MODE_4 [V4DI V2DI V4SI])
> +(define_mode_iterator PMOV_SRC_MODE_4 [(V4DI "TARGET_AVX2") V2DI V4SI])
>  (define_mode_attr pmov_dst_4
>[(V4DI "V4HI") (V2DI "V2HI") (V4SI "V4HI")])
>  (define_mode_attr pmov_dst_4_lower
> @@ -15404,7 +15405,7 @@ (define_expand "trunc<mode><pmov_dst_4_lower>2"
>   [(set (match_operand:<pmov_dst_4> 0 "register_operand")
> (truncate:<pmov_dst_4>
>   (match_operand:PMOV_SRC_MODE_4 1 "register_operand")))]
> -  "TARGET_AVX2"
> +  "TARGET_SSSE3"
>  {
>if (TARGET_AVX512VL)
>  {
> @@ -15659,7 +15660,7 @@ (define_expand "truncv2div2si2"
>[(set (match_operand:V2SI 0 "register_operand")
> (truncate:V2SI
>   (match_operand:V2DI 1 "register_operand")))]
> -  "TARGET_AVX2"
> +  "TARGET_SSE"
>  {
>if (TARGET_AVX512VL)
>  {
> diff --git a/gcc/testsuite/gcc.target/i386/pr107432-10.c 
> b/gcc/testsuite/gcc.target/i386/pr107432-10.c
> new file mode 100644
> index 000..57edf7cfc78
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr107432-10.c
> @@ -0,0 +1,41 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=x86-64-v2 -O2" } */
> +/* { dg-final { scan-assembler-times "shufps" 1 } } */
> +/* { dg-final { scan-assembler-times "pshufb" 5 } } */
> +
> +#include 
> +
> +typedef short __v2hi __attribute__ ((__vector_size__ (4)));
> +typedef char __v2qi __attribute__ ((__vector_size__ (2)));
> +typedef char __v4qi __attribute__ ((__vector_size__ (4)));
> +typedef char __v8qi __attribute__ ((__vector_size__ (8)));
> +
> +__v2si mm_cvtepi64_epi32_builtin_convertvector(__v2di a)
> +{
> +  return __builtin_convertvector((__v2di)a, __v2si);
> +}
> +
> +__v2hi mm_cvtepi64_epi16_builtin_convertvector(__m128i a)
> +{
> +  return __builtin_convertvector((__v2di)a, __v2hi);
> +}
> +
> +__v4hi mm_cvtepi32_epi16_builtin_convertvector(__m128i a)
> +{
> +  return __builtin_convertvector((__v4si)a, __v4hi);
> +}
> +
> +__v2qi mm_cvtepi64_epi8_builtin_convertvector(__m128i a)
> +{
> +  return __builtin_convertvector((__v2di)a, __v2qi);
> +}
> +
> +__v4qi mm_cvtepi32_epi8_builtin_convertvector(__m128i a)
> +{
> +  return __builtin_convertvector((__v4si)a, __v4qi);
> +}
> +
> +__v8qi mm_cvtepi16_epi8_builtin_convertvector(__m128i a)
> +{
> +  return __builtin_convertvector((__v8hi)a, __v8qi);
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [i386] adjust flag_omit_frame_pointer in a single function [PR113719] (was: Re: [PATCH] [i386] restore recompute to override opts after change [PR113719])

2024-07-14 Thread Hongtao Liu
On Thu, Jul 11, 2024 at 9:07 PM Alexandre Oliva  wrote:
>
> On Jul  4, 2024, Alexandre Oliva  wrote:
>
> > On Jul  3, 2024, Rainer Orth  wrote:
>
> > Hmm, I wonder if leaf frame pointer has to do with that.
>
> It did, in a way.
>
> 
>
> The first two patches for PR113719 have each regressed
> gcc.dg/ipa/iinline-attr.c on a different target.  The reason for this
> instability is that there are competing flag_omit_frame_pointer
> overriders on x86:
>
> - ix86_recompute_optlev_based_flags computes and sets a
>   -f[no-]omit-frame-pointer default depending on
>   USE_IX86_FRAME_POINTER and, in 32-bit mode, optimize_size
>
> - ix86_option_override_internal enables flag_omit_frame_pointer for
>   -momit-leaf-frame-pointer to take effect
>
> ix86_option_override[_internal] calls
> ix86_recompute_optlev_based_flags before setting
> flag_omit_frame_pointer.  It is called during global process_options.
>
> But ix86_recompute_optlev_based_flags is also called by
> parse_optimize_options, during attribute processing, and at that
> point, ix86_option_override is not called, so the final overrider for
> global options is not applied to the optimize attributes.  If they
> differ, the testcase fails.
>
> In order to fix this, we need to process all overriders of this option
> whenever we process any of them.  Since this setting is affected by
> optimization options, it makes sense to compute it in
> parse_optimize_options, rather than in process_options.
>
> Regstrapped on x86_64-linux-gnu.  Also verified that the regression is
> cured with a i686-solaris cross compiler.  Ok to install?
Ok. thanks.
>
>
> for  gcc/ChangeLog
>
> PR target/113719
> * config/i386/i386-options.cc (ix86_option_override_internal):
> Move flag_omit_frame_pointer final overrider...
> (ix86_recompute_optlev_based_flags): ... here.
> ---
>  gcc/config/i386/i386-options.cc |   12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 5824c0cb072eb..059ef3ae6ad44 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1911,6 +1911,12 @@ ix86_recompute_optlev_based_flags (struct gcc_options *opts,
> opts->x_flag_pcc_struct_return = DEFAULT_PCC_STRUCT_RETURN;
> }
>  }
> +
> +  /* Keep nonleaf frame pointers.  */
> +  if (opts->x_flag_omit_frame_pointer)
> +opts->x_target_flags &= ~MASK_OMIT_LEAF_FRAME_POINTER;
> +  else if (TARGET_OMIT_LEAF_FRAME_POINTER_P (opts->x_target_flags))
> +opts->x_flag_omit_frame_pointer = 1;
>  }
>
>  /* Implement part of TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE hook.  */
> @@ -2590,12 +2596,6 @@ ix86_option_override_internal (bool main_args_p,
>  opts->x_target_flags |= MASK_NO_RED_ZONE;
>  }
>
> -  /* Keep nonleaf frame pointers.  */
> -  if (opts->x_flag_omit_frame_pointer)
> -opts->x_target_flags &= ~MASK_OMIT_LEAF_FRAME_POINTER;
> -  else if (TARGET_OMIT_LEAF_FRAME_POINTER_P (opts->x_target_flags))
> -opts->x_flag_omit_frame_pointer = 1;
> -
>/* If we're doing fast math, we don't care about comparison order
>   wrt NaNs.  This lets us use a shorter comparison sequence.  */
>if (opts->x_flag_finite_math_only)
>
>
> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
>Free Software Activist   GNU Toolchain Engineer
> More tolerance and less prejudice are key for inclusion and diversity
> Excluding neuro-others for not behaving ""normal"" is *not* inclusive



-- 
BR,
Hongtao


Re: [PATCH] [APX NF] Add a pass to convert legacy insn to NF insns

2024-07-14 Thread Hongtao Liu
On Wed, Jul 10, 2024 at 2:46 PM Hongyu Wang  wrote:
>
> Hi,
>
> For APX ccmp, the current infrastructure always generates a cstore for
> the ccmp flag user, like:
>
> cmpe%rcx, %r8
> ccmpnel %rax, %rbx
> seta%dil
> add %rcx, %r9
> add %r9, %rdx
> testb   %dil, %dil
> je  .L2
>
> In such a case, the legacy add clobbers FLAGS_REG, so an extra cstore
> is needed to keep the flag from being reset before it is used.  If the
> instructions between the flag producer and user are NF insns, the
> setcc/test sequence is not required.
>
> Add a pass to convert legacy flag-clobbering insns to their NF
> counterparts.  The conversion only happens when:
> 1. APX_NF is enabled.
> 2. Within a basic block, a cstore is found and there are insns between
> that cstore and the next explicit set of FLAGS_REG (test or cmp).
> 3. All the insns in between have an NF counterpart.
>
> The pass is added after rtl-ifcvt, which eliminates some branches when
> profitable and can therefore leave a flag-clobbering insn between the
> cstore and jcc.
>
> Bootstrapped & regtested on x86_64-pc-linux-gnu and SDE. Also passed
> spec2017 simulation run on SDE.
>
> Ok for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/i386.md (has_nf): New define_attr, add to all
> nf related patterns.
> * config/i386/i386-features.cc (apx_nf_convert): New function
> to convert Non-NF insns to their NF counterparts.
> (class pass_apx_nf_convert): New pass class.
> (make_pass_apx_nf_convert): New.
> * config/i386/i386-passes.def: Add pass_apx_nf_convert after
> rtl_ifcvt.
> * config/i386/i386-protos.h (make_pass_apx_nf_convert): Declare.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/apx-nf-2.c: New test.
> ---
>  gcc/config/i386/i386-features.cc | 163 +++
>  gcc/config/i386/i386-passes.def  |   1 +
>  gcc/config/i386/i386-protos.h|   1 +
>  gcc/config/i386/i386.md  |  67 +-
>  gcc/testsuite/gcc.target/i386/apx-nf-2.c |  32 +
>  5 files changed, 259 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-nf-2.c
>
> diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
> index fc224ed06b0..3da56ddbdcc 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -3259,6 +3259,169 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
>return new pass_remove_partial_avx_dependency (ctxt);
>  }
>
> +/* Convert legacy instructions that clobbers EFLAGS to APX_NF
> +   instructions when there are no flag set between a flag
> +   producer and user.  */
> +
> +static unsigned int
> +ix86_apx_nf_convert (void)
> +{
> +  timevar_push (TV_MACH_DEP);
> +
> +  basic_block bb;
> +  rtx_insn *insn;
> +  hash_map  converting_map;
> +  auto_vec  current_convert_list;
> +
> +  bool converting_seq = false;
> +  rtx cc = gen_rtx_REG (CCmode, FLAGS_REG);
> +
> +  FOR_EACH_BB_FN (bb, cfun)
> +{
> +  /* Reset conversion for each bb.  */
> +  converting_seq = false;
> +  FOR_BB_INSNS (bb, insn)
> +   {
> + if (!NONDEBUG_INSN_P (insn))
> +   continue;
> +
> + if (recog_memoized (insn) < 0)
> +   continue;
> +
> + /* Convert candidate insns after cstore, which should
> +satisify the two conditions:
> +1. Is not flag user or producer, only clobbers
> +FLAGS_REG.
> +2. Have corresponding nf pattern.  */
> +
> + rtx pat = PATTERN (insn);
> +
> + /* Starting convertion at first cstorecc.  */
> + rtx set = NULL_RTX;
> + if (!converting_seq
> + && (set = single_set (insn))
> + && ix86_comparison_operator (SET_SRC (set), VOIDmode)
> + && reg_overlap_mentioned_p (cc, SET_SRC (set))
> + && !reg_overlap_mentioned_p (cc, SET_DEST (set)))
> +   {
> + converting_seq = true;
> + current_convert_list.truncate (0);
> +   }
> + /* Terminate at the next explicit flag set.  */
> + else if (reg_set_p (cc, pat)
> +  && GET_CODE (set_of (cc, pat)) != CLOBBER)
> +   converting_seq = false;
> +
> + if (!converting_seq)
> +   continue;
> +
> + if (get_attr_has_nf (insn)
> + && GET_CODE (pat) == PARALLEL)
> +   {
> + /* Record the insn to candidate map.  */
> + current_convert_list.safe_push (insn);
> + converting_map.put (insn, pat);
> +   }
> + /* If the insn clobbers flags but has no nf_attr,
> +revoke all previous candidates.  */
> + else if (!get_attr_has_nf (insn)
> +  && reg_set_p (cc, pat)
> +  && GET_CODE (set_of (cc, pat)) == CLOBBER)
> +   {
> + for (auto item : current_conv

Re: [PATCH] AVX512BF16: Do not allow permutation with vcvtne2ps2bf16 [PR115889]

2024-07-14 Thread Hongtao Liu
On Mon, Jul 15, 2024 at 10:21 AM Hongyu Wang  wrote:
>
> > Could you just git revert 6d0b7b69d143025f271d0041cfa29cf26e6c343b?
>
> We can still deal with BFmode permutation the same way as HFmode, so
> the change in ix86_vectorize_vec_perm_const can be preserved.
>
> Hongtao Liu  于2024年7月15日周一 09:40写道:
> >
> > On Sat, Jul 13, 2024 at 3:44 PM Hongyu Wang  wrote:
> > >
> > > Hi,
> > >
> > > According to the AVX512BF16 instruction spec, the conversion from
> > > float to BF16 is not a simple truncation.  It has special handling
> > > for denormal/NaN, and even for a normal float it adds a rounding
> > > bias based on the least significant bit of the resulting bf16
> > > number.  This means we cannot use vcvtne2ps2bf16 for arbitrary
> > > bf16 vector shuffles.
> > > The optimization introduced in r15-1368 added a specific split that
> > > converts HImode permutations with this instruction, so remove it
> > > and treat BFmode permutations the same as HFmode ones.
I see, patch LGTM.
> > >
> > > Bootstrapped & regtested on x86_64-pc-linux-gnu. OK for trunk?
> > Could you just git revert 6d0b7b69d143025f271d0041cfa29cf26e6c343b?
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/115889
> > > * config/i386/predicates.md (vcvtne2ps2bf_parallel): Remove.
> > > * config/i386/sse.md (hi_cvt_bf): Remove.
> > > (HI_CVT_BF): Likewise.
> > > (vpermt2_sepcial_bf16_shuffle_): Likewise.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > PR target/115889
> > > * gcc.target/i386/vpermt2-special-bf16-shufflue.c: Adjust option
> > > and output scan.
> > > ---
> > >  gcc/config/i386/predicates.md | 11 --
> > >  gcc/config/i386/sse.md| 35 ---
> > >  .../i386/vpermt2-special-bf16-shufflue.c  |  5 ++-
> > >  3 files changed, 2 insertions(+), 49 deletions(-)
> > >
> > > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> > > index a894847adaf..5d0bb1e0f54 100644
> > > --- a/gcc/config/i386/predicates.md
> > > +++ b/gcc/config/i386/predicates.md
> > > @@ -2327,14 +2327,3 @@ (define_predicate "apx_ndd_add_memory_operand"
> > >
> > >return true;
> > >  })
> > > -
> > > -;; Check that each element is odd and incrementally increasing from 1
> > > -(define_predicate "vcvtne2ps2bf_parallel"
> > > -  (and (match_code "const_vector")
> > > -   (match_code "const_int" "a"))
> > > -{
> > > -  for (int i = 0; i < XVECLEN (op, 0); ++i)
> > > -if (INTVAL (XVECEXP (op, 0, i)) != (2 * i + 1))
> > > -  return false;
> > > -  return true;
> > > -})
> > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > index b3b4697924b..c134494cd20 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -31460,38 +31460,3 @@ (define_insn "vpdp_"
> > >"TARGET_AVXVNNIINT16"
> > >"vpdp\t{%3, %2, %0|%0, %2, %3}"
> > > [(set_attr "prefix" "vex")])
> > > -
> > > -(define_mode_attr hi_cvt_bf
> > > -  [(V8HI "v8bf") (V16HI "v16bf") (V32HI "v32bf")])
> > > -
> > > -(define_mode_attr HI_CVT_BF
> > > -  [(V8HI "V8BF") (V16HI "V16BF") (V32HI "V32BF")])
> > > -
> > > -(define_insn_and_split "vpermt2_sepcial_bf16_shuffle_"
> > > -  [(set (match_operand:VI2_AVX512F 0 "register_operand")
> > > -   (unspec:VI2_AVX512F
> > > - [(match_operand:VI2_AVX512F 1 "vcvtne2ps2bf_parallel")
> > > -  (match_operand:VI2_AVX512F 2 "register_operand")
> > > -  (match_operand:VI2_AVX512F 3 "nonimmediate_operand")]
> > > -  UNSPEC_VPERMT2))]
> > > -  "TARGET_AVX512VL && TARGET_AVX512BF16 && ix86_pre_reload_split ()"
> > > -  "#"
> > > -  "&& 1"
> > > -  [(const_int 0)]
> > > -{
> > > -  rtx op0 = gen_reg_rtx (mode);
> > > -  operands[2] = lowpart_subreg (mode,
> > > -   force_reg (mode, operands[2]),
> > > -   mode);
> > > -  operands[3] = lowpart_subreg (mode,

Re: [PATCH] AVX512BF16: Do not allow permutation with vcvtne2ps2bf16 [PR115889]

2024-07-14 Thread Hongtao Liu
On Sat, Jul 13, 2024 at 3:44 PM Hongyu Wang  wrote:
>
> Hi,
>
> According to the AVX512BF16 instruction spec, the conversion from float
> to BF16 is not a simple truncation.  It has special handling for
> denormal/NaN, and even for a normal float it adds a rounding bias based
> on the least significant bit of the resulting bf16 number.  This means
> we cannot use vcvtne2ps2bf16 for arbitrary bf16 vector shuffles.
> The optimization introduced in r15-1368 added a specific split that
> converts HImode permutations with this instruction, so remove it and
> treat BFmode permutations the same as HFmode ones.
>
> Bootstrapped & regtested on x86_64-pc-linux-gnu. OK for trunk?
Could you just git revert 6d0b7b69d143025f271d0041cfa29cf26e6c343b?
>
> gcc/ChangeLog:
>
> PR target/115889
> * config/i386/predicates.md (vcvtne2ps2bf_parallel): Remove.
> * config/i386/sse.md (hi_cvt_bf): Remove.
> (HI_CVT_BF): Likewise.
> (vpermt2_sepcial_bf16_shuffle_): Likewise.
>
> gcc/testsuite/ChangeLog:
>
> PR target/115889
> * gcc.target/i386/vpermt2-special-bf16-shufflue.c: Adjust option
> and output scan.
> ---
>  gcc/config/i386/predicates.md | 11 --
>  gcc/config/i386/sse.md| 35 ---
>  .../i386/vpermt2-special-bf16-shufflue.c  |  5 ++-
>  3 files changed, 2 insertions(+), 49 deletions(-)
>
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index a894847adaf..5d0bb1e0f54 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -2327,14 +2327,3 @@ (define_predicate "apx_ndd_add_memory_operand"
>
>return true;
>  })
> -
> -;; Check that each element is odd and incrementally increasing from 1
> -(define_predicate "vcvtne2ps2bf_parallel"
> -  (and (match_code "const_vector")
> -   (match_code "const_int" "a"))
> -{
> -  for (int i = 0; i < XVECLEN (op, 0); ++i)
> -if (INTVAL (XVECEXP (op, 0, i)) != (2 * i + 1))
> -  return false;
> -  return true;
> -})
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index b3b4697924b..c134494cd20 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -31460,38 +31460,3 @@ (define_insn "vpdp_"
>"TARGET_AVXVNNIINT16"
>"vpdp\t{%3, %2, %0|%0, %2, %3}"
> [(set_attr "prefix" "vex")])
> -
> -(define_mode_attr hi_cvt_bf
> -  [(V8HI "v8bf") (V16HI "v16bf") (V32HI "v32bf")])
> -
> -(define_mode_attr HI_CVT_BF
> -  [(V8HI "V8BF") (V16HI "V16BF") (V32HI "V32BF")])
> -
> -(define_insn_and_split "vpermt2_sepcial_bf16_shuffle_"
> -  [(set (match_operand:VI2_AVX512F 0 "register_operand")
> -   (unspec:VI2_AVX512F
> - [(match_operand:VI2_AVX512F 1 "vcvtne2ps2bf_parallel")
> -  (match_operand:VI2_AVX512F 2 "register_operand")
> -  (match_operand:VI2_AVX512F 3 "nonimmediate_operand")]
> -  UNSPEC_VPERMT2))]
> -  "TARGET_AVX512VL && TARGET_AVX512BF16 && ix86_pre_reload_split ()"
> -  "#"
> -  "&& 1"
> -  [(const_int 0)]
> -{
> -  rtx op0 = gen_reg_rtx (mode);
> -  operands[2] = lowpart_subreg (mode,
> -   force_reg (mode, operands[2]),
> -   mode);
> -  operands[3] = lowpart_subreg (mode,
> -   force_reg (mode, operands[3]),
> -   mode);
> -
> -  emit_insn (gen_avx512f_cvtne2ps2bf16_(op0,
> -  operands[3],
> -  operands[2]));
> -  emit_move_insn (operands[0], lowpart_subreg (mode, op0,
> -  mode));
> -  DONE;
> -}
> -[(set_attr "mode" "")])
> diff --git a/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c b/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
> index 5c65f2a9884..4cbc85735de 100755
> --- a/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
> +++ b/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
> @@ -1,7 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -mavx512bf16 -mavx512vl" } */
> -/* { dg-final { scan-assembler-not "vpermi2b" } } */
> -/* { dg-final { scan-assembler-times "vcvtne2ps2bf16" 3 } } */
> +/* { dg-options "-O2 -mavx512vbmi -mavx512vl" } */
> +/* { dg-final { scan-assembler-times "vpermi2w" 3 } } */
>
>  typedef __bf16 v8bf __attribute__((vector_size(16)));
>  typedef __bf16 v16bf __attribute__((vector_size(32)));
> --
> 2.34.1
>


-- 
BR,
Hongtao


Re: [x86 SSE PATCH] Some AVX512 ternlog expansion refinements (take #2)

2024-07-11 Thread Hongtao Liu
On Fri, Jul 12, 2024 at 5:33 AM Roger Sayle  wrote:
>
>
> Hi Hongtao,
> Thanks for the review and pointing out the remaining uses of force_reg
> that I'd overlooked.  Here's a revised version of the patch that incorporates
> your feedback.  One minor change was that rather than using
> memory_operand, which as you point out also needs to include
> bcst_mem_operand, it's simpler to invert the logic to check for
> register_operand [i.e. the first operand must be a register].
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
Ok.
>
>
> 2024-07-11  Roger Sayle  
> Hongtao Liu  
>
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_broadcast_from_constant):
> Use CONST_VECTOR_P instead of comparison against GET_CODE.
> (ix86_gen_bcst_mem): Likewise.
> (ix86_ternlog_leaf_p): Likewise.
> (ix86_ternlog_operand_p): ix86_ternlog_leaf_p is always true for
> vector_all_ones_operand.
> (ix86_expand_ternlog_bin_op): Use CONST_VECTOR_P instead of
> equality comparison against GET_CODE.  Replace call to force_reg
> with gen_reg_rtx and emit_move_insn (for VEC_DUPLICATE broadcast).
> Check for !register_operand instead of memory_operand.
> Support CONST_VECTORs by calling force_const_mem.
> (ix86_expand_ternlog): Fix indentation whitespace.
> Allow ix86_ternlog_leaf_p as ix86_expand_ternlog_andnot's second
> operand. Use CONST_VECTOR_P instead of equality against GET_CODE.
> Use gen_reg_rtx and emit_move_insn for ~a, ~b and ~c cases.
>
>
> Thanks again,
> Roger
>
> > -Original Message-
> > From: Hongtao Liu 
> > Sent: 08 July 2024 02:55
> > To: Roger Sayle 
> > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak 
> > Subject: Re: [x86 SSE PATCH] Some AVX512 ternlog expansion refinements.
> >
> > On Sun, Jul 7, 2024 at 5:00 PM Roger Sayle 
> > wrote:
> > > Hi Hongtao,
> > > This should address concerns about the remaining use of force_reg.
> > >
> >  51@@ -25793,15 +25792,20 @@ ix86_expand_ternlog_binop (enum rtx_code
> > code, machine_mode mode,
> >  52   if (GET_MODE (op1) != mode)
> >  53 op1 = gen_lowpart (mode, op1);
> >  54
> >  55-  if (GET_CODE (op0) == CONST_VECTOR)
> >  56+  if (CONST_VECTOR_P (op0))
> >  57 op0 = validize_mem (force_const_mem (mode, op0));
> >  58-  if (GET_CODE (op1) == CONST_VECTOR)
> >  59+  if (CONST_VECTOR_P (op1))
> >  60 op1 = validize_mem (force_const_mem (mode, op1));
> >  61
> >  62   if (memory_operand (op0, mode))
> >  63 {
> >  64   if (memory_operand (op1, mode))
> >  65-   op0 = force_reg (mode, op0);
> >  66+   {
> >  67+ /* We can't use force_reg (op0, mode).  */
> >  68+ rtx reg = gen_reg_rtx (mode);
> >  69+ emit_move_insn (reg, op0);
> >  70+ op0 = reg;
> >  71+   }
> > Shouldn't we handle bcst_mem_operand instead of
> > memory_operand(bcst_memory_operand is not a memory_operand)?
> > so maybe
> > if (memory_operand (op0, mode0) || bcst_mem_operand (op0, mode0)
> >   if (memory_operand (op1, mode) || bcst_mem_operand (op1, mode0)?
> >  72   else
> >  73std::swap (op0, op1);
> >  74 }
> >
> > Also there's force_reg in below 3 cases, are there any restrictions to avoid
> > bcst_mem_operand into them?
> > case 0x0f:  /* ~a */
> > case 0x33:  /* ~b */
> > case 0x55:  /* ~c */
> > ..
> >  if (!TARGET_64BIT && !register_operand (op2, mode))
> >op2 = force_reg (mode, op2);
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao

