Re: [PATCH] i386: Disallow long address mode in the x32 mode. [PR 117418]
On Fri, Nov 8, 2024 at 12:18 PM H.J. Lu wrote: > > On Fri, Nov 8, 2024 at 10:41 AM Hu, Lin1 wrote: > > > > Hi, all > > > > -maddress-mode=long will let Pmode = DI_mode, but -mx32 request x32 ABI. > > So raise an error to avoid ICE. > > > > Bootstrapped and regtested, OK for trunk? > > > > BRs, > > Lin > > > > gcc/ChangeLog: > > > > PR target/117418 > > * config/i386/i386-options.cc (ix86_option_override_internal): > > raise an > > error with option -mx32 -maddress-mode=long. > > > > gcc/testsuite/ChangeLog: > > > > PR target/117418 > > * gcc.target/i386/pr117418-1.c: New test. > > --- > > gcc/config/i386/i386-options.cc| 4 > > gcc/testsuite/gcc.target/i386/pr117418-1.c | 13 + > > 2 files changed, 17 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr117418-1.c > > > > diff --git a/gcc/config/i386/i386-options.cc > > b/gcc/config/i386/i386-options.cc > > index 239269ecbdd..ba1abea2537 100644 > > --- a/gcc/config/i386/i386-options.cc > > +++ b/gcc/config/i386/i386-options.cc > > @@ -2190,6 +2190,10 @@ ix86_option_override_internal (bool main_args_p, > > error ("address mode %qs not supported in the %s bit mode", > >TARGET_64BIT_P (opts->x_ix86_isa_flags) ? "short" : "long", > >TARGET_64BIT_P (opts->x_ix86_isa_flags) ? "64" : "32"); > > + > > + if (TARGET_X32_P (opts->x_ix86_isa_flags) > > + && opts_set->x_ix86_pmode == PMODE_DI) > > + error ("address mode 'long' not supported in the x32 ABI"); > > This looks wrong. Try the encoded patch. > So it means -maddress-mode=long will override x32 to use 64-bit pointer? 
> > } > >else > > opts->x_ix86_pmode = TARGET_LP64_P (opts->x_ix86_isa_flags) > > diff --git a/gcc/testsuite/gcc.target/i386/pr117418-1.c > > b/gcc/testsuite/gcc.target/i386/pr117418-1.c > > new file mode 100644 > > index 000..08430ef9d4b > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr117418-1.c > > @@ -0,0 +1,13 @@ > > +/* PR target/117418 */ > > +/* { dg-do compile } */ > > +/* { dg-options "-maddress-mode=long -mwidekl -mx32" } */ > > +/* { dg-error "address mode 'long' not supported in the x32 ABI" "" { > > target *-*-* } 0 } */ > > + > > +typedef __attribute__((__vector_size__(16))) long long V; > > +V a; > > + > > +void > > +foo() > > +{ > > +__builtin_ia32_encodekey256_u32(0, a, a, &a); > > +} > > -- > > 2.31.1 > > > > > -- > H.J. -- BR, Hongtao
Re: [PATCH] [x86_64] Add microarchitecture tunable for pass_align_tight_loops
On Fri, Nov 8, 2024 at 10:21 AM Mayshao-oc wrote: > > > > -Original Message- > > > From: Xi Ruoyao > > > Sent: Thursday, November 7, 2024 1:12 PM > > > To: Liu, Hongtao ; Mayshao-oc > > o...@zhaoxin.com>; Hongtao Liu > > > Cc: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; ubiz...@gmail.com; > > > richard.guent...@gmail.com; Tim Hu(WH-RD) ; Silvia > > > Zhao(BJ-RD) ; Louis Qi(BJ-RD) > > > ; Cobe Chen(BJ-RD) > > > Subject: Re: [PATCH] [x86_64] Add microarchtecture tunable for > > > pass_align_tight_loops > > > On Thu, 2024-11-07 at 04:58 +, Liu, Hongtao wrote: > > > > > > > Hi all: > > > > > > > For zhaoxin, I find no improvement when enable > > > > > > > pass_align_tight_loops, and have performance drop in some cases. > > > > > > > This patch add a new tunable to bypass > > > > > > > pass_align_tight_loops in > > > > > zhaoxin. > > > > > > > > > > > > > > Bootstrapped X86_64. > > > > > > > Ok for trunk? > > > > LGTM. > > > > > > I'd suggest to add the reference to PR 117438 into the subject and > > > ChangeLog. > > Yes, thanks. > Add PR 117438 into the subject and ChangeLog. PR target/117438 Others LGTM. > > > > > > -- > > > Xi Ruoyao > > > School of Aerospace Science and Technology, Xidian University > BR > Mayshao -- BR, Hongtao
Re: [PATCH v4 7/8] i386: Add zero maskload else operand.
On Fri, Nov 8, 2024 at 1:58 AM Robin Dapp wrote: > > From: Robin Dapp > > gcc/ChangeLog: > > * config/i386/sse.md (maskload): > Call maskload..._1. > (maskload_1): Rename. Ok for x86 part. > --- > gcc/config/i386/sse.md | 21 ++--- > 1 file changed, 18 insertions(+), 3 deletions(-) > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index 22c6c817dd7..1523e2c4d75 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -28641,7 +28641,7 @@ (define_insn > "_maskstore" > (set_attr "btver2_decode" "vector") > (set_attr "mode" "")]) > > -(define_expand "maskload" > +(define_expand "maskload_1" >[(set (match_operand:V48_128_256 0 "register_operand") > (unspec:V48_128_256 > [(match_operand: 2 "register_operand") > @@ -28649,13 +28649,28 @@ (define_expand "maskload" > UNSPEC_MASKMOV))] >"TARGET_AVX") > > +(define_expand "maskload" > + [(set (match_operand:V48_128_256 0 "register_operand") > + (unspec:V48_128_256 > + [(match_operand: 2 "register_operand") > + (match_operand:V48_128_256 1 "memory_operand") > + (match_operand:V48_128_256 3 "const0_operand")] > + UNSPEC_MASKMOV))] > + "TARGET_AVX" > +{ > + emit_insn (gen_maskload_1 (operands[0], > + operands[1], > + operands[2])); > + DONE; > +}) > + > (define_expand "maskload" >[(set (match_operand:V48_AVX512VL 0 "register_operand") > (vec_merge:V48_AVX512VL > (unspec:V48_AVX512VL > [(match_operand:V48_AVX512VL 1 "memory_operand")] > UNSPEC_MASKLOAD) > - (match_dup 0) > + (match_operand:V48_AVX512VL 3 "const0_operand") > (match_operand: 2 "register_operand")))] >"TARGET_AVX512F") > > @@ -28665,7 +28680,7 @@ (define_expand "maskload" > (unspec:VI12HFBF_AVX512VL > [(match_operand:VI12HFBF_AVX512VL 1 "memory_operand")] > UNSPEC_MASKLOAD) > - (match_dup 0) > + (match_operand:VI12HFBF_AVX512VL 3 "const0_operand") > (match_operand: 2 "register_operand")))] >"TARGET_AVX512BW") > > -- > 2.47.0 > -- BR, Hongtao
Re: [PATCH 1/2] [x86] Support vector float_truncate for SF to BF.
On Thu, Nov 7, 2024 at 3:52 PM Jakub Jelinek wrote: > > On Thu, Nov 07, 2024 at 01:57:21PM +0800, Hongtao Liu wrote: > > > Does it turn the sNaNs into infinities or qNaNs silently? > > Yes. > > Into infinities? Into qNaNs. (Sorry, I didn't see it clearly.) > > > > Given the rounding, flag_rounding_math should avoid the hw instructions, > > The default rounding mode for flag_rounding_math is rounding to > > nearest, so I assume !flag_rounding_math is not needed for the > > condition. > > flag_rounding_math is about it being ok to change the rounding mode at > runtime. So, with flag_rounding_math you can't rely on the rounding mode > being to nearest, with !flag_rounding_math we do rely on it. I see. > So !flag_rounding_math is needed. It is not on by default, so it isn't > a big deal... > > Jakub > -- BR, Hongtao
Re: [PATCH] i386: Add -mavx512vl for pr117304-1.c
On Thu, Nov 7, 2024 at 2:04 PM Hu, Lin1 wrote: > > > -Original Message- > > From: Liu, Hongtao > > Sent: Thursday, November 7, 2024 11:41 AM > > To: Hu, Lin1 ; gcc-patches@gcc.gnu.org > > Cc: ubiz...@gmail.com > > Subject: RE: [PATCH] i386: Add -mavx512vl for pr117304-1.c > > > > > > > > > -Original Message- > > > From: Hu, Lin1 > > > Sent: Thursday, November 7, 2024 11:03 AM > > > To: gcc-patches@gcc.gnu.org > > > Cc: Liu, Hongtao ; ubiz...@gmail.com > > > Subject: [PATCH] i386: Add -mavx512vl for pr117304-1.c > > > > > > Hi, all > > > > > > Testing pr117304-1.c in a machine with only avx2 generates some > > > different hints, so add -mavx512vl at its option list. > > Didn't quite understand, what kind of hint it is, why avx512vl is needed? > > When I cherry-pick this patch based releases/gcc-14, I found if without > -mavx512vl, the hint will be __builtin_ia32_cvtdq2ps256 rather than > __builtin_ia32_cvtudq2ps128_mask. Based on lookup_name_fuzzy's comment " Look > for the closest match for NAME within the currently valid scopes.", I think > the hint is right. And the trunk's hint is wrong only with -mavx512f > -mevex512. To avoid someone change back the hint output, so I want to add the > option -mavx512vl, so the hint is right for now. Can we use regexp in the hint to avoid any change in the future? > > BRs, > Lin > > > > > > > Bootstrapped and regtested on x86-64-pc-linux-gnu. > > > I think it is an obvious commit, but I still waiting for some while. > > > If someone have other suggestion. > > > > > > BRs, > > > Lin > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/i386/pr117304-1.c: Add -mavx512vl. 
> > > --- > > > gcc/testsuite/gcc.target/i386/pr117304-1.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr117304-1.c > > > b/gcc/testsuite/gcc.target/i386/pr117304-1.c > > > index fc1c5bfd3e3..da26f4bd1b7 100644 > > > --- a/gcc/testsuite/gcc.target/i386/pr117304-1.c > > > +++ b/gcc/testsuite/gcc.target/i386/pr117304-1.c > > > @@ -1,6 +1,6 @@ > > > /* PR target/117304 */ > > > /* { dg-do compile } */ > > > -/* { dg-options "-O2 -mavx512f -mno-evex512" } */ > > > +/* { dg-options "-O2 -mavx512f -mno-evex512 -mavx512vl" } */ > > > > > > typedef __attribute__((__vector_size__(32))) int __v8si; typedef > > > __attribute__((__vector_size__(32))) unsigned int __v8su; > > > -- > > > 2.31.1 > -- BR, Hongtao
Re: [PATCH 1/2] [x86] Support vector float_truncate for SF to BF.
On Tue, Nov 5, 2024 at 5:19 PM Jakub Jelinek wrote: > > On Tue, Nov 05, 2024 at 05:12:56PM +0800, Hongtao Liu wrote: > > Yes, there's a mismatch between scalar and vector code, I assume users > > may not care much about precision/NAN/INF/denormal behaviors for > > vector code. > > Just like we support > > #define RECIP_MASK_DEFAULT (RECIP_MASK_VEC_DIV | RECIP_MASK_VEC_SQRT) > > but turn off > > RECIP_MASK_DIV | RECIP_MASK_SQRT. > > Users who don't care should be using -ffast-math. Users who do care > should get proper behavior. > > > > I don't know what exactly the hw instructions do, whether they perform > > > everything needed properly or just subset of it or none of it, > > > > Subset of it, hw instruction doesn't raise exceptions and always round > > to nearest (even). Output denormals are always flushed to zero and > > input denormals are always treated as zero. MXCSR is not consulted nor > > updated. > > Does it turn the sNaNs into infinities or qNaNs silently? Yes. > Given the rounding, flag_rounding_math should avoid the hw instructions, The default rounding mode for flag_rounding_math is rounding to nearest, so I assume !flag_rounding_math is not needed for the condition. > and either HONOR_NANS or HONOR_SNANS should be used to predicate that. > > > > but the permutation fallback IMHO definitely needs to be guarded with > > > the same flags as scalar code. > > > For HONOR_NANS case or flag_rounding_math, the generic code (see expr.cc) > > > uses the libgcc fallback. Otherwise, generic code has > > > /* If we don't expect qNaNs nor sNaNs and can assume rounding > > > to nearest, we can expand the conversion inline as > > > (fromi + 0x7fff + ((fromi >> 16) & 1)) >> 16. */ > > > and the backend has > > > TARGET_SSE2 && flag_unsafe_math_optimizations && !HONOR_NANS (BFmode) > > > shift (i.e. just the permutation). > > > Note, even that (fromi + 0x7fff + ((fromi >> 16) & 1)) >> 16 > > > is doable in vectors. 
> > > > If you're concerned about that, I'll commit another patch to align the > > condition of the vector expander with scalar ones for both extendmn2 > > and truncmn2. > > For the fallback, for HONOR_NANS or flag_rounding_math we just shouldn't > use the fallback at all. For flag_unsafe_math_optimizations, we can just > use the simple permutation, i.ew. fromi >> 16, otherwise can use that > (fromi + 0x7fff + ((fromi >> 16) & 1) followed by the permutation. > > Jakub > -- BR, Hongtao
Re: [PATCH] [x86_64] Add microarchitecture tunable for pass_align_tight_loops
On Thu, Nov 7, 2024 at 10:29 AM MayShao-oc wrote: > > Hi all: >For zhaoxin, I find no improvement when enable pass_align_tight_loops, > and have performance drop in some cases. >This patch add a new tunable to bypass pass_align_tight_loops in zhaoxin. > >Bootstrapped X86_64. >Ok for trunk? > BR > Mayshao > gcc/ChangeLog: > > * config/i386/i386-features.cc (TARGET_ALIGN_TIGHT_LOOPS): > default true in all processors except for zhaoxin. > * config/i386/i386.h (TARGET_ALIGN_TIGHT_LOOPS): New Macro. > * config/i386/x86-tune.def (X86_TUNE_ALIGN_TIGHT_LOOPS): > New tune > --- > gcc/config/i386/i386-features.cc | 4 +++- > gcc/config/i386/i386.h | 3 +++ > gcc/config/i386/x86-tune.def | 4 > 3 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/gcc/config/i386/i386-features.cc > b/gcc/config/i386/i386-features.cc > index e2e85212a4f..d9fd92964fe 100644 > --- a/gcc/config/i386/i386-features.cc > +++ b/gcc/config/i386/i386-features.cc > @@ -3620,7 +3620,9 @@ public: >/* opt_pass methods: */ >bool gate (function *) final override > { > - return optimize && optimize_function_for_speed_p (cfun); > + return TARGET_ALIGN_TIGHT_LOOPS > +&& optimize > +&& optimize_function_for_speed_p (cfun); > } > >unsigned int execute (function *) final override > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > index 2dcd8803a08..7f9010246c2 100644 > --- a/gcc/config/i386/i386.h > +++ b/gcc/config/i386/i386.h > @@ -466,6 +466,9 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; > #define TARGET_USE_RCR ix86_tune_features[X86_TUNE_USE_RCR] > #define TARGET_SSE_MOVCC_USE_BLENDV \ > ix86_tune_features[X86_TUNE_SSE_MOVCC_USE_BLENDV] > +#define TARGET_ALIGN_TIGHT_LOOPS \ > +ix86_tune_features[X86_TUNE_ALIGN_TIGHT_LOOPS] > + > > /* Feature tests against the various architecture variations. 
*/ > enum ix86_arch_indices { > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def > index 6ebb2fd3414..bd4fa8b3eee 100644 > --- a/gcc/config/i386/x86-tune.def > +++ b/gcc/config/i386/x86-tune.def > @@ -542,6 +542,10 @@ DEF_TUNE (X86_TUNE_V2DF_REDUCTION_PREFER_HADDPD, > DEF_TUNE (X86_TUNE_SSE_MOVCC_USE_BLENDV, > "sse_movcc_use_blendv", ~m_CORE_ATOM) > > +/* X86_TUNE_ALIGN_TIGHT_LOOPS: if false, tight loops are not aligned. */ > +DEF_TUNE (X86_TUNE_ALIGN_TIGHT_LOOPS, "align_tight_loops", > +~(m_ZHAOXIN)) Please also add ~(m_ZHAOXIN | m_CASCADELAKE | m_SKYLAKE_AVX512)) And could you put it under the section of /*/ -/* Branch predictor tuning */ +/* Branch predictor and The Front-end tuning */ /*/ > + > > /*/ > /* AVX instruction selection tuning (some of SSE flags affects AVX, too) > */ > > /*/ > -- > 2.27.0 > -- BR, Hongtao
Re: [PATCH] testsuite: Fix up pr116725.c test [PR116725]
On Wed, Nov 6, 2024 at 4:59 PM Jakub Jelinek wrote: > > On Fri, Oct 18, 2024 at 02:05:59PM -0400, Antoni Boucher wrote: > > PR target/116725 > > * gcc.target/i386/pr116725.c: Add test using those AVX builtins. > > This test FAILs for me, as I don't have the latest gas around and the test > is dg-do assemble, so doesn't need just fixed compiler, but also assembler > which supports those instructions. > > The following patch adds effective target directives to ensure assembler > supports those too. > > Tested on x86_64-linux, ok for trunk? Ok. > > 2024-11-06 Jakub Jelinek > > PR target/116725 > * gcc.target/i386/pr116725.c: Add dg-require-effective-target > avx512{dq,fp16,vl}. > > --- gcc/testsuite/gcc.target/i386/pr116725.c.jj 2024-11-05 22:07:12.588795051 > +0100 > +++ gcc/testsuite/gcc.target/i386/pr116725.c2024-11-06 09:54:41.545064629 > +0100 > @@ -2,6 +2,9 @@ > /* { dg-do assemble } */ > /* { dg-options "-masm=intel -mavx512dq -mavx512fp16 -mavx512vl" } */ > /* { dg-require-effective-target masm_intel } */ > +/* { dg-require-effective-target avx512dq } */ > +/* { dg-require-effective-target avx512fp16 } */ > +/* { dg-require-effective-target avx512vl } */ > > #include > > Jakub > -- BR, Hongtao
Re: [PATCH] i386: Add OPTION_MASK_ISA2_EVEX512 for some AVX512 instructions.
On Wed, Nov 6, 2024 at 10:35 AM Hu, Lin1 wrote: > > Hi, all > > This patch aims to add OPTION_MASK_ISA2_EVEX512 for all avx512 512-bits > builtin functions, raise error when these builtin functions are used with > -mno-evex512. > > Bootstrapped and Regtested on x86-64-pc-linux-gnu, OK for trunk and backport > to > GCC14? > > BRs, > Lin > > gcc/ChangeLog: > > PR target/117304 > * config/i386/i386-builtin.def: Add OPTION_MASK_ISA2_EVEX512 for some > AVX512 512-bits instructions. > > gcc/testsuite/ChangeLog: > > PR target/117304 > * gcc.target/i386/pr117304-1.c: New test. > --- > gcc/config/i386/i386-builtin.def | 10 > gcc/testsuite/gcc.target/i386/pr117304-1.c | 28 ++ > 2 files changed, 33 insertions(+), 5 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr117304-1.c > > diff --git a/gcc/config/i386/i386-builtin.def > b/gcc/config/i386/i386-builtin.def > index c484e6dc29e..26c23780b1c 100644 > --- a/gcc/config/i386/i386-builtin.def > +++ b/gcc/config/i386/i386-builtin.def > @@ -3357,11 +3357,11 @@ BDESC (OPTION_MASK_ISA_AVX512F, 0, > CODE_FOR_sse_cvtsi2ss_round, "__builtin_ia32_ > BDESC (OPTION_MASK_ISA_AVX512F | OPTION_MASK_ISA_64BIT, 0, > CODE_FOR_sse_cvtsi2ssq_round, "__builtin_ia32_cvtsi2ss64", > IX86_BUILTIN_CVTSI2SS64, UNKNOWN, (int) V4SF_FTYPE_V4SF_INT64_INT) > BDESC (OPTION_MASK_ISA_AVX512F, 0, CODE_FOR_sse2_cvtss2sd_round, > "__builtin_ia32_cvtss2sd_round", IX86_BUILTIN_CVTSS2SD_ROUND, UNKNOWN, (int) > V2DF_FTYPE_V2DF_V4SF_INT) > BDESC (OPTION_MASK_ISA_AVX512F, 0, CODE_FOR_sse2_cvtss2sd_mask_round, > "__builtin_ia32_cvtss2sd_mask_round", IX86_BUILTIN_CVTSS2SD_MASK_ROUND, > UNKNOWN, (int) V2DF_FTYPE_V2DF_V4SF_V2DF_UQI_INT) > -BDESC (OPTION_MASK_ISA_AVX512F, 0, > CODE_FOR_unspec_fix_truncv8dfv8si2_mask_round, > "__builtin_ia32_cvttpd2dq512_mask", IX86_BUILTIN_CVTTPD2DQ512, UNKNOWN, (int) > V8SI_FTYPE_V8DF_V8SI_QI_INT) > -BDESC (OPTION_MASK_ISA_AVX512F, 0, > CODE_FOR_unspec_fixuns_truncv8dfv8si2_mask_round, > "__builtin_ia32_cvttpd2udq512_mask", 
IX86_BUILTIN_CVTTPD2UDQ512, UNKNOWN, > (int) V8SI_FTYPE_V8DF_V8SI_QI_INT) > -BDESC (OPTION_MASK_ISA_AVX512F, 0, > CODE_FOR_unspec_fix_truncv16sfv16si2_mask_round, > "__builtin_ia32_cvttps2dq512_mask", IX86_BUILTIN_CVTTPS2DQ512, UNKNOWN, (int) > V16SI_FTYPE_V16SF_V16SI_HI_INT) > -BDESC (OPTION_MASK_ISA_AVX512F, 0, > CODE_FOR_unspec_fixuns_truncv16sfv16si2_mask_round, > "__builtin_ia32_cvttps2udq512_mask", IX86_BUILTIN_CVTTPS2UDQ512, UNKNOWN, > (int) V16SI_FTYPE_V16SF_V16SI_HI_INT) > -BDESC (OPTION_MASK_ISA_AVX512F, 0, CODE_FOR_floatunsv16siv16sf2_mask_round, > "__builtin_ia32_cvtudq2ps512_mask", IX86_BUILTIN_CVTUDQ2PS512, UNKNOWN, (int) > V16SF_FTYPE_V16SI_V16SF_HI_INT) > +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, > CODE_FOR_unspec_fix_truncv8dfv8si2_mask_round, > "__builtin_ia32_cvttpd2dq512_mask", IX86_BUILTIN_CVTTPD2DQ512, UNKNOWN, (int) > V8SI_FTYPE_V8DF_V8SI_QI_INT) > +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, > CODE_FOR_unspec_fixuns_truncv8dfv8si2_mask_round, > "__builtin_ia32_cvttpd2udq512_mask", IX86_BUILTIN_CVTTPD2UDQ512, UNKNOWN, > (int) V8SI_FTYPE_V8DF_V8SI_QI_INT) > +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, > CODE_FOR_unspec_fix_truncv16sfv16si2_mask_round, > "__builtin_ia32_cvttps2dq512_mask", IX86_BUILTIN_CVTTPS2DQ512, UNKNOWN, (int) > V16SI_FTYPE_V16SF_V16SI_HI_INT) > +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, > CODE_FOR_unspec_fixuns_truncv16sfv16si2_mask_round, > "__builtin_ia32_cvttps2udq512_mask", IX86_BUILTIN_CVTTPS2UDQ512, UNKNOWN, > (int) V16SI_FTYPE_V16SF_V16SI_HI_INT) > +BDESC (OPTION_MASK_ISA_AVX512F, OPTION_MASK_ISA2_EVEX512, > CODE_FOR_floatunsv16siv16sf2_mask_round, "__builtin_ia32_cvtudq2ps512_mask", > IX86_BUILTIN_CVTUDQ2PS512, UNKNOWN, (int) V16SF_FTYPE_V16SI_V16SF_HI_INT) > BDESC (OPTION_MASK_ISA_AVX512F | OPTION_MASK_ISA_64BIT, 0, > CODE_FOR_cvtusi2sd64_round, "__builtin_ia32_cvtusi2sd64", > IX86_BUILTIN_CVTUSI2SD64, UNKNOWN, (int) V2DF_FTYPE_V2DF_UINT64_INT) > BDESC 
(OPTION_MASK_ISA_AVX512F, 0, CODE_FOR_cvtusi2ss32_round, > "__builtin_ia32_cvtusi2ss32", IX86_BUILTIN_CVTUSI2SS32, UNKNOWN, (int) > V4SF_FTYPE_V4SF_UINT_INT) > BDESC (OPTION_MASK_ISA_AVX512F | OPTION_MASK_ISA_64BIT, 0, > CODE_FOR_cvtusi2ss64_round, "__builtin_ia32_cvtusi2ss64", > IX86_BUILTIN_CVTUSI2SS64, UNKNOWN, (int) V4SF_FTYPE_V4SF_UINT64_INT) > diff --git a/gcc/testsuite/gcc.target/i386/pr117304-1.c > b/gcc/testsuite/gcc.target/i386/pr117304-1.c > new file mode 100644 > index 000..68419338524 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr117304-1.c > @@ -0,0 +1,28 @@ > +/* PR target/117304 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx10.1 -mno-evex512" } */ Please use -mavx512f -mno-evex512 to avoid warning when gcc is configured --with-arch=native on avx512 machine. Otherwise LGTM. > + > +typedef __attribute__((__vecto
Re: [PATCH] [x86_64] Add flag to control tight loops alignment opt
On Tue, Nov 5, 2024 at 5:50 PM Mayshao-oc wrote: > > > > > > > > On Tue, Nov 5, 2024 at 2:34 PM Liu, Hongtao wrote: > > > > > > > > > > > > > -Original Message- > > > > From: MayShao-oc > > > > Sent: Tuesday, November 5, 2024 11:20 AM > > > > To: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; Liu, Hongtao > > > > ; ubiz...@gmail.com > > > > Cc: ti...@zhaoxin.com; silviaz...@zhaoxin.com; loui...@zhaoxin.com; > > > > cobec...@zhaoxin.com > > > > Subject: [PATCH] [x86_64] Add flag to control tight loops alignment opt > > > > > > > > Hi all: > > > > This patch add -malign-tight-loops flag to control > > > > pass_align_tight_loops. > > > > The motivation is that pass_align_tight_loops may cause performance > > > > regression in nested loops. > > > > > > > > The example code as follows: > > > > > > > > #define ITER 2 > > > > #define ITER_O 10 > > > > > > > > int i, j,k; > > > > int array[ITER]; > > > > > > > > void loop() > > > > { > > > > int i; > > > > for(k = 0; k < ITER_O; k++) > > > > for(j = 0; j < ITER; j++) > > > > for(i = 0; i < ITER; i++) > > > > { > > > > array[i] += j; > > > > array[i] += i; > > > > array[i] += 2*j; > > > > array[i] += 2*i; > > > > } > > > > } > > > > > > > > When I compile it with gcc -O1 loop.c, the output assembly as > > > > follows. > > > > It is not optimal, because of too many nops insert in the outer loop. 
> > > > > > > > 00400540 : > > > > 400540: 48 83 ec 08 sub$0x8,%rsp > > > > 400544: bf 0a 00 00 00 mov$0xa,%edi > > > > 400549: b9 00 00 00 00 mov$0x0,%ecx > > > > 40054e: 8d 34 09lea(%rcx,%rcx,1),%esi > > > > 400551: b8 00 00 00 00 mov$0x0,%eax > > > > 400556: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > > > 40055d: 00 00 00 00 > > > > 400561: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > > > 400568: 00 00 00 00 > > > > 40056c: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > > > 400573: 00 00 00 00 > > > > 400577: 66 0f 1f 84 00 00 00nopw 0x0(%rax,%rax,1) > > > > 40057e: 00 00 > > > > 400580: 89 ca mov%ecx,%edx > > > > 400582: 03 14 85 60 10 60 00add0x601060(,%rax,4),%edx > > > > 400589: 01 c2 add%eax,%edx > > > > 40058b: 01 f2 add%esi,%edx > > > > 40058d: 8d 14 42lea(%rdx,%rax,2),%edx > > > > 400590: 89 14 85 60 10 60 00mov%edx,0x601060(,%rax,4) > > > > 400597: 48 83 c0 01 add$0x1,%rax > > > > 40059b: 48 3d 20 4e 00 00 cmp$0x4e20,%rax > > > > 4005a1: 75 dd jne400580 > > > > > > > >I benchmark this program in the intel Xeon, and find the > > > > optimization may > > > > cause a 40% performance regression (6.6B cycles VS 9.3B cycles). > > On SPR, align is 25% better than no_align case. > > I found no_align is 10% better in zhaoxin yongfeng, so I test this > program in Xeon 4210R, and found a > 40% performance regression.So I think this maybe a general regression, and > need a flag to control.As you say, > On SPR, align is better,so its not a general regression. > Could you please benchmark this in Xeon 4210R, or a similar arch? > I am not delve into intel 4210R arch, and I think a 40% drop is not > explainable. Maybe I make a mistakes. > I could confirm in zhaoxin yongfeng, no_align is 10% better. > I attach the Xeon 4210R cpuinfo, and the test binary for your > reference. Thanks. I reproduce with 30% regression on CLX, there's more frontend-bound with aligned case, it's uarch specific, will make it a uarch tune. 
> > > > > >So I propose to add -malign-tight-loops flag to control tight loop > > > > optimization to avoid this, we could disalbe this optimization by > > > > default. > > > >Bootstrapped X86_64. > > > >Ok for trunk? > > > > > > > > BR > > > > Mayshao > > > > > > > > gcc/ChangeLog: > > > > > > > > * config/i386/i386-features.cc (ix86_align_tight_loops): New flag. > > > > * config/i386/i386.opt (malign-tight-loops): New option. > > > > * doc/invoke.texi (-malign-tight-loops): Document. > > > > --- > > > > gcc/config/i386/i386-features.cc | 4 +++- > > > > gcc/config/i386/i386.opt | 4 > > > > gcc/doc/invoke.texi | 7 ++- > > > > 3 files changed, 13 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386- > > > > features.cc > > > > index e2e85212a4f..f9546e00b07 100644 > > > > --- a/gcc/config/i386/i386-features.cc > > > > +++ b/gcc/config/i386/i386-features.cc > > > > @@ -3620,7 +3620,9 @@ public: > > > >/* opt_pass methods: */ > > > >bool gate (funct
Re: [PATCH] gcc.target/i386/apx-ndd.c: Also scan (%edi)
On Wed, Nov 6, 2024 at 8:19 AM H.J. Lu wrote: > > Since x32 uses (%edi), instead of (%rdi), also scan (%edi). > > * gcc.target/i386/apx-ndd.c: Also scan (%edi). Ok. > > -- > H.J. -- BR, Hongtao
Re: [PATCH] Intel MOVRS tests: Also scan (%e.x)
On Wed, Nov 6, 2024 at 8:21 AM H.J. Lu wrote: > > Since x32 uses (%reg32), instead of (%r.x), also scan (%e.x). > > * gcc.target/i386/avx10_2-512-movrs-1.c: Also scan (%e.x). > * gcc.target/i386/avx10_2-movrs-1.c: Likewise. > * gcc.target/i386/movrs-1.c: Likewise. Ok. > > -- > H.J. -- BR, Hongtao
Re: [PATCH] [x86_64] Add flag to control tight loops alignment opt
On Tue, Nov 5, 2024 at 5:33 PM Richard Biener wrote: > > On Tue, Nov 5, 2024 at 8:12 AM Hongtao Liu wrote: > > > > On Tue, Nov 5, 2024 at 2:34 PM Liu, Hongtao wrote: > > > > > > > > > > > > > -Original Message- > > > > From: MayShao-oc > > > > Sent: Tuesday, November 5, 2024 11:20 AM > > > > To: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; Liu, Hongtao > > > > ; ubiz...@gmail.com > > > > Cc: ti...@zhaoxin.com; silviaz...@zhaoxin.com; loui...@zhaoxin.com; > > > > cobec...@zhaoxin.com > > > > Subject: [PATCH] [x86_64] Add flag to control tight loops alignment opt > > > > > > > > Hi all: > > > > This patch add -malign-tight-loops flag to control > > > > pass_align_tight_loops. > > > > The motivation is that pass_align_tight_loops may cause performance > > > > regression in nested loops. > > > > > > > > The example code as follows: > > > > > > > > #define ITER 2 > > > > #define ITER_O 10 > > > > > > > > int i, j,k; > > > > int array[ITER]; > > > > > > > > void loop() > > > > { > > > > int i; > > > > for(k = 0; k < ITER_O; k++) > > > > for(j = 0; j < ITER; j++) > > > > for(i = 0; i < ITER; i++) > > > > { > > > > array[i] += j; > > > > array[i] += i; > > > > array[i] += 2*j; > > > > array[i] += 2*i; > > > > } > > > > } > > > > > > > > When I compile it with gcc -O1 loop.c, the output assembly as > > > > follows. > > > > It is not optimal, because of too many nops insert in the outer loop. 
> > > > > > > > 00400540 : > > > > 400540: 48 83 ec 08 sub$0x8,%rsp > > > > 400544: bf 0a 00 00 00 mov$0xa,%edi > > > > 400549: b9 00 00 00 00 mov$0x0,%ecx > > > > 40054e: 8d 34 09lea(%rcx,%rcx,1),%esi > > > > 400551: b8 00 00 00 00 mov$0x0,%eax > > > > 400556: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > > > 40055d: 00 00 00 00 > > > > 400561: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > > > 400568: 00 00 00 00 > > > > 40056c: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > > > 400573: 00 00 00 00 > > > > 400577: 66 0f 1f 84 00 00 00nopw 0x0(%rax,%rax,1) > > > > 40057e: 00 00 > > > > 400580: 89 ca mov%ecx,%edx > > > > 400582: 03 14 85 60 10 60 00add0x601060(,%rax,4),%edx > > > > 400589: 01 c2 add%eax,%edx > > > > 40058b: 01 f2 add%esi,%edx > > > > 40058d: 8d 14 42lea(%rdx,%rax,2),%edx > > > > 400590: 89 14 85 60 10 60 00mov%edx,0x601060(,%rax,4) > > > > 400597: 48 83 c0 01 add$0x1,%rax > > > > 40059b: 48 3d 20 4e 00 00 cmp$0x4e20,%rax > > > > 4005a1: 75 dd jne400580 > > > > > > > >I benchmark this program in the intel Xeon, and find the > > > > optimization may > > > > cause a 40% performance regression (6.6B cycles VS 9.3B cycles). > > On SPR, align is 25% better than no_align case. > > that would ask for a tunable rather than a new flag then? Good idea. > > Not knowing much about the pass and how it affects -falign-loops=N - is that > flag still honored when the pass is switched off? The pass will rewrite -falign-loops=N whenever the loop size is smaller than cache line. Otherwise -falign-loops=N is still honored. > > > > > > >So I propose to add -malign-tight-loops flag to control tight loop > > > > optimization to avoid this, we could disalbe this optimization by > > > > default. > > > >Bootstrapped X86_64. > > > >Ok for trunk? > > > > > > > > BR > > > > Mayshao > > > > > > > > gcc/ChangeLog: > > > > > > > > * config/i386/i386-features.cc (ix86_align_tight_loops): New flag. > > > > * config/i386/i386.opt (malign-tight-loop
Re: [PATCH 1/2] [x86] Support vector float_truncate for SF to BF.
On Tue, Nov 5, 2024 at 4:46 PM Jakub Jelinek wrote: > > On Tue, Oct 29, 2024 at 07:19:38PM -0700, liuhongt wrote: > > Generate native instruction whenever possible, otherwise use vector > > permutation with odd indices. > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > Ready push to trunk. > > > > gcc/ChangeLog: > > > > * config/i386/i386-expand.cc > > (ix86_expand_vector_sf2bf_with_vec_perm): New function. > > * config/i386/i386-protos.h > > (ix86_expand_vector_sf2bf_with_vec_perm): New declare. > > * config/i386/mmx.md (truncv2sfv2bf2): New expander. > > * config/i386/sse.md (truncv4sfv4bf2): Ditto. > > (truncv8sfv8bf2): Ditto. > > (truncv16sfv16bf2): Ditto. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/avx512bf16-truncsfbf.c: New test. > > * gcc.target/i386/avx512bw-truncsfbf.c: New test. > > * gcc.target/i386/ssse3-truncsfbf.c: New test. > > Is that correct for non-ffast-math? > I mean, truncation from SF to BFmode e.g. when honoring NaNs definitely > isn't a simple permutation. > A SFmode sNaN which has non-zero bits in the mantissa only in the lower > 16-bits would be silently turned into +-Inf rather than raise exception > and turn it into a qNaN. > Similarly, the result when not using -ffast-math needs to be correctly > rounded (according to the current rounding mode, at least with > -frounding-math, otherwise at least for round to even), permutation > definitely doesn't achieve that. Yes, there's a mismatch between scalar and vector code, I assume users may not care much about precision/NAN/INF/denormal behaviors for vector code. Just like we support #define RECIP_MASK_DEFAULT (RECIP_MASK_VEC_DIV | RECIP_MASK_VEC_SQRT) but turn off RECIP_MASK_DIV | RECIP_MASK_SQRT. > > I don't know what exactly the hw instructions do, whether they perform > everything needed properly or just subset of it or none of it, Subset of it, hw instruction doesn't raise exceptions and always round to nearest (even). 
Output denormals are always flushed to zero and input denormals are always treated as zero. MXCSR is not consulted nor updated. > but the permutation fallback IMHO definitely needs to be guarded with > the same flags as scalar code. > For HONOR_NANS case or flag_rounding_math, the generic code (see expr.cc) > uses the libgcc fallback. Otherwise, generic code has > /* If we don't expect qNaNs nor sNaNs and can assume rounding > to nearest, we can expand the conversion inline as > (fromi + 0x7fff + ((fromi >> 16) & 1)) >> 16. */ > and the backend has > TARGET_SSE2 && flag_unsafe_math_optimizations && !HONOR_NANS (BFmode) > shift (i.e. just the permutation). > Note, even that (fromi + 0x7fff + ((fromi >> 16) & 1)) >> 16 > is doable in vectors. If you're concerned about that, I'll commit another patch to align the condition of the vector expander with scalar ones for both extendmn2 and truncmn2. > > Jakub > -- BR, Hongtao
Re: [PATCH] [x86_64] Add flag to control tight loops alignment opt
On Tue, Nov 5, 2024 at 2:34 PM Liu, Hongtao wrote: > > > > > -Original Message- > > From: MayShao-oc > > Sent: Tuesday, November 5, 2024 11:20 AM > > To: gcc-patches@gcc.gnu.org; hubi...@ucw.cz; Liu, Hongtao > > ; ubiz...@gmail.com > > Cc: ti...@zhaoxin.com; silviaz...@zhaoxin.com; loui...@zhaoxin.com; > > cobec...@zhaoxin.com > > Subject: [PATCH] [x86_64] Add flag to control tight loops alignment opt > > > > Hi all: > > This patch adds a -malign-tight-loops flag to control > > pass_align_tight_loops. > > The motivation is that pass_align_tight_loops may cause performance > > regression in nested loops. > > > > The example code is as follows: > > > > #define ITER 2 > > #define ITER_O 10 > > > > int i, j,k; > > int array[ITER]; > > > > void loop() > > { > > int i; > > for(k = 0; k < ITER_O; k++) > > for(j = 0; j < ITER; j++) > > for(i = 0; i < ITER; i++) > > { > > array[i] += j; > > array[i] += i; > > array[i] += 2*j; > > array[i] += 2*i; > > } > > } > > > > When I compile it with gcc -O1 loop.c, the output assembly is as follows. > > It is not optimal, because of too many nops inserted in the outer loop. 
> > > > 00400540 : > > 400540: 48 83 ec 08 sub$0x8,%rsp > > 400544: bf 0a 00 00 00 mov$0xa,%edi > > 400549: b9 00 00 00 00 mov$0x0,%ecx > > 40054e: 8d 34 09lea(%rcx,%rcx,1),%esi > > 400551: b8 00 00 00 00 mov$0x0,%eax > > 400556: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > 40055d: 00 00 00 00 > > 400561: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > 400568: 00 00 00 00 > > 40056c: 66 66 2e 0f 1f 84 00data16 nopw %cs:0x0(%rax,%rax,1) > > 400573: 00 00 00 00 > > 400577: 66 0f 1f 84 00 00 00nopw 0x0(%rax,%rax,1) > > 40057e: 00 00 > > 400580: 89 ca mov%ecx,%edx > > 400582: 03 14 85 60 10 60 00add0x601060(,%rax,4),%edx > > 400589: 01 c2 add%eax,%edx > > 40058b: 01 f2 add%esi,%edx > > 40058d: 8d 14 42lea(%rdx,%rax,2),%edx > > 400590: 89 14 85 60 10 60 00mov%edx,0x601060(,%rax,4) > > 400597: 48 83 c0 01 add$0x1,%rax > > 40059b: 48 3d 20 4e 00 00 cmp$0x4e20,%rax > > 4005a1: 75 dd jne400580 > > > >I benchmarked this program on an Intel Xeon, and found the optimization may > > cause a 40% performance regression (6.6B cycles VS 9.3B cycles). On SPR, align is 25% better than the no_align case. > >So I propose to add a -malign-tight-loops flag to control the tight loop > > optimization to avoid this; we could disable this optimization by default. > >Bootstrapped X86_64. > >Ok for trunk? > > > > BR > > Mayshao > > > > gcc/ChangeLog: > > > > * config/i386/i386-features.cc (ix86_align_tight_loops): New flag. > > * config/i386/i386.opt (malign-tight-loops): New option. > > * doc/invoke.texi (-malign-tight-loops): Document. 
> > --- > > gcc/config/i386/i386-features.cc | 4 +++- > > gcc/config/i386/i386.opt | 4 > > gcc/doc/invoke.texi | 7 ++- > > 3 files changed, 13 insertions(+), 2 deletions(-) > > > > diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386- > > features.cc > > index e2e85212a4f..f9546e00b07 100644 > > --- a/gcc/config/i386/i386-features.cc > > +++ b/gcc/config/i386/i386-features.cc > > @@ -3620,7 +3620,9 @@ public: > >/* opt_pass methods: */ > >bool gate (function *) final override > > { > > - return optimize && optimize_function_for_speed_p (cfun); > > + return ix86_align_tight_loops > > +&& optimize > > +&& optimize_function_for_speed_p (cfun); > > } > > > >unsigned int execute (function *) final override diff --git > > a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt index > > 64c295d344c..ec41de192bc 100644 > > --- a/gcc/config/i386/i386.opt > > +++ b/gcc/config/i386/i386.opt > > @@ -1266,6 +1266,10 @@ mlam= > > Target RejectNegative Joined Enum(lam_type) Var(ix86_lam_type) > > Init(lam_none) -mlam=[none|u48|u57] Instrument meta data position in > > user data pointers. > > > > +malign-tight-loops > > +Target Var(ix86_align_tight_loops) Init(0) Optimization Enable align > > +tight loops. > > I'd like it to be on by default, so Init (1)? > > > + > > Enum > > Name(lam_type) Type(enum lam_type) UnknownError(unknown lam > > type %qs) > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index > > 07920e07b4d..9ec1e1f0095 100644 > > --- a/gcc/doc/invoke.texi > > +++ b/gcc/doc/invoke.texi > > @@ -1510,7 +1510,7 @@ See RS/6000 and PowerPC Options. > > -mindirect-branch=@var{choice} -mfunction-return=@var{choice} - > > mindirect-branch-register -mharden-sls=@var
Re: [PATCH v2] i386: Handling exception input of __builtin_ia32_prefetch. [PR117416]
On Tue, Nov 5, 2024 at 2:41 PM Hu, Lin1 wrote: > > > -Original Message- > > From: Hu, Lin1 > > Sent: Tuesday, November 5, 2024 1:34 PM > > To: gcc-patches@gcc.gnu.org > > Cc: Liu, Hongtao ; ubiz...@gmail.com > > Subject: [PATCH v2] i386: Handling exception input of > > __builtin_ia32_prefetch. [PR117416] > > > > Add a handler for op3; the previously stated fail is a random fail not > > related > > to this change. OK for trunk? > > > > The fail mentioned here is gcc.dg/torture/convert-dfp.c, triggered by the test > environment > > Its output is "i386 architecture of input file > `./convert-dfp.ltrans0.ltrans.o' is incompatible with i386:x86-64 output." > > When I tested my patch in another test environment, the fail disappeared; from > the test result and the output, it looks like the fail isn't related to this patch. I think this part of the change is safe. Ok for the commit. > > BRs, > Lin > > > > > op1 should be between 0 and 2, so add an error handler; and op3 should be 0 or > > 1, so raise a warning when op3 is an invalid value. > > > > gcc/ChangeLog: > > > > PR target/117416 > > * config/i386/i386-expand.cc (ix86_expand_builtin): Raise warning > > when > > op1 isn't in range of [0, 2] and set op1 as const0_rtx, and raise > > warning when op3 isn't in range of [0, 1]. > > > > gcc/testsuite/ChangeLog: > > > > PR target/117416 > > * gcc.target/i386/pr117416-1.c: New test. > > * gcc.target/i386/pr117416-2.c: Ditto. 
> > --- > > gcc/config/i386/i386-expand.cc | 11 +++ > > gcc/testsuite/gcc.target/i386/pr117416-1.c | 12 > > gcc/testsuite/gcc.target/i386/pr117416-2.c | 12 > > 3 files changed, 35 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr117416-1.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr117416-2.c > > > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > > index 515334aa5a3..fcd4b3b67b7 100644 > > --- a/gcc/config/i386/i386-expand.cc > > +++ b/gcc/config/i386/i386-expand.cc > > @@ -14194,6 +14194,13 @@ ix86_expand_builtin (tree exp, rtx target, rtx > > subtarget, > > return const0_rtx; > > } > > > > + if (!IN_RANGE (INTVAL (op1), 0, 2)) > > + { > > + warning (0, "invalid second argument to" > > + " %<__builtin_ia32_prefetch%>; using zero"); > > + op1 = const0_rtx; > > + } > > + > > if (INTVAL (op3) == 1) > > { > > if (INTVAL (op2) < 2 || INTVAL (op2) > 3) @@ -14216,6 +14223,10 > > @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget, > > } > > else > > { > > + if (INTVAL (op3) != 0) > > + warning (0, "invalid forth argument to" > > + " %<__builtin_ia32_prefetch%>; using zero"); > > + > > if (!address_operand (op0, VOIDmode)) > > { > > op0 = convert_memory_address (Pmode, op0); diff --git > > a/gcc/testsuite/gcc.target/i386/pr117416-1.c > > b/gcc/testsuite/gcc.target/i386/pr117416-1.c > > new file mode 100644 > > index 000..7062f27e21a > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr117416-1.c > > @@ -0,0 +1,12 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O0" } */ > > + > > +#include > > + > > +void* p; > > + > > +void extern > > +prefetch_test (void) > > +{ > > + __builtin_ia32_prefetch (p, 5, 0, 0); /* { dg-warning "invalid second > > +argument to '__builtin_ia32_prefetch'; using zero" } */ } > > diff --git a/gcc/testsuite/gcc.target/i386/pr117416-2.c > > b/gcc/testsuite/gcc.target/i386/pr117416-2.c > > new file mode 100644 > > index 000..1397645cbfc > > --- /dev/null > > +++ 
b/gcc/testsuite/gcc.target/i386/pr117416-2.c > > @@ -0,0 +1,12 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O0" } */ > > + > > +#include > > + > > +void* p; > > + > > +void extern > > +prefetch_test (void) > > +{ > > + __builtin_ia32_prefetch (p, 0, 0, 2); /* { dg-warning "invalid forth > > +argument to '__builtin_ia32_prefetch'; using zero" } */ } > > -- > > 2.31.1 > -- BR, Hongtao
Re: [PATCH] i386: Handling exception input of __builtin_ia32_prefetch. [PR117416]
On Tue, Nov 5, 2024 at 10:52 AM Hu, Lin1 wrote: > > Hi, all > > __builtin_ia32_prefetch's op1 should be between 0 and 2. So add an error > handler. > > Bootstrapped and regtested on x86_64-pc-linux-gnu; there is an unrelated FAIL > whose root cause has yet to be found, so just sending the patch for review. > > BRs, > Lin > > gcc/ChangeLog: > > PR target/117416 > * config/i386/i386-expand.cc (ix86_expand_builtin): Raise warning when > op1 isn't in range of (0, 2) and set op1 as const0_rtx; > > gcc/testsuite/ChangeLog: > > PR target/117416 > * gcc.target/i386/pr117416-1.c: New test. > --- > gcc/config/i386/i386-expand.cc | 7 +++ > gcc/testsuite/gcc.target/i386/pr117416-1.c | 12 > 2 files changed, 19 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/i386/pr117416-1.c > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > index 515334aa5a3..5dab5859463 100644 > --- a/gcc/config/i386/i386-expand.cc > +++ b/gcc/config/i386/i386-expand.cc > @@ -14194,6 +14194,13 @@ ix86_expand_builtin (tree exp, rtx target, rtx > subtarget, > return const0_rtx; > } > > + if (!IN_RANGE (INTVAL (op1), 0, 2)) > + { > + warning (0, "invalid second argument to" > +" %<__builtin_ia32_prefetch%>; using zero"); > + op1 = const0_rtx; > + } > + op3 should be handled similarly; 1 indicates instruction prefetch, 0 data prefetch. > if (INTVAL (op3) == 1) > { > if (INTVAL (op2) < 2 || INTVAL (op2) > 3) > diff --git a/gcc/testsuite/gcc.target/i386/pr117416-1.c > b/gcc/testsuite/gcc.target/i386/pr117416-1.c > new file mode 100644 > index 000..7062f27e21a > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr117416-1.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O0" } */ > + > +#include > + > +void* p; > + > +void extern > +prefetch_test (void) > +{ > + __builtin_ia32_prefetch (p, 5, 0, 0); /* { dg-warning "invalid second > argument to '__builtin_ia32_prefetch'; using zero" } */ > +} > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH 0/2] Add arch support for Intel CPUs
On Fri, Nov 1, 2024 at 11:24 AM Haochen Jiang wrote: > > Hi all, > > I have just landed new ISA patches on trunk. The next step will > be the arch support for the CPUs mentioned in ISE055. > > There are two changes in ISE055 on CPUs: > > - A new model number is added for Arrow Lake. > - Diamond Rapids Support is added. > > The following two patches will reflect those changes. > > Bootstrapped and tested on x86_64-pc-linux-gnu. Ok for trunk and > ARL patch backport to GCC14? Ok. > > Ref: https://cdrdv2.intel.com/v1/dl/getContent/671368 > > Thx, > Haochen > > -- BR, Hongtao
Re: [PATCH] i386: Utilize VCOMSBF16 for BF16 Comparisons with AVX10.2
On Fri, Nov 1, 2024 at 8:33 AM Hongyu Wang wrote: > > From: Levy Hsu > > This patch enables the use of the VCOMSBF16 instruction from AVX10.2 for > efficient BF16 comparisons. > > Bootstrapped & regtested on x86-64-pc-linux-gnu. > Ok for trunk? Ok. > > gcc/ChangeLog: > > * config/i386/i386-expand.cc (ix86_expand_branch): Handle BFmode > when TARGET_AVX10_2_256 is enabled. > (ix86_prepare_fp_compare_args): Use SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P. > (ix86_expand_fp_movcc): Ditto. > (ix86_expand_fp_compare): Handle BFmode under IX86_FPCMP_COMI. > * config/i386/i386.cc (ix86_multiplication_cost): Use > SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P. > (ix86_division_cost): Ditto. > (ix86_rtx_costs): Ditto. > (ix86_vector_costs::add_stmt_cost): Ditto. > * config/i386/i386.h (SSE_FLOAT_MODE_SSEMATH_OR_HF_P): Rename to ... > (SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P): ...this, and add BFmode. > * config/i386/i386.md (*cmpibf): New define_insn. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10_2-comibf-1.c: New test. > * gcc.target/i386/avx10_2-comibf-2.c: Ditto. 
> --- > gcc/config/i386/i386-expand.cc| 22 ++-- > gcc/config/i386/i386.cc | 22 ++-- > gcc/config/i386/i386.h| 7 +- > gcc/config/i386/i386.md | 33 +++-- > .../gcc.target/i386/avx10_2-comibf-1.c| 40 ++ > .../gcc.target/i386/avx10_2-comibf-2.c| 118 ++ > 6 files changed, 214 insertions(+), 28 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-comibf-1.c > create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-comibf-2.c > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > index 0de0e842731..96e4659da10 100644 > --- a/gcc/config/i386/i386-expand.cc > +++ b/gcc/config/i386/i386-expand.cc > @@ -2531,6 +2531,10 @@ ix86_expand_branch (enum rtx_code code, rtx op0, rtx > op1, rtx label) >emit_jump_insn (gen_rtx_SET (pc_rtx, tmp)); >return; > > +case E_BFmode: > + gcc_assert (TARGET_AVX10_2_256 && !flag_trapping_math); > + goto simple; > + > case E_DImode: >if (TARGET_64BIT) > goto simple; > @@ -2797,9 +2801,9 @@ ix86_prepare_fp_compare_args (enum rtx_code code, rtx > *pop0, rtx *pop1) >bool unordered_compare = ix86_unordered_fp_compare (code); >rtx op0 = *pop0, op1 = *pop1; >machine_mode op_mode = GET_MODE (op0); > - bool is_sse = SSE_FLOAT_MODE_SSEMATH_OR_HF_P (op_mode); > + bool is_sse = SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (op_mode); > > - if (op_mode == BFmode) > + if (op_mode == BFmode && (!TARGET_AVX10_2_256 || flag_trapping_math)) > { >rtx op = gen_lowpart (HImode, op0); >if (CONST_INT_P (op)) > @@ -2918,10 +2922,14 @@ ix86_expand_fp_compare (enum rtx_code code, rtx op0, > rtx op1) > { > case IX86_FPCMP_COMI: >tmp = gen_rtx_COMPARE (CCFPmode, op0, op1); > - if (TARGET_AVX10_2_256 && (code == EQ || code == NE)) > - tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), UNSPEC_OPTCOMX); > - if (unordered_compare) > - tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), UNSPEC_NOTRAP); > + /* We only have vcomsbf16, No vcomubf16 nor vcomxbf16 */ > + if (GET_MODE (op0) != E_BFmode) > + { > + if (TARGET_AVX10_2_256 && (code == EQ 
|| code == NE)) > + tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), > UNSPEC_OPTCOMX); > + if (unordered_compare) > + tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), > UNSPEC_NOTRAP); > + } >cmp_mode = CCFPmode; >emit_insn (gen_rtx_SET (gen_rtx_REG (CCFPmode, FLAGS_REG), tmp)); >break; > @@ -4636,7 +4644,7 @@ ix86_expand_fp_movcc (rtx operands[]) >&& !ix86_fp_comparison_operator (operands[1], VOIDmode)) > return false; > > - if (SSE_FLOAT_MODE_SSEMATH_OR_HF_P (mode)) > + if (SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode)) > { >machine_mode cmode; > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index 473e4cbf10e..6ac3a5d55f2 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -21324,7 +21324,7 @@ ix86_multiplication_cost (const struct > processor_costs *cost, >if (VECTOR_MODE_P (mode)) > inner_mode = GET_MODE_INNER (mode); > > - if (SSE_FLOAT_MODE_SSEMATH_OR_HF_P (mode)) > + if (SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode)) > return inner_mode == DFmode ? cost->mulsd : cost->mulss; >else if (X87_FLOAT_MODE_P (mode)) > return cost->fmul; > @@ -21449,7 +21449,7 @@ ix86_division_cost (const struct processor_costs > *cost, >if (VECTOR_MODE_P (mode)) > inner_mode = GET_MODE_INNER (mode); > > - if (SSE_FLOAT_MODE_SSEMATH_OR_HF_P (mode)) > + if (SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode)) > return inner_mode == DFmode ? cost->divsd : cost->divss; >
Re: [PATCH v3 7/8] i386: Add else operand to masked loads.
On Sat, Nov 2, 2024 at 8:58 PM Robin Dapp wrote: > > From: Robin Dapp > > This patch adds a zero else operand to masked loads, in particular the > masked gather load builtins that are used for gather vectorization. > > gcc/ChangeLog: > > * config/i386/i386-expand.cc (ix86_expand_special_args_builtin): > Add else-operand handling. > (ix86_expand_builtin): Ditto. > * config/i386/predicates.md (vcvtne2ps2bf_parallel): New > predicate. > (maskload_else_operand): Ditto. > * config/i386/sse.md: Use predicate. > --- > gcc/config/i386/i386-expand.cc | 26 ++-- > gcc/config/i386/predicates.md | 4 ++ > gcc/config/i386/sse.md | 112 + > 3 files changed, 97 insertions(+), 45 deletions(-) > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > index 0de0e842731..6c61f9f87c2 100644 > --- a/gcc/config/i386/i386-expand.cc > +++ b/gcc/config/i386/i386-expand.cc > @@ -12995,10 +12995,11 @@ ix86_expand_special_args_builtin (const struct > builtin_description *d, > { >tree arg; >rtx pat, op; > - unsigned int i, nargs, arg_adjust, memory; > + unsigned int i, nargs, arg_adjust, memory = -1; >unsigned int constant = 100; >bool aligned_mem = false; > - rtx xops[4]; > + rtx xops[4] = {}; > + bool add_els = false; >enum insn_code icode = d->icode; >const struct insn_data_d *insn_p = &insn_data[icode]; >machine_mode tmode = insn_p->operand[0].mode; > @@ -13125,6 +13126,9 @@ ix86_expand_special_args_builtin (const struct > builtin_description *d, > case V4DI_FTYPE_PCV4DI_V4DI: > case V4SI_FTYPE_PCV4SI_V4SI: > case V2DI_FTYPE_PCV2DI_V2DI: > + /* Two actual args but an additional else operand. */ > + add_els = true; > + /* Fallthru. 
*/ > case VOID_FTYPE_INT_INT64: >nargs = 2; >klass = load; > @@ -13397,6 +13401,12 @@ ix86_expand_special_args_builtin (const struct > builtin_description *d, >xops[i]= op; > } > > + if (add_els) > +{ > + xops[i] = CONST0_RTX (GET_MODE (xops[0])); > + nargs++; > +} > + >switch (nargs) > { > case 0: > @@ -13653,7 +13663,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx > subtarget, >enum insn_code icode, icode2; >tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0); >tree arg0, arg1, arg2, arg3, arg4; > - rtx op0, op1, op2, op3, op4, pat, pat2, insn; > + rtx op0, op1, op2, op3, op4, opels, pat, pat2, insn; >machine_mode mode0, mode1, mode2, mode3, mode4; >unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl); >HOST_WIDE_INT bisa, bisa2; > @@ -15560,12 +15570,15 @@ rdseed_step: > op3 = copy_to_reg (op3); > op3 = lowpart_subreg (mode3, op3, GET_MODE (op3)); > } > + >if (!insn_data[icode].operand[5].predicate (op4, mode4)) > { > - error ("the last argument must be scale 1, 2, 4, 8"); > - return const0_rtx; > + error ("the last argument must be scale 1, 2, 4, 8"); > + return const0_rtx; > } > > + opels = CONST0_RTX (GET_MODE (subtarget)); > + >/* Optimize. If mask is known to have all high bits set, > replace op0 with pc_rtx to signal that the instruction > overwrites the whole destination and doesn't use its > @@ -15634,7 +15647,8 @@ rdseed_step: > } > } > > - pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4); > + pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4, opels); > + >if (! 
pat) > return const0_rtx; >emit_insn (pat); > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md > index 053312bbe27..7c7d8f61f11 100644 > --- a/gcc/config/i386/predicates.md > +++ b/gcc/config/i386/predicates.md > @@ -2346,3 +2346,7 @@ (define_predicate "apx_evex_add_memory_operand" > >return true; > }) > + > +(define_predicate "maskload_else_operand" > + (and (match_code "const_int,const_vector") > + (match_test "op == CONST0_RTX (GET_MODE (op))"))) > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index 36f8567b66f..41c1badbc00 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -28632,7 +28632,7 @@ (define_insn > "_maskstore" > (set_attr "btver2_decode" "vector") > (set_attr "mode" "")]) > > -(define_expand "maskload" > +(define_expand "maskload_1" >[(set (match_operand:V48_128_256 0 "register_operand") > (unspec:V48_128_256 > [(match_operand: 2 "register_operand") > @@ -28640,13 +28640,28 @@ (define_expand "maskload" > UNSPEC_MASKMOV))] >"TARGET_AVX") > > +(define_expand "maskload" > + [(set (match_operand:V48_128_256 0 "register_operand") > + (unspec:V48_128_256 > + [(match_operand: 2 "register_operand") > + (match_operand:V48_128_256 1 "memory_operand") > + (match_operand:V48_128_256 3 "const0_operand"
Re: [PATCH] [APX PPX] Avoid generating unmatched pushp/popp in pro/epilogue
On Thu, Jul 4, 2024 at 11:00 AM Hongtao Liu wrote: > > On Tue, Jul 2, 2024 at 11:24 AM Hongyu Wang wrote: > > > > Hi, > > > > According to the APX spec, the pushp/popp pairs should be matched, > > otherwise the PPX hint cannot take effect, causing performance loss. > > > > In ix86_expand_epilogue, there are several optimizations that may > > cause the epilogue to use mov to restore the regs. Check if PPX is applied > > and prevent usage of mov/leave in the epilogue. > > > > Bootstrapped/regtested on x86_64-pc-linux-gnu. > > > > Ok for trunk? > Ok. Please backport the fix to the GCC14 branch. > > > > gcc/ChangeLog: > > > > * config/i386/i386.cc (ix86_expand_prologue): Set apx_ppx_used > > flag in m.fs with TARGET_APX_PPX && !crtl->calls_eh_return. > > (ix86_emit_save_regs): Emit ppx is available only when > > TARGET_APX_PPX && !crtl->calls_eh_return. > > (ix86_expand_epilogue): Don't restore reg using mov when > > apx_ppx_used flag is true. > > * config/i386/i386.h (struct machine_frame_state): > > Add apx_ppx_used flag. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/apx-ppx-2.c: New test. > > * gcc.target/i386/apx-ppx-3.c: Likewise. 
> > --- > > gcc/config/i386/i386.cc | 13 + > > gcc/config/i386/i386.h| 4 > > gcc/testsuite/gcc.target/i386/apx-ppx-2.c | 14 ++ > > gcc/testsuite/gcc.target/i386/apx-ppx-3.c | 7 +++ > > 4 files changed, 34 insertions(+), 4 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/apx-ppx-2.c > > create mode 100644 gcc/testsuite/gcc.target/i386/apx-ppx-3.c > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index bd7411190af..99def8d4a77 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -7429,6 +7429,7 @@ ix86_emit_save_regs (void) > > { > >int regno; > >rtx_insn *insn; > > + bool use_ppx = TARGET_APX_PPX && !crtl->calls_eh_return; > > > >if (!TARGET_APX_PUSH2POP2 > >|| !ix86_can_use_push2pop2 () > > @@ -7438,7 +7439,7 @@ ix86_emit_save_regs (void) > > if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true)) > > { > > insn = emit_insn (gen_push (gen_rtx_REG (word_mode, regno), > > - TARGET_APX_PPX)); > > + use_ppx)); > > RTX_FRAME_RELATED_P (insn) = 1; > > } > > } > > @@ -7469,7 +7470,7 @@ ix86_emit_save_regs (void) > > > > regno_list[0]), > > gen_rtx_REG (word_mode, > > > > regno_list[1]), > > -TARGET_APX_PPX)); > > +use_ppx)); > > RTX_FRAME_RELATED_P (insn) = 1; > > rtx dwarf = gen_rtx_SEQUENCE (VOIDmode, rtvec_alloc > > (3)); > > > > @@ -7502,7 +7503,7 @@ ix86_emit_save_regs (void) > > else > > { > > insn = emit_insn (gen_push (gen_rtx_REG (word_mode, regno), > > - TARGET_APX_PPX)); > > + use_ppx)); > > RTX_FRAME_RELATED_P (insn) = 1; > > aligned = true; > > } > > @@ -7511,7 +7512,7 @@ ix86_emit_save_regs (void) > > { > > insn = emit_insn (gen_push (gen_rtx_REG (word_mode, > >regno_list[0]), > > - TARGET_APX_PPX)); > > + use_ppx)); > > RTX_FRAME_RELATED_P (insn) = 1; > > } > > } > > @@ -8985,6 +8986,7 @@ ix86_expand_prologue (void) > >if (!frame.save_regs_using_mov) > > { > > ix86_emit_save_regs (); > > + m->fs.apx_ppx_used = TARGET_APX_PPX && !crtl->calls_eh_return; > > int_registers_saved = 
true; > > gcc_assert (m->fs.sp_offset == frame.reg_save_offset); > > } > > @@ -9870,6 +9872,9 @@ ix86_expand_epilogue (int style) > >
Re: [PATCH v2 7/8] i386: Add else operand to masked loads.
On Fri, Oct 18, 2024 at 10:23 PM Robin Dapp wrote: > > This patch adds a zero else operand to masked loads, in particular the > masked gather load builtins that are used for gather vectorization. > > gcc/ChangeLog: > > * config/i386/i386-expand.cc (ix86_expand_special_args_builtin): > Add else-operand handling. > (ix86_expand_builtin): Ditto. > * config/i386/predicates.md (vcvtne2ps2bf_parallel): New > predicate. > (maskload_else_operand): Ditto. > * config/i386/sse.md: Use predicate. > --- > gcc/config/i386/i386-expand.cc | 26 +-- > gcc/config/i386/predicates.md | 4 ++ > gcc/config/i386/sse.md | 124 - > 3 files changed, 101 insertions(+), 53 deletions(-) > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > index 63f5e348d64..f6a2c2d65b8 100644 > --- a/gcc/config/i386/i386-expand.cc > +++ b/gcc/config/i386/i386-expand.cc > @@ -12994,10 +12994,11 @@ ix86_expand_special_args_builtin (const struct > builtin_description *d, > { >tree arg; >rtx pat, op; > - unsigned int i, nargs, arg_adjust, memory; > + unsigned int i, nargs, arg_adjust, memory = -1; >unsigned int constant = 100; >bool aligned_mem = false; > - rtx xops[4]; > + rtx xops[4] = {}; > + bool add_els = false; >enum insn_code icode = d->icode; >const struct insn_data_d *insn_p = &insn_data[icode]; >machine_mode tmode = insn_p->operand[0].mode; > @@ -13124,6 +13125,9 @@ ix86_expand_special_args_builtin (const struct > builtin_description *d, > case V4DI_FTYPE_PCV4DI_V4DI: > case V4SI_FTYPE_PCV4SI_V4SI: > case V2DI_FTYPE_PCV2DI_V2DI: > + /* Two actual args but an additional else operand. */ > + add_els = true; > + /* Fallthru. 
*/ > case VOID_FTYPE_INT_INT64: >nargs = 2; >klass = load; > @@ -13396,6 +13400,12 @@ ix86_expand_special_args_builtin (const struct > builtin_description *d, >xops[i]= op; > } > > + if (add_els) > +{ > + xops[i] = CONST0_RTX (GET_MODE (xops[0])); > + nargs++; > +} > + >switch (nargs) > { > case 0: > @@ -13652,7 +13662,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx > subtarget, >enum insn_code icode, icode2; >tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0); >tree arg0, arg1, arg2, arg3, arg4; > - rtx op0, op1, op2, op3, op4, pat, pat2, insn; > + rtx op0, op1, op2, op3, op4, opels, pat, pat2, insn; >machine_mode mode0, mode1, mode2, mode3, mode4; >unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl); >HOST_WIDE_INT bisa, bisa2; > @@ -15559,12 +15569,15 @@ rdseed_step: > op3 = copy_to_reg (op3); > op3 = lowpart_subreg (mode3, op3, GET_MODE (op3)); > } > + >if (!insn_data[icode].operand[5].predicate (op4, mode4)) > { > - error ("the last argument must be scale 1, 2, 4, 8"); > - return const0_rtx; > + error ("the last argument must be scale 1, 2, 4, 8"); > + return const0_rtx; > } > > + opels = CONST0_RTX (GET_MODE (subtarget)); > + >/* Optimize. If mask is known to have all high bits set, > replace op0 with pc_rtx to signal that the instruction > overwrites the whole destination and doesn't use its > @@ -15633,7 +15646,8 @@ rdseed_step: > } > } > > - pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4); > + pat = GEN_FCN (icode) (subtarget, op0, op1, op2, op3, op4, opels); > + >if (! 
pat) > return const0_rtx; >emit_insn (pat); > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md > index 053312bbe27..7c7d8f61f11 100644 > --- a/gcc/config/i386/predicates.md > +++ b/gcc/config/i386/predicates.md > @@ -2346,3 +2346,7 @@ (define_predicate "apx_evex_add_memory_operand" > >return true; > }) > + > +(define_predicate "maskload_else_operand" > + (and (match_code "const_int,const_vector") > + (match_test "op == CONST0_RTX (GET_MODE (op))"))) > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index a45b50ad732..83955eee5a0 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -1575,7 +1575,8 @@ (define_expand "_load_mask" > } >else if (MEM_P (operands[1])) > operands[1] = gen_rtx_UNSPEC (mode, > -gen_rtvec(1, operands[1]), > +gen_rtvec(2, operands[1], > + CONST0_RTX (mode)), > UNSPEC_MASKLOAD); > }) > > @@ -1583,7 +1584,8 @@ (define_insn "*_load_mask" >[(set (match_operand:V48_AVX512VL 0 "register_operand" "=v") > (vec_merge:V48_AVX512VL > (unspec:V48_AVX512VL > - [(match_operand:V48_AVX512VL 1 "memory_operand" "m")] > + [(match_operand:V48_AVX512VL 1 "memory_operand" "m") > +(match_operand:V48_A
Re: [PATCH] testsuite: Adjust AVX10.2 check_effective_target
On Tue, Oct 29, 2024 at 5:04 PM Haochen Jiang wrote: > > Hi all, > > Since Binutils haven't fully merged all AVX10.2 insts, only testing > one inst/intrin in AVX10.2 is never sufficient for check_effective_target. > Like APX_F, use inline asm to do the target check. > > Tested w/ and w/o Binutils with full AVX10.2 support. Ok for trunk? Ok. > > Thx, > Haochen > > gcc/testsuite/ChangeLog: > > PR target/117301 > * lib/target-supports.exp (check_effective_target_avx10_2): > Use inline asm instead of intrin for check_effective_target. > (check_effective_target_avx10_2_512): Ditto. > --- > gcc/testsuite/lib/target-supports.exp | 34 +++ > 1 file changed, 14 insertions(+), 20 deletions(-) > > diff --git a/gcc/testsuite/lib/target-supports.exp > b/gcc/testsuite/lib/target-supports.exp > index 70f74d1e288..9c65fd0fd7b 100644 > --- a/gcc/testsuite/lib/target-supports.exp > +++ b/gcc/testsuite/lib/target-supports.exp > @@ -10748,17 +10748,14 @@ proc check_effective_target_apxf { } { > # Return 1 if avx10.2 instructions can be compiled. > proc check_effective_target_avx10_2 { } { > return [check_no_compiler_messages avx10.2 object { > - typedef int __v8si __attribute__ ((__vector_size__ (32))); > - typedef char __mmask8; > - > - __v8si > - _mm256_mask_vpdpbssd_epi32 (__v8si __A, __mmask8 __U, > - __v8si __B, __v8si __C) > + void > + foo () > { > - return (__v8si) __builtin_ia32_vpdpbssd_v8si_mask ((__v8si)__A, > -(__v8si)__B, > -(__v8si)__C, > -(__mmask8)__U); > + __asm__ volatile ("vdpphps\t%ymm4, %ymm5, %ymm6"); > + __asm__ volatile ("vcvthf82ph\t%xmm5, %ymm6"); > + __asm__ volatile ("vaddnepbf16\t%ymm4, %ymm5, %ymm6"); > + __asm__ volatile ("vcvtph2ibs\t%ymm5, %ymm6"); > + __asm__ volatile ("vminmaxpd\t$123, %ymm4, %ymm5, %ymm6"); > } > } "-mavx10.2" ] > } > @@ -10766,17 +10763,14 @@ proc check_effective_target_avx10_2 { } { > # Return 1 if avx10.2-512 instructions can be compiled. 
> proc check_effective_target_avx10_2_512 { } { > return [check_no_compiler_messages avx10.2-512 object { > - typedef int __v16si __attribute__ ((__vector_size__ (64))); > - typedef short __mmask16; > - > - __v16si > - _mm512_vpdpbssd_epi32 (__v16si __A, __mmask16 __U, > - __v16si __B, __v16si __C) > + void > + foo () > { > - return (__v16si) __builtin_ia32_vpdpbssd_v16si_mask ((__v16si)__A, > - (__v16si)__B, > - (__v16si)__C, > - > (__mmask16)__U); > + __asm__ volatile ("vdpphps\t%zmm4, %zmm5, %zmm6"); > + __asm__ volatile ("vcvthf82ph\t%ymm5, %zmm6"); > + __asm__ volatile ("vaddnepbf16\t%zmm4, %zmm5, %zmm6"); > + __asm__ volatile ("vcvtph2ibs\t%zmm5, %zmm6"); > + __asm__ volatile ("vminmaxpd\t$123, %zmm4, %zmm5, %zmm6"); > } > } "-mavx10.2-512" ] > } > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH 0/7] Support Intel Diamond Rapid new features
On Tue, Oct 22, 2024 at 2:31 PM Haochen Jiang wrote: > > Hi all, > > ISE054 has just been released and you can find doc from here: > > https://cdrdv2.intel.com/v1/dl/getContent/671368 > > Diamond Rapids features are added in this ISE, including AMX > related instructions, SM4 EVEX extension and MOVRS/PREFETCHRST2. > > The following seven patches will add all the new features into GCC. > > After these patches, we will add Diamond Rapids arch option to > GCC15. > > Bootstrapped and tested on x86_64-pc-linux-gnu. Ok for trunk? Ok. > > Thx, > Haochen > > -- BR, Hongtao
Re: [PATCH] target: Fix asm codegen for vfpclasss* and vcvtph2* instructions
On Fri, Oct 25, 2024 at 12:19 AM Antoni Boucher wrote: > > Thanks. > Did you review the new patch? > Can I push it to master? Ok. > > On 2024-10-20 at 22:01, Hongtao Liu wrote: > > On Sat, Oct 19, 2024 at 2:06 AM Antoni Boucher wrote: > >> > >> Thanks for the review. > >> Here's the updated patch. > >> > >> On 2024-10-17 at 21:50, Hongtao Liu wrote: > >>> On Fri, Oct 18, 2024 at 9:08 AM Antoni Boucher wrote: > >>>> > >>>> Hi. > >>>> This is a patch for the bug 116725. > >>>> I'm not sure if it is a good fix, but it seems to do the job. > >>>> If you have suggestions for better comments than what I wrote that would > >>>> explain what's happening, I'm open to suggestions. > >>> > >>>> @@ -7548,7 +7548,8 @@ (define_insn > >>>> "avx512fp16_vcvtph2_< > >>>> [(match_operand: 1 "" > >>>> "")] > >>>> UNSPEC_US_FIX_NOTRUNC))] > >>>> "TARGET_AVX512FP16 && " > >>>> - > >>>> "vcvtph2\t{%1, > >>>> %0|%0, %1}" > >>>> +;; %X1 so that we don't emit any *WORD PTR for -masm=intel. > >>>> + > >>>> "vcvtph2\t{%1, > >>>> %0|%0, %X1}" > >>> Could you define something like > >>> > >>>;; Pointer size override for 16-bit upper-convert modes (Intel asm > >>> dialect) > >>>(define_mode_attr iptrh > >>> [(V32HI "") (V16SI "") (V8DI "") > >>> (V16HI "") (V8SI "") (V4DI "q") > >>> (V8HI "") (V4SI "q") (V2DI "k")]) > >> > >> For my own understanding, was my usage of %X equivalent to a mode_attr > >> with an empty string for all cases? > >> How did you know which one needed an empty string? > > > > It's in ix86_print_operand > > 14155 else if (MEM_P (x)) > > 14156{ > > 14157 rtx addr = XEXP (x, 0); > > 14158 > > 14159 /* No `byte ptr' prefix for call instructions ... */ > > 14160 if (ASSEMBLER_DIALECT == ASM_INTEL && code != 'X' && code != 'P') > > 14161{ > > 14162 machine_mode mode = GET_MODE (x); > > 14163 const char *size; > > 14164 > > 14165 /* Check for explicit size override codes. 
*/ > > 14166 if (code == 'b') > > 14167size = "BYTE"; > > 14168 else if (code == 'w') > > 14169size = "WORD"; > > 14170 else if (code == 'k') > > 14171size = "DWORD"; > > 14172 else if (code == 'q') > > 14173size = "QWORD"; > > 14174 else if (code == 'x') > > 14175size = "XMMWORD"; > > 14176 else if (code == 't') > > 14177size = "YMMWORD"; > > 14178 else if (code == 'g') > > 14179size = "ZMMWORD"; > > 14180 else if (mode == BLKmode) > > 14181/* ... or BLKmode operands, when not overridden. */ > > 14182size = NULL; > > 14183 else > > 14184switch (GET_MODE_SIZE (mode)) > > 14185 { > > 14186 case 1: size = "BYTE"; break; > > > >> > >>> > >>> And use > >>> + "vcvtph2\t{%1, > >>> %0|%0, %1}" > >>> > >>>> [(set_attr "type" "ssecvt") > >>>> (set_attr "prefix" "evex") > >>>> (set_attr "mode" "")]) > >>>> @@ -29854,7 +29855,8 @@ (define_insn > >>>> "avx512dq_vmfpclass" > >>>>UNSPEC_FPCLASS) > >>>> (const_int 1)))] > >>>> "TARGET_AVX512DQ || VALID_AVX512FP16_REG_MODE(mode)" > >>>> - "vfpclass\t{%2, %1, > >>>> %0|%0, %1, %2}"; > >>>> +;; %X1 so that we don't emit any *WORD PTR for -masm=intel. > >>>> + "vfpclass\t{%2, %1, > >>>> %0|%0, %X1, %2}"; > >>> > >>> For scaar memory operand rewrite, we usually use , so > >>> "vfpclass\t{%2, %1, > >>> %0|%0, > >>> %1, %2}"; > >>> > >>> > >>> > >>> > > > > > > > -- BR, Hongtao
Re: [PATCH] target: Fix asm codegen for vfpclasss* and vcvtph2* instructions
On Sat, Oct 19, 2024 at 2:06 AM Antoni Boucher wrote: > > Thanks for the review. > Here's the updated patch. > > Le 2024-10-17 à 21 h 50, Hongtao Liu a écrit : > > On Fri, Oct 18, 2024 at 9:08 AM Antoni Boucher wrote: > >> > >> Hi. > >> This is a patch for the bug 116725. > >> I'm not sure if it is a good fix, but it seems to do the job. > >> If you have suggestions for better comments than what I wrote that would > >> explain what's happening, I'm open to suggestions. > > > >> @@ -7548,7 +7548,8 @@ (define_insn > >> "avx512fp16_vcvtph2_< > >> [(match_operand: 1 "" > >> "")] > >> UNSPEC_US_FIX_NOTRUNC))] > >>"TARGET_AVX512FP16 && " > >> - "vcvtph2\t{%1, > >> %0|%0, %1}" > >> +;; %X1 so that we don't emit any *WORD PTR for -masm=intel. > >> + "vcvtph2\t{%1, > >> %0|%0, %X1}" > > Could you define something like > > > > ;; Pointer size override for 16-bit upper-convert modes (Intel asm > > dialect) > > (define_mode_attr iptrh > >[(V32HI "") (V16SI "") (V8DI "") > > (V16HI "") (V8SI "") (V4DI "q") > > (V8HI "") (V4SI "q") (V2DI "k")]) > > For my own understanding, was my usage of %X equivalent to a mode_attr > with an empty string for all cases? > How did you know which one needed an empty string? It's in ix86_print_operand 14155 else if (MEM_P (x)) 14156{ 14157 rtx addr = XEXP (x, 0); 14158 14159 /* No `byte ptr' prefix for call instructions ... */ 14160 if (ASSEMBLER_DIALECT == ASM_INTEL && code != 'X' && code != 'P') 14161{ 14162 machine_mode mode = GET_MODE (x); 14163 const char *size; 14164 14165 /* Check for explicit size override codes. */ 14166 if (code == 'b') 14167size = "BYTE"; 14168 else if (code == 'w') 14169size = "WORD"; 14170 else if (code == 'k') 14171size = "DWORD"; 14172 else if (code == 'q') 14173size = "QWORD"; 14174 else if (code == 'x') 14175size = "XMMWORD"; 14176 else if (code == 't') 14177size = "YMMWORD"; 14178 else if (code == 'g') 14179size = "ZMMWORD"; 14180 else if (mode == BLKmode) 14181/* ... or BLKmode operands, when not overridden. 
*/ 14182size = NULL; 14183 else 14184switch (GET_MODE_SIZE (mode)) 14185 { 14186 case 1: size = "BYTE"; break; > > > > > And use > > + "vcvtph2\t{%1, > > %0|%0, %1}" > > > >>[(set_attr "type" "ssecvt") > >> (set_attr "prefix" "evex") > >> (set_attr "mode" "")]) > >> @@ -29854,7 +29855,8 @@ (define_insn > >> "avx512dq_vmfpclass" > >> UNSPEC_FPCLASS) > >> (const_int 1)))] > >> "TARGET_AVX512DQ || VALID_AVX512FP16_REG_MODE(mode)" > >> - "vfpclass\t{%2, %1, > >> %0|%0, %1, %2}"; > >> +;; %X1 so that we don't emit any *WORD PTR for -masm=intel. > >> + "vfpclass\t{%2, %1, > >> %0|%0, %X1, %2}"; > > > > For scaar memory operand rewrite, we usually use , so > > "vfpclass\t{%2, %1, > > %0|%0, > > %1, %2}"; > > > > > > > > -- BR, Hongtao
Re: [PATCH] target: Fix asm codegen for vfpclasss* and vcvtph2* instructions
On Fri, Oct 18, 2024 at 9:08 AM Antoni Boucher wrote: > > Hi. > This is a patch for the bug 116725. > I'm not sure if it is a good fix, but it seems to do the job. > If you have suggestions for better comments than what I wrote that would > explain what's happening, I'm open to suggestions. >@@ -7548,7 +7548,8 @@ (define_insn >"avx512fp16_vcvtph2_< > [(match_operand: 1 "" > "")] > UNSPEC_US_FIX_NOTRUNC))] > "TARGET_AVX512FP16 && " >- "vcvtph2\t{%1, >%0|%0, %1}" >+;; %X1 so that we don't emit any *WORD PTR for -masm=intel. >+ "vcvtph2\t{%1, >%0|%0, %X1}" Could you define something like ;; Pointer size override for 16-bit upper-convert modes (Intel asm dialect) (define_mode_attr iptrh [(V32HI "") (V16SI "") (V8DI "") (V16HI "") (V8SI "") (V4DI "q") (V8HI "") (V4SI "q") (V2DI "k")]) And use + "vcvtph2\t{%1, %0|%0, %1}" > [(set_attr "type" "ssecvt") >(set_attr "prefix" "evex") >(set_attr "mode" "")]) >@@ -29854,7 +29855,8 @@ (define_insn >"avx512dq_vmfpclass" > UNSPEC_FPCLASS) >(const_int 1)))] >"TARGET_AVX512DQ || VALID_AVX512FP16_REG_MODE(mode)" >- "vfpclass\t{%2, %1, >%0|%0, %1, %2}"; >+;; %X1 so that we don't emit any *WORD PTR for -masm=intel. >+ "vfpclass\t{%2, %1, >%0|%0, %X1, %2}"; For scalar memory operand rewrite, we usually use , so "vfpclass\t{%2, %1, %0|%0, %1, %2}"; -- BR, Hongtao
Re: [PATCH] testsuite: Fix typos for AVX10.2 convert testcases
On Thu, Oct 17, 2024 at 3:17 PM Haochen Jiang wrote: > > From: Victor Rodriguez > > Hi all, > > There are some typos in AVX10.2 vcvtne[,2]ph[b,h]f8[,s] testcases. > They will lead to type mismatch. > > Previously they are not found due to the binutils did not checkin. > > Ok for trunk? Ok. > > Thx, > Haochen > > --- > > Fix typos related to types for vcvtne[,2]ph[b,h]f8[,s] testcases. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c: Fix typo. > * gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtne2ph2hf8s-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtneph2bf8-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtneph2bf8s-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtneph2hf8-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtneph2hf8s-2.c: Ditto. > --- > .../gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c | 10 +- > .../gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c | 10 +- > .../gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c | 10 +- > .../gcc.target/i386/avx10_2-512-vcvtne2ph2hf8s-2.c | 10 +- > .../gcc.target/i386/avx10_2-512-vcvtneph2bf8-2.c | 10 +- > .../gcc.target/i386/avx10_2-512-vcvtneph2bf8s-2.c | 10 +- > .../gcc.target/i386/avx10_2-512-vcvtneph2hf8-2.c | 10 +- > .../gcc.target/i386/avx10_2-512-vcvtneph2hf8s-2.c | 10 +- > 8 files changed, 40 insertions(+), 40 deletions(-) > > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c > index 0dd58ee710e..7e7865d64fe 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c > @@ -65,16 +65,16 @@ TEST (void) >CALC(res_ref, src1.a, src2.a); > >res1.x = INTRINSIC (_cvtne2ph_pbf8) (src1.x, src2.x); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) > + if (UNION_CHECK (AVX512F_LEN, i_b) (res1, res_ref)) > abort (); > >res2.x = 
INTRINSIC (_mask_cvtne2ph_pbf8) (res2.x, mask, src1.x, src2.x); > - MASK_MERGE (h) (res_ref, mask, SIZE); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) > + MASK_MERGE (i_b) (res_ref, mask, SIZE); > + if (UNION_CHECK (AVX512F_LEN, i_b) (res2, res_ref)) > abort (); > >res3.x = INTRINSIC (_maskz_cvtne2ph_pbf8) (mask, src1.x, src2.x); > - MASK_ZERO (h) (res_ref, mask, SIZE); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) > + MASK_ZERO (i_b) (res_ref, mask, SIZE); > + if (UNION_CHECK (AVX512F_LEN, i_b) (res3, res_ref)) > abort (); > } > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c > index 5e3ea3e37a4..0ca0c420ff7 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c > @@ -65,16 +65,16 @@ TEST (void) >CALC(res_ref, src1.a, src2.a); > >res1.x = INTRINSIC (_cvtnes2ph_pbf8) (src1.x, src2.x); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) > + if (UNION_CHECK (AVX512F_LEN, i_b) (res1, res_ref)) > abort (); > >res2.x = INTRINSIC (_mask_cvtnes2ph_pbf8) (res2.x, mask, src1.x, src2.x); > - MASK_MERGE (h) (res_ref, mask, SIZE); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) > + MASK_MERGE (i_b) (res_ref, mask, SIZE); > + if (UNION_CHECK (AVX512F_LEN, i_b) (res2, res_ref)) > abort (); > >res3.x = INTRINSIC (_maskz_cvtnes2ph_pbf8) (mask, src1.x, src2.x); > - MASK_ZERO (h) (res_ref, mask, SIZE); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) > + MASK_ZERO (i_b) (res_ref, mask, SIZE); > + if (UNION_CHECK (AVX512F_LEN, i_b) (res3, res_ref)) > abort (); > } > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c > index aa928b582b3..97afd395bb5 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c > +++ 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c > @@ -65,16 +65,16 @@ TEST (void) >CALC(res_ref, src1.a, src2.a); > >res1.x = INTRINSIC (_cvtne2ph_phf8) (src1.x, src2.x); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) > + if (UNION_CHECK (AVX512F_LEN, i_b) (res1, res_ref)) > abort (); > >res2.x = INTRINSIC (_mask_cvtne2ph_phf8) (res2.x, mask, src1.x, src2.x); > - MASK_MERGE (h) (res_ref, mask, SIZE); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) > + MASK_MERGE (i_b) (res_ref, mask, SIZE); > + if (UNION_CHECK (AVX512F_LEN, i_b) (res2, res_ref)) > abort (); > >res3.x = INTRINSIC (_maskz_cvtne2ph_phf8) (mask, src1.x, src2.x); > - MASK_ZERO (h) (res_ref, mask, SIZE); > - if (UNION_CHECK (AVX512F_LEN, i_b) (res, res_ref)) >
Re: [PATCH] [RFC] target/117072 - more RTL FMA canonicalization
On Mon, Oct 14, 2024 at 1:50 PM Richard Biener wrote: > > On Mon, 14 Oct 2024, Hongtao Liu wrote: > > > On Sun, Oct 13, 2024 at 8:02 PM Richard Biener wrote: > > > > > > On Sun, 13 Oct 2024, Hongtao Liu wrote: > > > > > > > On Fri, Oct 11, 2024 at 8:33 PM Hongtao Liu wrote: > > > > > > > > > > On Fri, Oct 11, 2024 at 8:22 PM Richard Biener > > > > > wrote: > > > > > > > > > > > > The following helps the x86 backend by canonicalizing FMAs to have > > > > > > any negation done to one of the commutative multiplication operands > > > > > > be done to a register (and not a memory operand). Likewise to > > > > > > put a register operand first and a memory operand second; > > > > > > swap_commutative_operands_p seems to treat REG_P and MEM_P the > > > > > > same but comments indicate "complex expressiosn should be first". > > > > > > > > > > > > In particular this does (fma MEM REG REG) -> (fma REG MEM REG) and > > > > > > (fma (neg MEM) REG REG) -> (fma (neg REG) MEM REG) which are the > > > > > > reasons for the testsuite regressions in > > > > > > gcc.target/i386/cond_op_fma*.c > > > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > > > > > > > > > I'm not quite sure this is the correct approach - simplify-rtx > > > > > > doesn't seem to do "only canonicalization" but the existing FMA > > > > > > case looks odd in that context. > > > > > > > > > > > > Should the target simply reject cases with wrong "canonicalization" > > > > > > or does it need to cope with all variants in the patterns that fail > > > > > > matching during combine without the change? > > > > > Let me try the backend fix first. > > > > The backend fix requires at least 8 more patterns, so I think RTL > > > > canonicalization would be better. > > > > Please go ahead with the patch. > > > > > > I'm still looking for insights on how we usually canonicalize on RTL > > > (and where we document canonicalizations) and how we maintain RTL > > > in canonical form. 
> > > > > > I'm also still wondering why the backend accepts "non-canonical" RTL > > > instead of rejecting it, giving the middle-end the chance to try > > > an alternative variant? > > So you mean middle-end will alway canonicalize (fma: reg mem reg) to > > (fma: mem reg reg)? > > I only saw that RTL will canonicalize (fma: a (neg: b) c) to (fma: (neg a) > > b c). > > Or what do you mean about "non-canonical" RTL in the backend? > > "non-canonical" RTL in the backend is what the patterns in question > for this bugreport do not accept. But maybe I'm missing something here. > > IIRC there's code somewhere in combine to try several "canonical" > varaints of an insns in recog_for_combine, but as said I'm not very > familiar with how things work on RTL to decide what's conceptually the > correct thing to do here. I just discvered simplify_rtx already does some > minor canonicalization for FMAs ... I think FMA itself is ok, reg or mem is also ok, the problem is with masked FMA(with mult operands). so there are (vec_merge: (fma: op1 op2 op3) op1 mask) or (vec_merge: (fma:op2 op1 op3) op1 mask) They're the same instruction since op1 and op2 are commutative. Either the middle-end should canonicalize them, or the backend adds an extra alternative to match the potential optimization. So maybe in combine.cc: maybe_swap_commutative_operands, let's always canonicalize to (vec_merge: (fma: op1 op2 op3) op1 mask), together with the backend patch(which relaxes the predicates + some other potential changes for the pattern), should fix the issue. > > Richard. > > > > > > > Richard. > > > > > > > > > > > > > > > Thanks, > > > > > > Richard. > > > > > > > > > > > > PR target/117072 > > > > > > * simplify-rtx.cc > > > > > > (simplify_context::simplify_ternary_operation): > > > > > > Adjust FMA canonicalization. > > > > > > --- > > > > > > gcc/simplify-rtx.cc | 15 +-- > > > > > > 1 file changed, 13 insertions(+), 2 d
Re: [PATCH] [RFC] target/117072 - more RTL FMA canonicalization
On Sun, Oct 13, 2024 at 8:02 PM Richard Biener wrote: > > On Sun, 13 Oct 2024, Hongtao Liu wrote: > > > On Fri, Oct 11, 2024 at 8:33 PM Hongtao Liu wrote: > > > > > > On Fri, Oct 11, 2024 at 8:22 PM Richard Biener wrote: > > > > > > > > The following helps the x86 backend by canonicalizing FMAs to have > > > > any negation done to one of the commutative multiplication operands > > > > be done to a register (and not a memory operand). Likewise to > > > > put a register operand first and a memory operand second; > > > > swap_commutative_operands_p seems to treat REG_P and MEM_P the > > > > same but comments indicate "complex expressions should be first". > > > > > > > > In particular this does (fma MEM REG REG) -> (fma REG MEM REG) and > > > > (fma (neg MEM) REG REG) -> (fma (neg REG) MEM REG) which are the > > > > reasons for the testsuite regressions in gcc.target/i386/cond_op_fma*.c > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > > > > > I'm not quite sure this is the correct approach - simplify-rtx > > > > doesn't seem to do "only canonicalization" but the existing FMA > > > > case looks odd in that context. > > > > > > > > Should the target simply reject cases with wrong "canonicalization" > > > > or does it need to cope with all variants in the patterns that fail > > > > matching during combine without the change? > > > Let me try the backend fix first. > > The backend fix requires at least 8 more patterns, so I think RTL > > canonicalization would be better. > > Please go ahead with the patch. > > I'm still looking for insights on how we usually canonicalize on RTL > (and where we document canonicalizations) and how we maintain RTL > in canonical form. > > I'm also still wondering why the backend accepts "non-canonical" RTL > instead of rejecting it, giving the middle-end the chance to try > an alternative variant? So you mean middle-end will always canonicalize (fma: reg mem reg) to (fma: mem reg reg)?
I only saw that RTL will canonicalize (fma: a (neg: b) c) to (fma: (neg a) b c). Or what do you mean about "non-canonical" RTL in the backend? > > Richard. > > > > > > > > > Thanks, > > > > Richard. > > > > > > > > PR target/117072 > > > > * simplify-rtx.cc > > > > (simplify_context::simplify_ternary_operation): > > > > Adjust FMA canonicalization. > > > > --- > > > > gcc/simplify-rtx.cc | 15 +-- > > > > 1 file changed, 13 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc > > > > index e8e60404ef6..8b4fa0d7aa4 100644 > > > > --- a/gcc/simplify-rtx.cc > > > > +++ b/gcc/simplify-rtx.cc > > > > @@ -6830,10 +6830,21 @@ simplify_context::simplify_ternary_operation > > > > (rtx_code code, machine_mode mode, > > > > op0 = tem, op1 = XEXP (op1, 0), any_change = true; > > > > } > > > > > > > > - /* Canonicalize the two multiplication operands. */ > > > > + /* Canonicalize the two multiplication operands. A negation > > > > +should go first and if possible the negation should be > > > > +to a register. */ > > > >/* a * -b + c => -b * a + c. */ > > > > - if (swap_commutative_operands_p (op0, op1)) > > > > + if (swap_commutative_operands_p (op0, op1) > > > > + || (REG_P (op1) && !REG_P (op0) && GET_CODE (op0) != NEG)) > > > > std::swap (op0, op1), any_change = true; > > > > + else if (GET_CODE (op0) == NEG && !REG_P (XEXP (op0, 0)) > > > > + && REG_P (op1)) > > > > + { > > > > + op0 = XEXP (op0, 0); > > > > + op1 = simplify_gen_unary (NEG, mode, op1, mode); > > > > + std::swap (op0, op1); > > > > + any_change = true; > > > > + } > > > > > > > >if (any_change) > > > > return gen_rtx_FMA (mode, op0, op1, op2); > > > > -- > > > > 2.43.0 > > > > > > > > > > > > -- > > > BR, > > > Hongtao > > > > > > > > > > -- > Richard Biener > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; > HRB 36809 (AG Nuernberg) -- BR, Hongtao
Re: [PATCH] [RFC] target/117072 - more RTL FMA canonicalization
On Fri, Oct 11, 2024 at 8:33 PM Hongtao Liu wrote: > > On Fri, Oct 11, 2024 at 8:22 PM Richard Biener wrote: > > > > The following helps the x86 backend by canonicalizing FMAs to have > > any negation done to one of the commutative multiplication operands > > be done to a register (and not a memory operand). Likewise to > > put a register operand first and a memory operand second; > > swap_commutative_operands_p seems to treat REG_P and MEM_P the > > same but comments indicate "complex expressiosn should be first". > > > > In particular this does (fma MEM REG REG) -> (fma REG MEM REG) and > > (fma (neg MEM) REG REG) -> (fma (neg REG) MEM REG) which are the > > reasons for the testsuite regressions in gcc.target/i386/cond_op_fma*.c > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > I'm not quite sure this is the correct approach - simplify-rtx > > doesn't seem to do "only canonicalization" but the existing FMA > > case looks odd in that context. > > > > Should the target simply reject cases with wrong "canonicalization" > > or does it need to cope with all variants in the patterns that fail > > matching during combine without the change? > Let me try the backend fix first. The backend fix requires at least 8 more patterns, so I think RTL canonicalization would be better. Please go ahead with the patch. > > > > Thanks, > > Richard. > > > > PR target/117072 > > * simplify-rtx.cc (simplify_context::simplify_ternary_operation): > > Adjust FMA canonicalization. > > --- > > gcc/simplify-rtx.cc | 15 +-- > > 1 file changed, 13 insertions(+), 2 deletions(-) > > > > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc > > index e8e60404ef6..8b4fa0d7aa4 100644 > > --- a/gcc/simplify-rtx.cc > > +++ b/gcc/simplify-rtx.cc > > @@ -6830,10 +6830,21 @@ simplify_context::simplify_ternary_operation > > (rtx_code code, machine_mode mode, > > op0 = tem, op1 = XEXP (op1, 0), any_change = true; > > } > > > > - /* Canonicalize the two multiplication operands. 
*/ > > + /* Canonicalize the two multiplication operands. A negation > > +should go first and if possible the negation should be > > +to a register. */ > >/* a * -b + c => -b * a + c. */ > > - if (swap_commutative_operands_p (op0, op1)) > > + if (swap_commutative_operands_p (op0, op1) > > + || (REG_P (op1) && !REG_P (op0) && GET_CODE (op0) != NEG)) > > std::swap (op0, op1), any_change = true; > > + else if (GET_CODE (op0) == NEG && !REG_P (XEXP (op0, 0)) > > + && REG_P (op1)) > > + { > > + op0 = XEXP (op0, 0); > > + op1 = simplify_gen_unary (NEG, mode, op1, mode); > > + std::swap (op0, op1); > > + any_change = true; > > + } > > > >if (any_change) > > return gen_rtx_FMA (mode, op0, op1, op2); > > -- > > 2.43.0 > > > > -- > BR, > Hongtao -- BR, Hongtao
Re: [PATCH] [RFC] target/117072 - more RTL FMA canonicalization
On Fri, Oct 11, 2024 at 8:22 PM Richard Biener wrote: > > The following helps the x86 backend by canonicalizing FMAs to have > any negation done to one of the commutative multiplication operands > be done to a register (and not a memory operand). Likewise to > put a register operand first and a memory operand second; > swap_commutative_operands_p seems to treat REG_P and MEM_P the > same but comments indicate "complex expressions should be first". > > In particular this does (fma MEM REG REG) -> (fma REG MEM REG) and > (fma (neg MEM) REG REG) -> (fma (neg REG) MEM REG) which are the > reasons for the testsuite regressions in gcc.target/i386/cond_op_fma*.c > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > I'm not quite sure this is the correct approach - simplify-rtx > doesn't seem to do "only canonicalization" but the existing FMA > case looks odd in that context. > > Should the target simply reject cases with wrong "canonicalization" > or does it need to cope with all variants in the patterns that fail > matching during combine without the change? Let me try the backend fix first. > > Thanks, > Richard. > > PR target/117072 > * simplify-rtx.cc (simplify_context::simplify_ternary_operation): > Adjust FMA canonicalization. > --- > gcc/simplify-rtx.cc | 15 +-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc > index e8e60404ef6..8b4fa0d7aa4 100644 > --- a/gcc/simplify-rtx.cc > +++ b/gcc/simplify-rtx.cc > @@ -6830,10 +6830,21 @@ simplify_context::simplify_ternary_operation > (rtx_code code, machine_mode mode, > op0 = tem, op1 = XEXP (op1, 0), any_change = true; > } > > - /* Canonicalize the two multiplication operands. */ > + /* Canonicalize the two multiplication operands. A negation > +should go first and if possible the negation should be > +to a register. */ >/* a * -b + c => -b * a + c. 
*/ > - if (swap_commutative_operands_p (op0, op1)) > + if (swap_commutative_operands_p (op0, op1) > + || (REG_P (op1) && !REG_P (op0) && GET_CODE (op0) != NEG)) > std::swap (op0, op1), any_change = true; > + else if (GET_CODE (op0) == NEG && !REG_P (XEXP (op0, 0)) > + && REG_P (op1)) > + { > + op0 = XEXP (op0, 0); > + op1 = simplify_gen_unary (NEG, mode, op1, mode); > + std::swap (op0, op1); > + any_change = true; > + } > >if (any_change) > return gen_rtx_FMA (mode, op0, op1, op2); > -- > 2.43.0 -- BR, Hongtao
Re: [PATCH] x86: Implement Fast-Math Float Truncation to BF16 via PSRLD Instruction
On Tue, Oct 8, 2024 at 3:24 PM Levy Hsu wrote: > > Bootstrapped and tested on x86_64-linux-gnu, OK for trunk? Ok. > > gcc/ChangeLog: > > * config/i386/i386.md: Rewrite insn truncsfbf2. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/truncsfbf-1.c: New test. > * gcc.target/i386/truncsfbf-2.c: New test. > --- > gcc/config/i386/i386.md | 16 ++--- > gcc/testsuite/gcc.target/i386/truncsfbf-1.c | 9 +++ > gcc/testsuite/gcc.target/i386/truncsfbf-2.c | 65 + > 3 files changed, 83 insertions(+), 7 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/truncsfbf-1.c > create mode 100644 gcc/testsuite/gcc.target/i386/truncsfbf-2.c > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md > index 9c2a0aa6112..d3fee0968d8 100644 > --- a/gcc/config/i386/i386.md > +++ b/gcc/config/i386/i386.md > @@ -5672,16 +5672,18 @@ > (set_attr "mode" "HF")]) > > (define_insn "truncsfbf2" > - [(set (match_operand:BF 0 "register_operand" "=x, v") > + [(set (match_operand:BF 0 "register_operand" "=x,x,v,Yv") > (float_truncate:BF > - (match_operand:SF 1 "register_operand" "x,v")))] > - "((TARGET_AVX512BF16 && TARGET_AVX512VL) || TARGET_AVXNECONVERT) > - && !HONOR_NANS (BFmode) && flag_unsafe_math_optimizations" > + (match_operand:SF 1 "register_operand" "0,x,v,Yv")))] > + "TARGET_SSE2 && flag_unsafe_math_optimizations && !HONOR_NANS (BFmode)" >"@ > + psrld\t{$16, %0|%0, 16} >%{vex%} vcvtneps2bf16\t{%1, %0|%0, %1} > - vcvtneps2bf16\t{%1, %0|%0, %1}" > - [(set_attr "isa" "avxneconvert,avx512bf16vl") > - (set_attr "prefix" "vex,evex")]) > + vcvtneps2bf16\t{%1, %0|%0, %1} > + vpsrld\t{$16, %1, %0|%0, %1, 16}" > + [(set_attr "isa" "noavx,avxneconvert,avx512bf16vl,avx") > + (set_attr "prefix" "orig,vex,evex,vex") > + (set_attr "type" "sseishft1,ssecvt,ssecvt,sseishft1")]) > > ;; Signed conversion to DImode. 
> > diff --git a/gcc/testsuite/gcc.target/i386/truncsfbf-1.c > b/gcc/testsuite/gcc.target/i386/truncsfbf-1.c > new file mode 100644 > index 000..dd3ff8a50b4 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/truncsfbf-1.c > @@ -0,0 +1,9 @@ > +/* { dg-do compile } */ > +/* { dg-options "-msse2 -O2 -ffast-math" } */ > +/* { dg-final { scan-assembler-times "psrld" 1 } } */ > + > +__bf16 > +foo (float a) > +{ > + return a; > +} > diff --git a/gcc/testsuite/gcc.target/i386/truncsfbf-2.c > b/gcc/testsuite/gcc.target/i386/truncsfbf-2.c > new file mode 100644 > index 000..f4952f88fc9 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/truncsfbf-2.c > @@ -0,0 +1,65 @@ > +/* { dg-do run } */ > +/* { dg-options "-msse2 -O2 -ffast-math" } */ > + > +#include > +#include > +#include > +#include > + > +__bf16 > +foo (float a) > +{ > + return a; > +} > + > +static __bf16 > +CALC (float *a) > +{ > + uint32_t bits; > + memcpy (&bits, a, sizeof (bits)); > + bits >>= 16; > + uint16_t bfloat16_bits = (uint16_t) bits; > + __bf16 bf16; > + memcpy (&bf16, &bfloat16_bits, sizeof (bf16)); > + return bf16; > +} > + > +int > +main (void) > +{ > + float test_values[] = { 0.0f, -0.0f, 1.0f, -1.0f, 0.5f, -0.5f, 1000.0f, > -1000.0f, > + 3.1415926f, -3.1415926f, 1e-8f, -1e-8f, > + 1.0e+38f, -1.0e+38f, 1.0e-38f, -1.0e-38f }; > + size_t num_values = sizeof (test_values) / sizeof (test_values[0]); > + > + for (size_t i = 0; i < num_values; ++i) > +{ > + float original = test_values[i]; > + __bf16 hw_bf16 = foo (original); > + __bf16 sw_bf16 = CALC (&original); > + > + /* Verify psrld $16, %0 == %0 >> 16 */ > + if (memcmp (&hw_bf16, &sw_bf16, sizeof (__bf16)) != 0) > +abort (); > + > + /* Reconstruct the float value from the __bf16 bits */ > + uint16_t bf16_bits; > + memcpy (&bf16_bits, &hw_bf16, sizeof (bf16_bits)); > + uint32_t reconstructed_bits = ((uint32_t) bf16_bits) << 16; > + float converted; > + memcpy (&converted, &reconstructed_bits, sizeof (converted)); > + > + float diff = fabsf 
(original - converted); > + > + /* Expected Maximum Precision Loss */ > + uint32_t orig_bits; > + memcpy (&orig_bits, &original, sizeof (orig_bits)); > + int exponent = ((orig_bits >> 23) & 0xFF) - 127; > + float expected_loss = (exponent == -127) > +? ldexpf (1.0f, -126 - 7) > +: ldexpf (1.0f, exponent - 7); > + if (diff > expected_loss) > +abort (); > +} > + return 0; > +} > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH v2 2/2] Adjust testcase after relax O2 vectorization.
On Tue, Oct 8, 2024 at 4:56 PM Richard Biener wrote: > > On Tue, Oct 8, 2024 at 10:36 AM liuhongt wrote: > > > > gcc/testsuite/ChangeLog: > > > > * gcc.dg/fstack-protector-strong.c: Adjust > > scan-assembler-times. > > * gcc.dg/graphite/scop-6.c: Add > > -Wno-aggressive-loop-optimizations. > > * gcc.dg/graphite/scop-9.c: Ditto. > > * gcc.dg/tree-ssa/ivopts-lt-2.c: Add -fno-tree-vectorize. > > * gcc.dg/tree-ssa/ivopts-lt.c: Ditto. > > * gcc.dg/tree-ssa/loop-16.c: Ditto. > > * gcc.dg/tree-ssa/loop-28.c: Ditto. > > * gcc.dg/tree-ssa/loop-bound-2.c: Ditto. > > * gcc.dg/tree-ssa/loop-bound-4.c: Ditto. > > * gcc.dg/tree-ssa/loop-bound-6.c: Ditto. > > * gcc.dg/tree-ssa/predcom-4.c: Ditto. > > * gcc.dg/tree-ssa/predcom-5.c: Ditto. > > * gcc.dg/tree-ssa/scev-11.c: Ditto. > > * gcc.dg/tree-ssa/scev-9.c: Ditto. > > * gcc.dg/tree-ssa/split-path-11.c: Ditto. > > * gcc.dg/unroll-8.c: Ditto. > > * gcc.dg/var-expand1.c: Ditto. > > * gcc.dg/vect/vect-cost-model-6.c: Ditto. > > * gcc.target/i386/pr86270.c: Ditto. > > * gcc.target/i386/pr86722.c: Ditto. > > * gcc.target/x86_64/abi/callabi/leaf-2.c: Ditto. 
> > --- > > gcc/testsuite/gcc.dg/fstack-protector-strong.c | 2 +- > > gcc/testsuite/gcc.dg/graphite/scop-6.c | 1 + > > gcc/testsuite/gcc.dg/graphite/scop-9.c | 1 + > > gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c| 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/loop-16.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/loop-28.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/loop-bound-2.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/loop-bound-4.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/loop-bound-6.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/predcom-4.c| 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/predcom-5.c| 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/scev-11.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/scev-9.c | 2 +- > > gcc/testsuite/gcc.dg/tree-ssa/split-path-11.c| 2 +- > > gcc/testsuite/gcc.dg/unroll-8.c | 3 +-- > > gcc/testsuite/gcc.dg/var-expand1.c | 2 +- > > gcc/testsuite/gcc.dg/vect/vect-cost-model-6.c| 2 +- > > gcc/testsuite/gcc.target/i386/pr86270.c | 2 +- > > gcc/testsuite/gcc.target/i386/pr86722.c | 2 +- > > gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c | 2 +- > > 21 files changed, 21 insertions(+), 20 deletions(-) > > > > diff --git a/gcc/testsuite/gcc.dg/fstack-protector-strong.c > > b/gcc/testsuite/gcc.dg/fstack-protector-strong.c > > index 94dc3508f1a..b9f63966b7c 100644 > > --- a/gcc/testsuite/gcc.dg/fstack-protector-strong.c > > +++ b/gcc/testsuite/gcc.dg/fstack-protector-strong.c > > @@ -154,4 +154,4 @@ void foo12 () > >global3 (); > > } > > > > -/* { dg-final { scan-assembler-times "stack_chk_fail" 12 } } */ > > +/* { dg-final { scan-assembler-times "stack_chk_fail" 11 } } */ > > diff --git a/gcc/testsuite/gcc.dg/graphite/scop-6.c > > b/gcc/testsuite/gcc.dg/graphite/scop-6.c > > index 9bc1d9f4ccd..6ea887d9041 100644 > > --- a/gcc/testsuite/gcc.dg/graphite/scop-6.c > > +++ b/gcc/testsuite/gcc.dg/graphite/scop-6.c > > @@ -26,4 +26,5 @@ int toto() > >return a[3][5] + b[2]; > > } > > The testcase looks bogus: > >b[i+k] = 
b[i+k-5] + 2; > > accesses b[-3], can you instead adjust the inner loop to start with k == 4? > > > +/* { dg-additional-options "-Wno-aggressive-loop-optimizations" } */ > > /* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite"} } > > */ > > diff --git a/gcc/testsuite/gcc.dg/graphite/scop-9.c > > b/gcc/testsuite/gcc.dg/graphite/scop-9.c > > index b19291be2f8..2a36bf92fd4 100644 > > --- a/gcc/testsuite/gcc.dg/graphite/scop-9.c > > +++ b/gcc/testsuite/gcc.dg/graphite/scop-9.c > > @@ -21,4 +21,5 @@ int toto() > >return a[3][5] + b[2]; > > } > > Likewise. > > > +/* { dg-additional-options "-Wno-aggressive-loop-optimizations" } */ > > /* { dg-final { scan-tree-dump-times "number of SCoPs: 1" 1 "graphite"} } > > */ > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c > > b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c > > index bdbdbff19ff..be325775fbb 100644 > > --- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c > > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c > > @@ -1,5 +1,5 @@ > > /* { dg-do compile } */ > > -/* { dg-options "-O2 -fno-tree-loop-distribute-patterns > > -fdump-tree-ivopts" } */ > > +/* { dg-options "-O2 -fno-tree-vectorize > > -fno-tree-loop-distribute-patterns -fdump-tree-ivopts" } */ > > /* { dg-skip-if "PR68644" { hppa*-*-* powerpc*-*-* } } */ > > > > void > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c > > b/gcc/testsuite/gcc.dg/tree-ssa/iv
Re: [PATCH v2] x86/{,V}AES: adjust when to force EVEX encoding
On Tue, Oct 8, 2024 at 3:00 PM Jan Beulich wrote: > > On 08.10.2024 08:54, Hongtao Liu wrote: > > On Mon, Sep 30, 2024 at 3:33 PM Jan Beulich wrote: > >> > >> Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly > >> said "..., but we need to emit {evex} prefix in the assembly if AES ISA > >> is not enabled". Yet it did so only for the TARGET_AES insns. Going from > >> the alternative chosen in the TARGET_VAES insns isn't quite right: If > >> AES is (also) enabled, EVEX encoding would needlessly be forced. > >> > >> gcc/ > >> > >> * config/i386/sse.md (vaesdec_<mode>, vaesdeclast_<mode>, > >> vaesenc_<mode>, vaesenclast_<mode>): Replace which_alternative > >> check by TARGET_AES one. > >> --- > >> As an aside - {evex} (and other) pseudo-prefixes would better be avoided > >> anyway whenever possible, as those are getting in the way of code > >> putting in place macro overrides for certain insns: gas 2.43 rejects > >> such bogus placement of pseudo-prefixes. > >> > >> Is it, btw, correct that none of these insns have a "prefix" attribute? > > There's some automatic logic in i386.md to determine the prefix; it's > > rough, not very accurate. > > > > ;; Prefix used: original, VEX or maybe VEX. > > (define_attr "prefix" "orig,vex,maybe_vex,evex,maybe_evex" > > (cond [(eq_attr "mode" "OI,V8SF,V4DF") > > (const_string "vex") > > (eq_attr "mode" "XI,V16SF,V8DF") > > (const_string "evex") > > (eq_attr "type" "ssemuladd") > > (if_then_else (eq_attr "isa" "fma4") > > (const_string "vex") > > (const_string "maybe_evex")) > > (eq_attr "type" "sse4arg") > > (const_string "vex") > > ] > > (const_string "orig"))) > > I'm aware, and I raised the question because it seemed pretty clear to > me that it wouldn't get things right here. AFAIK it's mainly used by the length attr to estimate instruction code size (ix86_min_insn_size), and the rough model should be sufficient for most cases. > > >> --- > >> v2: Adjust (shrink) description.
> > Ok for the patch. > > Thanks. What about the 14.x branch? Also Ok. > > Jan -- BR, Hongtao
Re: [PATCH v2] x86/{,V}AES: adjust when to force EVEX encoding
On Mon, Sep 30, 2024 at 3:33 PM Jan Beulich wrote: > > Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly > said "..., but we need to emit {evex} prefix in the assembly if AES ISA > is not enabled". Yet it did so only for the TARGET_AES insns. Going from > the alternative chosen in the TARGET_VAES insns isn't quite right: If > AES is (also) enabled, EVEX encoding would needlessly be forced. > > gcc/ > > * config/i386/sse.md (vaesdec_<mode>, vaesdeclast_<mode>, > vaesenc_<mode>, vaesenclast_<mode>): Replace which_alternative > check by TARGET_AES one. > --- > As an aside - {evex} (and other) pseudo-prefixes would better be avoided > anyway whenever possible, as those are getting in the way of code > putting in place macro overrides for certain insns: gas 2.43 rejects > such bogus placement of pseudo-prefixes. > > Is it, btw, correct that none of these insns have a "prefix" attribute? There's some automatic logic in i386.md to determine the prefix; it's rough, not very accurate. ;; Prefix used: original, VEX or maybe VEX. (define_attr "prefix" "orig,vex,maybe_vex,evex,maybe_evex" (cond [(eq_attr "mode" "OI,V8SF,V4DF") (const_string "vex") (eq_attr "mode" "XI,V16SF,V8DF") (const_string "evex") (eq_attr "type" "ssemuladd") (if_then_else (eq_attr "isa" "fma4") (const_string "vex") (const_string "maybe_evex")) (eq_attr "type" "sse4arg") (const_string "vex") ] (const_string "orig"))) > --- > v2: Adjust (shrink) description. Ok for the patch.
> > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -30802,7 +30802,7 @@ > UNSPEC_VAESDEC))] >"TARGET_VAES" > { > - if (which_alternative == 0 && <MODE>mode == V16QImode) > + if (!TARGET_AES && <MODE>mode == V16QImode) > return "%{evex%} vaesdec\t{%2, %1, %0|%0, %1, %2}"; >else > return "vaesdec\t{%2, %1, %0|%0, %1, %2}"; > @@ -30816,7 +30816,7 @@ > UNSPEC_VAESDECLAST))] >"TARGET_VAES" > { > - if (which_alternative == 0 && <MODE>mode == V16QImode) > + if (!TARGET_AES && <MODE>mode == V16QImode) > return "%{evex%} vaesdeclast\t{%2, %1, %0|%0, %1, %2}"; >else > return "vaesdeclast\t{%2, %1, %0|%0, %1, %2}"; > @@ -30830,7 +30830,7 @@ > UNSPEC_VAESENC))] >"TARGET_VAES" > { > - if (which_alternative == 0 && <MODE>mode == V16QImode) > + if (!TARGET_AES && <MODE>mode == V16QImode) > return "%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}"; >else > return "vaesenc\t{%2, %1, %0|%0, %1, %2}"; > @@ -30844,7 +30844,7 @@ > UNSPEC_VAESENCLAST))] >"TARGET_VAES" > { > - if (which_alternative == 0 && <MODE>mode == V16QImode) > + if (!TARGET_AES && <MODE>mode == V16QImode) > return "%{evex%} vaesenclast\t{%2, %1, %0|%0, %1, %2}"; >else > return "vaesenclast\t{%2, %1, %0|%0, %1, %2}"; -- BR, Hongtao
Re: [PATCH] x86: Extend AVX512 Vectorization for Popcount in Various Modes
On Tue, Sep 24, 2024 at 10:16 AM Levy Hsu wrote: > > This patch enables vectorization of the popcount operation for V2QI, V4QI, > V8QI, V2HI, V4HI, and V2SI modes. Ok. > > gcc/ChangeLog: > > * config/i386/mmx.md: > (VI1_16_32_64): New mode iterator for 8-byte, 4-byte, and 2-byte > QImode. > (popcount<mode>2): New pattern for popcount of V2QI/V4QI/V8QI mode. > (popcount<mode>2): New pattern for popcount of V2HI/V4HI mode. > (popcountv2si2): New pattern for popcount of V2SI mode. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/part-vect-popcount-1.c: New test. > --- > gcc/config/i386/mmx.md| 24 + > .../gcc.target/i386/part-vect-popcount-1.c| 49 +++ > 2 files changed, 73 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-popcount-1.c > > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md > index 4bc191b874b..147ae150bf3 100644 > --- a/gcc/config/i386/mmx.md > +++ b/gcc/config/i386/mmx.md > @@ -70,6 +70,9 @@ > ;; 8-byte and 4-byte HImode vector modes > (define_mode_iterator VI2_32_64 [(V4HI "TARGET_MMX_WITH_SSE") V2HI]) > > +;; 8-byte, 4-byte and 2-byte QImode vector modes > +(define_mode_iterator VI1_16_32_64 [(V8QI "TARGET_MMX_WITH_SSE") V4QI V2QI]) > + > ;; 4-byte and 2-byte integer vector modes > (define_mode_iterator VI_16_32 [V4QI V2QI V2HI]) > > @@ -6786,3 +6789,24 @@ >[(set_attr "type" "mmx") > (set_attr "modrm" "0") > (set_attr "memory" "none")]) > + > +(define_insn "popcount<mode>2" > + [(set (match_operand:VI1_16_32_64 0 "register_operand" "=v") > + (popcount:VI1_16_32_64 > + (match_operand:VI1_16_32_64 1 "register_operand" "v")))] > + "TARGET_AVX512VL && TARGET_AVX512BITALG" > + "vpopcntb\t{%1, %0|%0, %1}") > + > +(define_insn "popcount<mode>2" > + [(set (match_operand:VI2_32_64 0 "register_operand" "=v") > + (popcount:VI2_32_64 > + (match_operand:VI2_32_64 1 "register_operand" "v")))] > + "TARGET_AVX512VL && TARGET_AVX512BITALG" > + "vpopcntw\t{%1, %0|%0, %1}") > + > +(define_insn "popcountv2si2" > + [(set (match_operand:V2SI 0 
"register_operand" "=v") > + (popcount:V2SI > + (match_operand:V2SI 1 "register_operand" "v")))] > + "TARGET_AVX512VPOPCNTDQ && TARGET_AVX512VL && TARGET_MMX_WITH_SSE" > + "vpopcntd\t{%1, %0|%0, %1}") > diff --git a/gcc/testsuite/gcc.target/i386/part-vect-popcount-1.c > b/gcc/testsuite/gcc.target/i386/part-vect-popcount-1.c > new file mode 100644 > index 000..a30f6ec4726 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/part-vect-popcount-1.c > @@ -0,0 +1,49 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx512vpopcntdq -mavx512bitalg -mavx512vl" } */ > +/* { dg-final { scan-assembler-times "vpopcntd\[^\n\r\]*xmm\[0-9\]" 1 { > target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler-times "vpopcntw\[^\n\r\]*xmm\[0-9\]" 3 { > target ia32 } } } */ > +/* { dg-final { scan-assembler-times "vpopcntw\[^\n\r\]*xmm\[0-9\]" 2 { > target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler-times "vpopcntb\[^\n\r\]*xmm\[0-9\]" 4 { > target ia32 } } } */ > +/* { dg-final { scan-assembler-times "vpopcntb\[^\n\r\]*xmm\[0-9\]" 3 { > target { ! ia32 } } } } */ > + > +void > +foo1 (int* a, int* __restrict b) > +{ > + for (int i = 0; i != 2; i++) > +a[i] = __builtin_popcount (b[i]); > +} > + > +void > +foo2 (unsigned short* a, unsigned short* __restrict b) > +{ > + for (int i = 0; i != 4; i++) > +a[i] = __builtin_popcount (b[i]); > +} > + > +void > +foo3 (unsigned short* a, unsigned short* __restrict b) > +{ > + for (int i = 0; i != 2; i++) > +a[i] = __builtin_popcount (b[i]); > +} > + > +void > +foo4 (unsigned char* a, unsigned char* __restrict b) > +{ > + for (int i = 0; i != 8; i++) > +a[i] = __builtin_popcount (b[i]); > +} > + > +void > +foo5 (unsigned char* a, unsigned char* __restrict b) > +{ > + for (int i = 0; i != 4; i++) > +a[i] = __builtin_popcount (b[i]); > +} > + > +void > +foo6 (unsigned char* a, unsigned char* __restrict b) > +{ > + for (int i = 0; i != 2; i++) > +a[i] = __builtin_popcount (b[i]); > +} > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH] i386, v2: Add GENERIC and GIMPLE folders of __builtin_ia32_{min,max}* [PR116738]
On Wed, Sep 25, 2024 at 4:42 PM Jakub Jelinek wrote: > > On Wed, Sep 25, 2024 at 10:17:50AM +0800, Hongtao Liu wrote: > > > + for (int i = 0; i < 2; ++i) > > > + { > > > + unsigned count = vector_cst_encoded_nelts (args[i]), j; > > > + for (j = 0; j < count; ++j) > > > + if (!tree_expr_nan_p (VECTOR_CST_ENCODED_ELT (args[i], > > > j))) > > Is this a typo? I assume you want to check if the component is NAN, so > > tree_expr_nan_p, not !tree_expr_nan_p? > > > + break; > > > + if (j < count) > > > + break; > > Also this break just break the outer loop(for (int i = 0; i < 2; > > i++)), but according to comments, it wants to break the outer switch? > > You're right, thanks for catching that. Fortunately both meant just that > it got NaNs optimized too and optimized the rest as it should. > > I just wanted to avoid return NULL_TREE; or goto and screwed it up. > > Here is a fixed version, tested additionally on looking at gimple dump on > typedef float __v4sf __attribute__((vector_size (16))); > __v4sf foo (void) { return __builtin_ia32_minss ((__v4sf) { __builtin_nanf > (""), 0.f, 0.f, 0.f }, (__v4sf) { __builtin_inff (), 1.0f, 2.0f, 3.0f }); } > __v4sf bar (void) { return __builtin_ia32_minss ((__v4sf) { -__builtin_inff > (), 0.f, 0.f, 0.f }, (__v4sf) { __builtin_inff (), 1.0f, 2.0f, 3.0f }); } > > Ok for trunk if it passes bootstrap/regtest? Ok. > > 2024-09-25 Jakub Jelinek > > PR target/116738 > * config/i386/i386.cc (ix86_fold_builtin): Handle > IX86_BUILTIN_M{IN,AX}{S,P}{S,H,D}*. > (ix86_gimple_fold_builtin): Handle IX86_BUILTIN_M{IN,AX}P{S,H,D}*. > > * gcc.target/i386/avx512f-pr116738-1.c: New test. > * gcc.target/i386/avx512f-pr116738-2.c: New test. 
> > --- gcc/config/i386/i386.cc.jj 2024-09-24 18:54:24.120313544 +0200 > +++ gcc/config/i386/i386.cc 2024-09-25 10:21:00.922417024 +0200 > @@ -18507,6 +18507,8 @@ ix86_fold_builtin (tree fndecl, int n_ar > = (enum ix86_builtins) DECL_MD_FUNCTION_CODE (fndecl); >enum rtx_code rcode; >bool is_vshift; > + enum tree_code tcode; > + bool is_scalar; >unsigned HOST_WIDE_INT mask; > >switch (fn_code) > @@ -18956,6 +18958,131 @@ ix86_fold_builtin (tree fndecl, int n_ar > } > break; > > + case IX86_BUILTIN_MINSS: > + case IX86_BUILTIN_MINSH_MASK: > + tcode = LT_EXPR; > + is_scalar = true; > + goto do_minmax; > + > + case IX86_BUILTIN_MAXSS: > + case IX86_BUILTIN_MAXSH_MASK: > + tcode = GT_EXPR; > + is_scalar = true; > + goto do_minmax; > + > + case IX86_BUILTIN_MINPS: > + case IX86_BUILTIN_MINPD: > + case IX86_BUILTIN_MINPS256: > + case IX86_BUILTIN_MINPD256: > + case IX86_BUILTIN_MINPS512: > + case IX86_BUILTIN_MINPD512: > + case IX86_BUILTIN_MINPS128_MASK: > + case IX86_BUILTIN_MINPD128_MASK: > + case IX86_BUILTIN_MINPS256_MASK: > + case IX86_BUILTIN_MINPD256_MASK: > + case IX86_BUILTIN_MINPH128_MASK: > + case IX86_BUILTIN_MINPH256_MASK: > + case IX86_BUILTIN_MINPH512_MASK: > + tcode = LT_EXPR; > + is_scalar = false; > + goto do_minmax; > + > + case IX86_BUILTIN_MAXPS: > + case IX86_BUILTIN_MAXPD: > + case IX86_BUILTIN_MAXPS256: > + case IX86_BUILTIN_MAXPD256: > + case IX86_BUILTIN_MAXPS512: > + case IX86_BUILTIN_MAXPD512: > + case IX86_BUILTIN_MAXPS128_MASK: > + case IX86_BUILTIN_MAXPD128_MASK: > + case IX86_BUILTIN_MAXPS256_MASK: > + case IX86_BUILTIN_MAXPD256_MASK: > + case IX86_BUILTIN_MAXPH128_MASK: > + case IX86_BUILTIN_MAXPH256_MASK: > + case IX86_BUILTIN_MAXPH512_MASK: > + tcode = GT_EXPR; > + is_scalar = false; > + do_minmax: > + gcc_assert (n_args >= 2); > + if (TREE_CODE (args[0]) != VECTOR_CST > + || TREE_CODE (args[1]) != VECTOR_CST) > + break; > + mask = HOST_WIDE_INT_M1U; > + if (n_args > 2) > + { > + gcc_assert (n_args >= 4); > + /* This is masked minmax. 
*/ > + if (TREE_CODE (args[3]) != INTEGER_CST > + || TREE_SIDE_EFFECTS (args[2])) > + break; > + mask = TREE_INT_CST_LOW (args[3]); > + unsigned elems = TYPE_VECTOR_SUBPARTS (
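For reference, a scalar sketch of the semantics the fold relies on (illustrative, not the GCC folder itself): the builtins behave like a conditional built from LT_EXPR (min) or GT_EXPR (max), and a NaN operand makes the result asymmetric, which is why the folder punts on NaN elements unless exceptions are ignored.

```c
#include <assert.h>
#include <math.h>

/* x86 minsd/minss-style semantics, matching the LT_EXPR fold above:
   min(a, b) = a < b ? a : b.  If either operand is a NaN the comparison
   is false and b is returned, so min(NaN, x) == x but min(x, NaN) is
   NaN; the operation is not commutative in the presence of NaNs. */
static double
minsd_ref (double a, double b)
{
  return a < b ? a : b;
}
```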
Re: [PATCH] x86/{,V}AES: adjust when to force EVEX encoding
On Wed, Sep 25, 2024 at 3:55 PM Jan Beulich wrote: > > On 25.09.2024 09:38, Hongtao Liu wrote: > > On Wed, Sep 25, 2024 at 2:56 PM Jan Beulich wrote: > >> > >> Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly > >> said "..., but we need to emit {evex} prefix in the assembly if AES ISA > >> is not enabled". Yet it did so only for the TARGET_AES insns. Going from > >> the alternative chosen in the TARGET_VAES insns is wrong for two > >> reasons: > >> - if, with AES disabled, the latter alternative was chosen despite no > >> "high" XMM register nor any eGPR in use, gas would still pick the AES > > W/o an EVEX SSE register or eGPR, the first alternative will always be > > matched (alternative 0). > > That is how it works (matching goes from left to right). > > Well, if that's guaranteed to always be the case, then ... > > >> (VEX) encoding when no {evex} pseudo-prefix is in use (which is > >> against - as stated by the description of said commit - AES presently > >> not being considered a prereq of VAES in gcc); > > > >> - if AES is (also) enabled, EVEX encoding would needlessly be forced. > > So it's more like an optimization that uses VEX encoding when AES is enabled? > > ... in a way it's an optimization, yes. I can adjust the description > accordingly. However, it's not _just_ an optimization, it also is a > fix for compilation (really: assembly) failing in ... > > >> --- > >> As an aside - {evex} (and other) pseudo-prefixes would better be avoided > >> anyway whenever possible, as those are getting in the way of code > >> putting in place macro overrides for certain insns: gas 2.43 rejects > >> such bogus placement of pseudo-prefixes. So it sounds like a workaround in GCC to avoid the gas bug? In general, I'm OK with the patch since we already do that in the TARGET_AES patterns. 
(define_insn "aesenc" [(set (match_operand:V2DI 0 "register_operand" "=x,x,v") (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x,v") (match_operand:V2DI 2 "vector_operand" "xja,xjm,vm")] UNSPEC_AESENC))] "TARGET_AES || (TARGET_VAES && TARGET_AVX512VL)" "@ aesenc\t{%2, %0|%0, %2} * return TARGET_AES ? \"vaesenc\t{%2, %1, %0|%0, %1, %2}\" : \"%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}\"; vaesenc\t{%2, %1, %0|%0, %1, %2}" [(set_attr "isa" "noavx,avx,vaes_avx512vl") > > ... cases like this (which is how I actually came to notice the issue). > > Jan -- BR, Hongtao
Re: [PATCH] x86/{,V}AES: adjust when to force EVEX encoding
On Wed, Sep 25, 2024 at 2:56 PM Jan Beulich wrote: > > Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly > said "..., but we need to emit {evex} prefix in the assembly if AES ISA > is not enabled". Yet it did so only for the TARGET_AES insns. Going from > the alternative chosen in the TARGET_VAES insns is wrong for two > reasons: > - if, with AES disabled, the latter alternative was chosen despite no > "high" XMM register nor any eGPR in use, gas would still pick the AES W/o an EVEX SSE register or eGPR, the first alternative will always be matched (alternative 0). That is how it works (matching goes from left to right). > (VEX) encoding when no {evex} pseudo-prefix is in use (which is > against - as stated by the description of said commit - AES presently > not being considered a prereq of VAES in gcc); > - if AES is (also) enabled, EVEX encoding would needlessly be forced. So it's more like an optimization that uses VEX encoding when AES is enabled? > > gcc/ > > * config/i386/sse.md (vaesdec_<mode>, vaesdeclast_<mode>, > vaesenc_<mode>, vaesenclast_<mode>): Replace which_alternative > check by TARGET_AES one. > --- > As an aside - {evex} (and other) pseudo-prefixes would better be avoided > anyway whenever possible, as those are getting in the way of code > putting in place macro overrides for certain insns: gas 2.43 rejects > such bogus placement of pseudo-prefixes. 
> > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -30802,7 +30802,7 @@ > UNSPEC_VAESDEC))] >"TARGET_VAES" > { > - if (which_alternative == 0 && <MODE>mode == V16QImode) > + if (!TARGET_AES && <MODE>mode == V16QImode) > return "%{evex%} vaesdec\t{%2, %1, %0|%0, %1, %2}"; >else > return "vaesdec\t{%2, %1, %0|%0, %1, %2}"; > @@ -30816,7 +30816,7 @@ > UNSPEC_VAESDECLAST))] >"TARGET_VAES" > { > - if (which_alternative == 0 && <MODE>mode == V16QImode) > + if (!TARGET_AES && <MODE>mode == V16QImode) > return "%{evex%} vaesdeclast\t{%2, %1, %0|%0, %1, %2}"; >else > return "vaesdeclast\t{%2, %1, %0|%0, %1, %2}"; > @@ -30830,7 +30830,7 @@ > UNSPEC_VAESENC))] >"TARGET_VAES" > { > - if (which_alternative == 0 && <MODE>mode == V16QImode) > + if (!TARGET_AES && <MODE>mode == V16QImode) > return "%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}"; >else > return "vaesenc\t{%2, %1, %0|%0, %1, %2}"; > @@ -30844,7 +30844,7 @@ > UNSPEC_VAESENCLAST))] >"TARGET_VAES" > { > - if (which_alternative == 0 && <MODE>mode == V16QImode) > + if (!TARGET_AES && <MODE>mode == V16QImode) > return "%{evex%} vaesenclast\t{%2, %1, %0|%0, %1, %2}"; >else > return "vaesenclast\t{%2, %1, %0|%0, %1, %2}"; -- BR, Hongtao
Re: [PATCH] i386: Add GENERIC and GIMPLE folders of __builtin_ia32_{min,max}* [PR116738]
On Wed, Sep 25, 2024 at 1:07 AM Jakub Jelinek wrote: > > Hi! > > The following patch adds GENERIC and GIMPLE folders for various > x86 min/max builtins. > As discussed, these builtins have effectively x < y ? x : y > (or x > y ? x : y) behavior. > The GENERIC folding is done if all the (relevant) arguments are > constants (such as VECTOR_CST for vectors) and is done because > the GIMPLE folding can't easily handle masking, rounding and the > ss/sd cases (in a way that it would be pattern recognized back to the > corresponding instructions). The GIMPLE folding is also done just > for TARGET_SSE4 or later when optimizing, otherwise it is apparently > not matched back. > > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? > > 2024-09-24 Jakub Jelinek > > PR target/116738 > * config/i386/i386.cc (ix86_fold_builtin): Handle > IX86_BUILTIN_M{IN,AX}{S,P}{S,H,D}*. > (ix86_gimple_fold_builtin): Handle IX86_BUILTIN_M{IN,AX}P{S,H,D}*. > > * gcc.target/i386/avx512f-pr116738-1.c: New test. > * gcc.target/i386/avx512f-pr116738-2.c: New test. 
> > --- gcc/config/i386/i386.cc.jj 2024-09-12 10:56:57.344683959 +0200 > +++ gcc/config/i386/i386.cc 2024-09-23 15:15:40.154783766 +0200 > @@ -18507,6 +18507,8 @@ ix86_fold_builtin (tree fndecl, int n_ar > = (enum ix86_builtins) DECL_MD_FUNCTION_CODE (fndecl); >enum rtx_code rcode; >bool is_vshift; > + enum tree_code tcode; > + bool is_scalar; >unsigned HOST_WIDE_INT mask; > >switch (fn_code) > @@ -18956,6 +18958,133 @@ ix86_fold_builtin (tree fndecl, int n_ar > } > break; > > + case IX86_BUILTIN_MINSS: > + case IX86_BUILTIN_MINSH_MASK: > + tcode = LT_EXPR; > + is_scalar = true; > + goto do_minmax; > + > + case IX86_BUILTIN_MAXSS: > + case IX86_BUILTIN_MAXSH_MASK: > + tcode = GT_EXPR; > + is_scalar = true; > + goto do_minmax; > + > + case IX86_BUILTIN_MINPS: > + case IX86_BUILTIN_MINPD: > + case IX86_BUILTIN_MINPS256: > + case IX86_BUILTIN_MINPD256: > + case IX86_BUILTIN_MINPS512: > + case IX86_BUILTIN_MINPD512: > + case IX86_BUILTIN_MINPS128_MASK: > + case IX86_BUILTIN_MINPD128_MASK: > + case IX86_BUILTIN_MINPS256_MASK: > + case IX86_BUILTIN_MINPD256_MASK: > + case IX86_BUILTIN_MINPH128_MASK: > + case IX86_BUILTIN_MINPH256_MASK: > + case IX86_BUILTIN_MINPH512_MASK: > + tcode = LT_EXPR; > + is_scalar = false; > + goto do_minmax; > + > + case IX86_BUILTIN_MAXPS: > + case IX86_BUILTIN_MAXPD: > + case IX86_BUILTIN_MAXPS256: > + case IX86_BUILTIN_MAXPD256: > + case IX86_BUILTIN_MAXPS512: > + case IX86_BUILTIN_MAXPD512: > + case IX86_BUILTIN_MAXPS128_MASK: > + case IX86_BUILTIN_MAXPD128_MASK: > + case IX86_BUILTIN_MAXPS256_MASK: > + case IX86_BUILTIN_MAXPD256_MASK: > + case IX86_BUILTIN_MAXPH128_MASK: > + case IX86_BUILTIN_MAXPH256_MASK: > + case IX86_BUILTIN_MAXPH512_MASK: > + tcode = GT_EXPR; > + is_scalar = false; > + do_minmax: > + gcc_assert (n_args >= 2); > + if (TREE_CODE (args[0]) != VECTOR_CST > + || TREE_CODE (args[1]) != VECTOR_CST) > + break; > + mask = HOST_WIDE_INT_M1U; > + if (n_args > 2) > + { > + gcc_assert (n_args >= 4); > + /* This is masked minmax. 
*/ > + if (TREE_CODE (args[3]) != INTEGER_CST > + || TREE_SIDE_EFFECTS (args[2])) > + break; > + mask = TREE_INT_CST_LOW (args[3]); > + unsigned elems = TYPE_VECTOR_SUBPARTS (TREE_TYPE (args[0])); > + mask |= HOST_WIDE_INT_M1U << elems; > + if (mask != HOST_WIDE_INT_M1U > + && TREE_CODE (args[2]) != VECTOR_CST) > + break; > + if (n_args >= 5) > + { > + if (!tree_fits_uhwi_p (args[4])) > + break; > + if (tree_to_uhwi (args[4]) != 4 > + && tree_to_uhwi (args[4]) != 8) > + break; > + } > + if (mask == (HOST_WIDE_INT_M1U << elems)) > + return args[2]; > + } > + /* Punt on NaNs, unless exceptions are disabled. */ > + if (HONOR_NANS (args[0]) > + && (n_args < 5 || tree_to_uhwi (args[4]) != 8)) > + for (int i = 0; i < 2; ++i) > + { > + unsigned count = vector_cst_encoded_nelts (args[i]), j; > + for (j = 0; j < count; ++j) > + if (!tree_expr_nan_p (VECTOR_CST_ENCODED_ELT (args[i], j))) Is this a typo? I assume you want to check if the component is NAN, so tree_expr_nan_p, not !tree_expr_nan_p? > + break; > + if (j < count) > +
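The writemask shortcut quoted above (forcing on the mask bits past the element count, then returning the pass-through operand args[2] when only those forced bits remain) can be sketched in plain C. Assuming 4 elements for the example values; the real code works on HOST_WIDE_INT:

```c
#include <assert.h>

/* Mirror of the folder's mask normalization: bits beyond the element
   count are set unconditionally, so "mask == just the forced bits"
   means the writemask selects no lane at all and the pass-through
   operand can be returned directly. */
static int
mask_selects_nothing (unsigned long long mask, unsigned elems)
{
  mask |= ~0ULL << elems;          /* force bits past the last lane */
  return mask == (~0ULL << elems); /* only forced bits left? */
}
```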
Re: [PATCH] [x86] Define VECTOR_STORE_FLAG_VALUE
On Tue, Sep 24, 2024 at 5:46 PM Uros Bizjak wrote: > > On Tue, Sep 24, 2024 at 11:23 AM liuhongt wrote: > > > > Return constm1_rtx when GET_MODE_CLASS (MODE) == MODE_VECTOR_INT. > > Otherwise NULL_RTX. > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > Ready push to trunk. > > > > gcc/ChangeLog: > > > > * config/i386/i386.h (VECTOR_STORE_FLAG_VALUE): New macro. > > > > gcc/testsuite/ChangeLog: > > * gcc.dg/rtl/x86_64/vector_eq.c: New test. > > --- > > gcc/config/i386/i386.h | 5 +++- > > gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c | 26 + > > 2 files changed, 30 insertions(+), 1 deletion(-) > > create mode 100644 gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c > > > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > > index c1ec92ffb15..b12be41424f 100644 > > --- a/gcc/config/i386/i386.h > > +++ b/gcc/config/i386/i386.h > > @@ -899,7 +899,10 @@ extern const char *host_detect_local_cpu (int argc, > > const char **argv); > > and give entire struct the alignment of an int. */ > > /* Required on the 386 since it doesn't have bit-field insns. */ > > #define PCC_BITFIELD_TYPE_MATTERS 1 > > - > > + > > +#define VECTOR_STORE_FLAG_VALUE(MODE) \ > > + (GET_MODE_CLASS (MODE) == MODE_VECTOR_INT ? constm1_rtx : NULL_RTX) > > + > > /* Standard register usage. */ > > > > /* This processor has special stack-like registers. See reg-stack.cc > > diff --git a/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c > > b/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c > > new file mode 100644 > > index 000..b82603d0b64 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.dg/rtl/x86_64/vector_eq.c > > @@ -0,0 +1,26 @@ > > +/* { dg-do compile { target x86_64-*-* } } */ > > target { { i?86-*-* x86_64-*-* } && lp64 } Thanks, changed. > > Uros. 
> > > +/* { dg-additional-options "-O2 -march=x86-64-v3" } */ > > + > > +typedef int v4si __attribute__((vector_size(16))); > > + > > +v4si __RTL (startwith ("vregs")) foo (void) > > +{ > > +(function "foo" > > + (insn-chain > > +(block 2 > > + (edge-from entry (flags "FALLTHRU")) > > + (cnote 1 [bb 2] NOTE_INSN_BASIC_BLOCK) > > + (cnote 2 NOTE_INSN_FUNCTION_BEG) > > + (cinsn 3 (set (reg:V4SI <0>) (const_vector:V4SI [(const_int 0) > > (const_int 0) (const_int 0) (const_int 0)]))) > > + (cinsn 5 (set (reg:V4SI <2>) > > + (eq:V4SI (reg:V4SI <0>) (reg:V4SI <1> > > + (cinsn 6 (set (reg:V4SI <3>) (reg:V4SI <2>))) > > + (cinsn 7 (set (reg:V4SI xmm0) (reg:V4SI <3>))) > > + (edge-to exit (flags "FALLTHRU")) > > +) > > + ) > > + (crtl (return_rtx (reg/i:V4SI xmm0))) > > +) > > +} > > + > > +/* { dg-final { scan-assembler-not "vpxor" } } */ > > -- > > 2.31.1 > > -- BR, Hongtao
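For reference, GNU C vector comparisons have exactly the semantics the macro asserts: a true lane is all-ones (-1) and a false lane is zero. A small sketch (illustrative, not part of the patch):

```c
#include <assert.h>

typedef int v4si __attribute__ ((vector_size (16)));

/* A vector comparison yields -1 (all bits set) in true lanes and 0 in
   false lanes, which is the property VECTOR_STORE_FLAG_VALUE ==
   constm1_rtx encodes for integer vector modes. */
static v4si
eq_lanes (v4si a, v4si b)
{
  return a == b;
}
```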
Re: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.
On Thu, Sep 19, 2024 at 2:08 PM Richard Biener wrote: > > On Wed, Sep 18, 2024 at 7:55 PM Richard Sandiford > wrote: > > > > Richard Biener writes: > > > On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu wrote: > > >> > > >> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu wrote: > > >> > > > >> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener > > >> > wrote: > > >> > > > > >> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt > > >> > > wrote: > > >> > > > > > >> > > > GCC12 enables vectorization for O2 with very cheap cost model > > >> > > > which is restricted > > >> > > > to constant tripcount. The vectorization capacity is very limited > > >> > > > w/ consideration > > >> > > > of codesize impact. > > >> > > > > > >> > > > The patch extends the very cheap cost model a little bit to > > >> > > > support variable tripcount. > > >> > > > But still disable peeling for gaps/alignment, runtime aliasing > > >> > > > checking and epilogue > > >> > > > vectorization with the consideration of codesize. > > >> > > > > > >> > > > So there're at most 2 versions of loop for O2 vectorization, one > > >> > > > vectorized main loop > > >> > > > , one scalar/remainder loop. > > >> > > > > > >> > > > .i.e. 
> > >> > > > > > >> > > > void > > >> > > > foo1 (int* __restrict a, int* b, int* c, int n) > > >> > > > { > > >> > > > for (int i = 0; i != n; i++) > > >> > > > a[i] = b[i] + c[i]; > > >> > > > } > > >> > > > > > >> > > > with -O2 -march=x86-64-v3, will be vectorized to > > >> > > > > > >> > > > .L10: > > >> > > > vmovdqu (%r8,%rax), %ymm0 > > >> > > > vpaddd (%rsi,%rax), %ymm0, %ymm0 > > >> > > > vmovdqu %ymm0, (%rdi,%rax) > > >> > > > addq$32, %rax > > >> > > > cmpq%rdx, %rax > > >> > > > jne .L10 > > >> > > > movl%ecx, %eax > > >> > > > andl$-8, %eax > > >> > > > cmpl%eax, %ecx > > >> > > > je .L21 > > >> > > > vzeroupper > > >> > > > .L12: > > >> > > > movl(%r8,%rax,4), %edx > > >> > > > addl(%rsi,%rax,4), %edx > > >> > > > movl%edx, (%rdi,%rax,4) > > >> > > > addq$1, %rax > > >> > > > cmpl%eax, %ecx > > >> > > > jne .L12 > > >> > > > > > >> > > > As measured with SPEC2017 on EMR, the patch(N-Iter) improves > > >> > > > performance by 4.11% > > >> > > > with extra 2.8% codeisze, and cheap cost model improve performance > > >> > > > by 5.74% with > > >> > > > extra 8.88% codesize. The details are as below > > >> > > > > >> > > I'm confused by this, is the N-Iter numbers ontop of the cheap cost > > >> > > model numbers? > > >> > No, it's N-iter vs base(very cheap cost model), and cheap vs base. > > >> > > > > >> > > > Performance measured with -march=x86-64-v3 -O2 on EMR > > >> > > > > > >> > > > N-Iter cheap cost model > > >> > > > 500.perlbench_r -0.12% -0.12% > > >> > > > 502.gcc_r 0.44% -0.11% > > >> > > > 505.mcf_r 0.17% 4.46% > > >> > > > 520.omnetpp_r 0.28% -0.27% > > >> > > > 523.xalancbmk_r 0.00% 5.93% > > >> > > > 525.x264_r -0.09% 23.53% > > >> > > > 531.deepsjeng_r 0.19% 0.00% > > >> > > > 541.leela_r 0.22% 0.00% > > >> > > > 548.exchange2_r -11.54% -22.34% > > >> > > > 557
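For reference, the main/epilogue split in the foo1 example above is plain integer arithmetic. Assuming a vector factor of 8 (eight ints per 256-bit vector), the main loop covers n rounded down to a multiple of 8 (the "andl $-8, %eax" in the quoted assembly) and the scalar loop handles the rest:

```c
#include <assert.h>

/* Iteration split for a loop of n iterations vectorized with VF == 8:
   n & -8 clears the low three bits, i.e. rounds n down to a multiple
   of 8; the scalar epilogue runs the remaining n % 8 iterations. */
static int main_loop_iters (int n) { return n & -8; }
static int epilogue_iters (int n)  { return n - (n & -8); }
```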
Re: [PATCH] doc: Add more alias option and reorder Intel CPU -march documentation
On Wed, Sep 18, 2024 at 1:35 PM Haochen Jiang wrote: > > Hi all, > > Since r15-3539, there are requests coming in to add other alias option > documentation. This patch will add all of them, including corei7, corei7-avx, > core-avx-i, core-avx2, atom, slm, gracemont and emeraldrapids. > > Also in the patch, I reordered that part of the documentation; currently all > the CPUs/products are just all over the place. I regrouped them by > date-to-now products (from the very first CPU to the latest Panther Lake), P-core > (since clients became hybrid cores, starting from Sapphire Rapids) and > E-core (from Bonnell to the latest Clearwater Forest). > > And in the patch, I refined the product names in the documentation. > > Ok for trunk? Ok, please backport to the release branch. > > Thx, > Haochen > > gcc/ChangeLog: > > * doc/invoke.texi: Add corei7, corei7-avx, core-avx-i, > core-avx2, atom, slm, gracemont and emeraldrapids. Reorder > the -march documentation by splitting them into date-to-now > products, P-core and E-core. Refine the product names in > documentation. > --- > gcc/doc/invoke.texi | 234 +++- > 1 file changed, 121 insertions(+), 113 deletions(-) > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > index a6cd5111d47..23e1d8577e7 100644 > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -34598,6 +34598,7 @@ Intel Core 2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, CX16, > SAHF and FXSR instruction set support. > > @item nehalem > +@itemx corei7 > Intel Nehalem CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, > SSE4.1, SSE4.2, POPCNT, CX16, SAHF and FXSR instruction set support. > > @@ -34606,16 +34607,19 @@ Intel Westmere CPU with 64-bit extensions, MMX, > SSE, SSE2, SSE3, SSSE3, > SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR and PCLMUL instruction set support. 
> > @item sandybridge > +@itemx corei7-avx > Intel Sandy Bridge CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, > SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE and PCLMUL instruction > set > support. > > @item ivybridge > +@itemx core-avx-i > Intel Ivy Bridge CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, > SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND > and F16C instruction set support. > > @item haswell > +@itemx core-avx2 > Intel Haswell CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, > SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, > RDRND, > F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE and HLE instruction set support. > @@ -34632,61 +34636,6 @@ SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, > XSAVE, PCLMUL, FSGSBASE, RDRND, > F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX, PREFETCHW, AES, > CLFLUSHOPT, XSAVEC, XSAVES and SGX instruction set support. > > -@item bonnell > -Intel Bonnell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3 and > SSSE3 > -instruction set support. > - > -@item silvermont > -Intel Silvermont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, > SSSE3, > -SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW and RDRND > -instruction set support. > - > -@item goldmont > -Intel Goldmont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, > SSSE3, > -SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW, RDRND, AES, SHA, > -RDSEED, XSAVE, XSAVEC, XSAVES, XSAVEOPT, CLFLUSHOPT and FSGSBASE instruction > -set support. > - > -@item goldmont-plus > -Intel Goldmont Plus CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, > -SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW, RDRND, > AES, > -SHA, RDSEED, XSAVE, XSAVEC, XSAVES, XSAVEOPT, CLFLUSHOPT, FSGSBASE, PTWRITE, > -RDPID and SGX instruction set support. 
> - > -@item tremont > -Intel Tremont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, > -SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW, RDRND, AES, SHA, > -RDSEED, XSAVE, XSAVEC, XSAVES, XSAVEOPT, CLFLUSHOPT, FSGSBASE, PTWRITE, > RDPID, > -SGX, CLWB, GFNI-SSE, MOVDIRI, MOVDIR64B, CLDEMOTE and WAITPKG instruction set > -support. > - > -@item sierraforest > -Intel Sierra Forest CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, > -SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL, RDRND, XSAVE, XSAVEC, > -XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI, > -MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT, > -PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI, > -AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, ENQCMD and UINTR instruction > set > -support. > - > -@item grandridge > -Intel Grand Ridge CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, > -SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL, RDRND, XSAVE, XSAVEC, > -XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI, > -MOVDIR64B, CLDEMOTE,
Re: [PATCH] i386: Add missing avx512f-mask-type.h include
On Wed, Sep 18, 2024 at 1:40 PM Haochen Jiang wrote: > > Hi all, > > Since commit r15-3594, we fixed the bugs in MASK_TYPE for AVX10.2 > testcases, but we missed the following four. > > The tests do not FAIL since the binutils part hasn't been merged > yet, which leads to UNSUPPORTED tests. But the avx512f-mask-type.h > needs to be included, otherwise it will be a compile error. > > Tested with assembler having those insts and sde. Ok for trunk? Ok. > > Thx, > Haochen > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10_2-512-vpdpbssd-2.c: Include > avx512f-mask-type.h. > * gcc.target/i386/avx10_2-vminmaxsd-2.c: Ditto. > * gcc.target/i386/avx10_2-vminmaxsh-2.c: Ditto. > * gcc.target/i386/avx10_2-vminmaxss-2.c: Ditto. > --- > gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c | 2 ++ > gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c| 1 + > gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c| 1 + > gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c| 1 + > 4 files changed, 5 insertions(+) > > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c > index add9de89351..624a1a8e50e 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vpdpbssd-2.c > @@ -13,6 +13,8 @@ > #define SRC_SIZE (AVX512F_LEN / 8) > #define SIZE (AVX512F_LEN / 32) > > +#include "avx512f-mask-type.h" > + > static void > CALC (int *r, int *dst, char *s1, char *s2) > { > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c > index 1e2d78c4068..f550e09be6c 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsd-2.c > @@ -8,6 +8,7 @@ > #include "avx10-helper.h" > #include > #include "avx10-minmax-helper.h" > +#include "avx512f-mask-type.h" > > void static > CALC (double *r, double *s1, double *s2, int R) > diff --git 
a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c > index e6a93c403b5..dbf1087d9c3 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxsh-2.c > @@ -8,6 +8,7 @@ > #include "avx10-helper.h" > #include > #include "avx10-minmax-helper.h" > +#include "avx512f-mask-type.h" > > void static > CALC (_Float16 *r, _Float16 *s1, _Float16 *s2, int R) > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c > index 47177e69640..7baa396a2d3 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-vminmaxss-2.c > @@ -8,6 +8,7 @@ > #include "avx10-helper.h" > #include > #include "avx10-minmax-helper.h" > +#include "avx512f-mask-type.h" > > void static > CALC (float *r, float *s1, float *s2, int R) > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH] i386: Enhance AVX10.2 convert tests
On Wed, Sep 18, 2024 at 1:42 PM Haochen Jiang wrote: > > Hi all, > > For AVX10.2 convert tests, all of them are missing mask tests > previously, this patch will add them in the tests. > > Tested on sde with assembler with these insts. Ok for trunk? Ok. > > Thx, > Haochen > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c: Enhance mask test. > * gcc.target/i386/avx10_2-512-vcvthf82ph-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtne2ph2bf8-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtne2ph2bf8s-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtne2ph2hf8-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtne2ph2hf8s-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtneph2bf8-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtneph2bf8s-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtneph2hf8-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtneph2hf8s-2.c: Ditto. > * gcc.target/i386/avx512f-helper.h: Fix a typo in macro define. > --- > .../i386/avx10_2-512-vcvt2ps2phx-2.c | 35 --- > .../i386/avx10_2-512-vcvthf82ph-2.c | 27 ++ > .../i386/avx10_2-512-vcvtne2ph2bf8-2.c| 25 ++--- > .../i386/avx10_2-512-vcvtne2ph2bf8s-2.c | 25 ++--- > .../i386/avx10_2-512-vcvtne2ph2hf8-2.c| 25 ++--- > .../i386/avx10_2-512-vcvtne2ph2hf8s-2.c | 25 ++--- > .../i386/avx10_2-512-vcvtneph2bf8-2.c | 29 ++- > .../i386/avx10_2-512-vcvtneph2bf8s-2.c| 27 ++ > .../i386/avx10_2-512-vcvtneph2hf8-2.c | 27 ++ > .../i386/avx10_2-512-vcvtneph2hf8s-2.c| 27 ++ > .../gcc.target/i386/avx512f-helper.h | 2 +- > 11 files changed, 209 insertions(+), 65 deletions(-) > > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c > index 40dbe18abbe..5e355ae53d4 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvt2ps2phx-2.c > @@ -10,24 +10,25 @@ > #include "avx10-helper.h" > #include > > -#define SIZE_RES (AVX512F_LEN / 16) > +#define SIZE (AVX512F_LEN / 
16) > +#include "avx512f-mask-type.h" > > static void > CALC (_Float16 *res_ref, float *src1, float *src2) > { >float fp32; >int i; > - for (i = 0; i < SIZE_RES / 2; i++) > + for (i = 0; i < SIZE / 2; i++) > { >fp32 = (float) 2 * i + 7 + i * 0.5; >res_ref[i] = fp32; >src2[i] = fp32; > } > - for (i = SIZE_RES / 2; i < SIZE_RES; i++) > + for (i = SIZE / 2; i < SIZE; i++) > { >fp32 = (float)2 * i + 7 + i * 0.5; >res_ref[i] = fp32; > - src1[i - (SIZE_RES / 2)] = fp32; > + src1[i - (SIZE / 2)] = fp32; > } > } > > @@ -35,17 +36,27 @@ void > TEST (void) > { >int i; > - UNION_TYPE (AVX512F_LEN, h) res1; > + UNION_TYPE (AVX512F_LEN, h) res1, res2, res3; >UNION_TYPE (AVX512F_LEN, ) src1, src2; > - _Float16 res_ref[SIZE_RES]; > - float fp32; > - > - for (i = 0; i < SIZE_RES; i++) > -res1.a[i] = 5; > - > + MASK_TYPE mask = MASK_VALUE; > + _Float16 res_ref[SIZE]; > + > + for (i = 0; i < SIZE; i++) > +res2.a[i] = DEFAULT_VALUE; > + >CALC (res_ref, src1.a, src2.a); > - > + >res1.x = INTRINSIC (_cvtx2ps_ph) (src1.x, src2.x); >if (UNION_CHECK (AVX512F_LEN, h) (res1, res_ref)) > abort (); > + > + res2.x = INTRINSIC (_mask_cvtx2ps_ph) (res2.x, mask, src1.x, src2.x); > + MASK_MERGE (h) (res_ref, mask, SIZE); > + if (UNION_CHECK (AVX512F_LEN, h) (res2, res_ref)) > +abort (); > + > + res3.x = INTRINSIC (_maskz_cvtx2ps_ph) (mask, src1.x, src2.x); > + MASK_ZERO (h) (res_ref, mask, SIZE); > + if (UNION_CHECK (AVX512F_LEN, h) (res3, res_ref)) > +abort (); > } > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvthf82ph-2.c > b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvthf82ph-2.c > index 6b9f07ff86a..1aa5daa6c58 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvthf82ph-2.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-vcvthf82ph-2.c > @@ -12,13 +12,14 @@ > #include "fp8-helper.h" > > #define SIZE_SRC (AVX512F_LEN_HALF / 8) > -#define SIZE_RES (AVX512F_LEN / 16) > +#define SIZE (AVX512F_LEN / 16) > +#include "avx512f-mask-type.h" > > void > CALC (_Float16 *r, unsigned 
char *s) > { >int i; > - for (i = 0; i < SIZE_RES; i++) > + for (i = 0; i < SIZE; i++) > r[i] = convert_hf8_to_fp16(s[i]); > } > > @@ -26,9 +27,10 @@ void > TEST (void) > { >int i,sign; > - UNION_TYPE (AVX512F_LEN, h) res; > + UNION_TYPE (AVX512F_LEN, h) res1, res2, res3; >UNION_TYPE (AVX512F_LEN_HALF, i_b) src; > - _Float16 res_ref[SIZE_RES]; > + MASK_TYPE mask = MASK_VALUE; > + _Float16 res_ref[SIZE]; > >sign = 1; >for (i = 0; i < SIZE_SRC; i++) > @@ -37,9 +39,22 @@ TEST (void) >sign = -sign; >
Re: [PATCH] i386: Add ssemov2, sseicvt2 for some load instructions that use memory on operand2
On Thu, Sep 19, 2024 at 9:34 AM Hu, Lin1 wrote: > > Hi, all > > The memory attr of some instructions should be 'load', but it is 'none' > currently. > > This patch adds two new types ssemov2, sseicvt2 for some load instructions that > use memory on operands. So their memory attr will be 'load'. > > Bootstrapped and Regtested on x86-64-pc-linux-gnu, OK for trunk? Ok. > > BRs > Lin > > gcc/ChangeLog: > > * config/i386/i386.md: Add ssemov2, sseicvt2. > * config/i386/sse.md (sse2_cvtsi2sd): Apply sseicvt2. > (sse2_cvtsi2sdq): Ditto. > (vec_set_0): Apply ssemov2 for 4, 6. > --- > gcc/config/i386/i386.md | 11 +++ > gcc/config/i386/sse.md | 6 -- > 2 files changed, 11 insertions(+), 6 deletions(-) > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md > index c0441514949..9c2a0aa6112 100644 > --- a/gcc/config/i386/i386.md > +++ b/gcc/config/i386/i386.md > @@ -539,10 +539,10 @@ (define_attr "type" > str,bitmanip, > fmov,fop,fsgn,fmul,fdiv,fpspc,fcmov,fcmp, > fxch,fistp,fisttp,frndint, > - sse,ssemov,sseadd,sseadd1,sseiadd,sseiadd1, > + sse,ssemov,ssemov2,sseadd,sseadd1,sseiadd,sseiadd1, > ssemul,sseimul,ssediv,sselog,sselog1, > sseishft,sseishft1,ssecmp,ssecomi, > - ssecvt,ssecvt1,sseicvt,sseins, > + ssecvt,ssecvt1,sseicvt,sseicvt2,sseins, > sseshuf,sseshuf1,ssemuladd,sse4arg, > lwp,mskmov,msklog, > mmx,mmxmov,mmxadd,mmxmul,mmxcmp,mmxcvt,mmxshft" > @@ -560,10 +560,10 @@ (define_attr "unit" "integer,i387,sse,mmx,unknown" >(cond [(eq_attr "type" "fmov,fop,fsgn,fmul,fdiv,fpspc,fcmov,fcmp, > fxch,fistp,fisttp,frndint") >(const_string "i387") > -(eq_attr "type" "sse,ssemov,sseadd,sseadd1,sseiadd,sseiadd1, > +(eq_attr "type" "sse,ssemov,ssemov2,sseadd,sseadd1,sseiadd,sseiadd1, > ssemul,sseimul,ssediv,sselog,sselog1, > sseishft,sseishft1,ssecmp,ssecomi, > - ssecvt,ssecvt1,sseicvt,sseins, > + ssecvt,ssecvt1,sseicvt,sseicvt2,sseins, > sseshuf,sseshuf1,ssemuladd,sse4arg,mskmov") >(const_string "sse") > (eq_attr "type" "mmx,mmxmov,mmxadd,mmxmul,mmxcmp,mmxcvt,mmxshft") > 
@@ -858,6 +858,9 @@ (define_attr "memory" "none,load,store,both,unknown" >mmx,mmxmov,mmxcmp,mmxcvt,mskmov,msklog") > (match_operand 2 "memory_operand")) >(const_string "load") > +(and (eq_attr "type" "ssemov2,sseicvt2") > + (match_operand 2 "memory_operand")) > + (const_string "load") > (and (eq_attr "type" "icmov,ssemuladd,sse4arg") > (match_operand 3 "memory_operand")) >(const_string "load") > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index 1ae61182d0c..ff4f33b7b63 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -8876,7 +8876,7 @@ (define_insn "sse2_cvtsi2sd" > cvtsi2sd{l}\t{%2, %0|%0, %2} > vcvtsi2sd{l}\t{%2, %1, %0|%0, %1, %2}" >[(set_attr "isa" "noavx,noavx,avx") > - (set_attr "type" "sseicvt") > + (set_attr "type" "sseicvt2") > (set_attr "athlon_decode" "double,direct,*") > (set_attr "amdfam10_decode" "vector,double,*") > (set_attr "bdver1_decode" "double,direct,*") > @@ -8898,7 +8898,7 @@ (define_insn "sse2_cvtsi2sdq" > cvtsi2sd{q}\t{%2, %0|%0, %2} > vcvtsi2sd{q}\t{%2, %1, %0|%0, %1, %2}" >[(set_attr "isa" "noavx,noavx,avx") > - (set_attr "type" "sseicvt") > + (set_attr "type" "sseicvt2") > (set_attr "athlon_decode" "double,direct,*") > (set_attr "amdfam10_decode" "vector,double,*") > (set_attr "bdver1_decode" "double,direct,*") > @@ -11808,6 +11808,8 @@ (define_insn "vec_set_0" > (const_string "imov") > (eq_attr "alternative" "14") > (const_string "fmov") > + (eq_attr "alternative" "4,6") > + (const_string "ssemov2") >] >(const_string "ssemov"))) > (set (attr "addr") > -- > 2.31.1 > -- BR, Hongtao
Re: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.
On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu wrote: > > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener > wrote: > > > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt wrote: > > > > > > GCC12 enables vectorization for O2 with very cheap cost model which is > > > restricted > > > to constant tripcount. The vectorization capacity is very limited w/ > > > consideration > > > of codesize impact. > > > > > > The patch extends the very cheap cost model a little bit to support > > > variable tripcount. > > > But still disable peeling for gaps/alignment, runtime aliasing checking > > > and epilogue > > > vectorization with the consideration of codesize. > > > > > > So there're at most 2 versions of loop for O2 vectorization, one > > > vectorized main loop, one scalar/remainder loop. > > > > > > i.e. > > > > > > void > > > foo1 (int* __restrict a, int* b, int* c, int n) > > > { > > > for (int i = 0; i != n; i++) > > > a[i] = b[i] + c[i]; > > > } > > > > > > with -O2 -march=x86-64-v3, will be vectorized to > > > > > > .L10: > > > vmovdqu (%r8,%rax), %ymm0 > > > vpaddd (%rsi,%rax), %ymm0, %ymm0 > > > vmovdqu %ymm0, (%rdi,%rax) > > > addq $32, %rax > > > cmpq %rdx, %rax > > > jne .L10 > > > movl %ecx, %eax > > > andl $-8, %eax > > > cmpl %eax, %ecx > > > je .L21 > > > vzeroupper > > > .L12: > > > movl (%r8,%rax,4), %edx > > > addl (%rsi,%rax,4), %edx > > > movl %edx, (%rdi,%rax,4) > > > addq $1, %rax > > > cmpl %eax, %ecx > > > jne .L12 > > > > > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance > > > by 4.11% > > > with extra 2.8% codesize, and cheap cost model improves performance by > > > 5.74% with > > > extra 8.88% codesize. The details are as below > > > > I'm confused by this, are the N-Iter numbers on top of the cheap cost > > model numbers? > No, it's N-iter vs base (very cheap cost model), and cheap vs base. 
> > > Performance measured with -march=x86-64-v3 -O2 on EMR > > > > > > N-Iter cheap cost model > > > 500.perlbench_r -0.12% -0.12% > > > 502.gcc_r 0.44% -0.11% > > > 505.mcf_r 0.17% 4.46% > > > 520.omnetpp_r 0.28% -0.27% > > > 523.xalancbmk_r 0.00% 5.93% > > > 525.x264_r -0.09% 23.53% > > > 531.deepsjeng_r 0.19% 0.00% > > > 541.leela_r 0.22% 0.00% > > > 548.exchange2_r -11.54% -22.34% > > > 557.xz_r 0.74% 0.49% > > > GEOMEAN INT -1.04% 0.60% > > > > > > 503.bwaves_r 3.13% 4.72% > > > 507.cactuBSSN_r 1.17% 0.29% > > > 508.namd_r 0.39% 6.87% > > > 510.parest_r 3.14% 8.52% > > > 511.povray_r 0.10% -0.20% > > > 519.lbm_r -0.68% 10.14% > > > 521.wrf_r 68.20% 76.73% > > > > So this seems to regress as well? > Niter increases performance less than the cheap cost model, that's > expected, it is not a regression. > > > > > 526.blender_r 0.12% 0.12% > > > 527.cam4_r 19.67% 23.21% > > > 538.imagick_r 0.12% 0.24% > > > 544.nab_r 0.63% 0.53% > > > 549.fotonik3d_r 14.44% 9.43% > > > 554.roms_r 12.39% 0.00% > > > GEOMEAN FP 8.26% 9.41% > > > GEOMEAN ALL 4.11% 5.74% I've tested the patch on aarch64, it shows similar improvement with little codesize increase. I haven't tested it on other backends, but I think it would have similar good improvements. > > > > > > Code size impact > > > N-Iter cheap cost model > > > 500.perlbench_r 0.22% 1.03% > > > 502.gcc_r 0.25% 0.60% > > > 505.mcf_r 0.00% 32.07% > > > 520.omnetpp_r 0.09% 0.31% > > > 523.xalancbmk_r 0.08% 1.86% > > > 525.x264_r 0.75% 7.96% > > > 531.deepsjeng_
Re: [PATCH v2] Enable V2BF/V4BF vec_cmp with AVX10.2 vcmppbf16
On Thu, Sep 12, 2024 at 9:55 AM Levy Hsu wrote: > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? Ok. > > gcc/ChangeLog: > > * config/i386/i386.cc (ix86_get_mask_mode): > Enable BFmode for targetm.vectorize.get_mask_mode with AVX10.2. > * config/i386/mmx.md (vec_cmpqi): > Implement vec_cmpv2bfqi and vec_cmpv4bfqi. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/part-vect-vec_cmpbf.c: New test. > --- > gcc/config/i386/i386.cc | 3 ++- > gcc/config/i386/mmx.md| 17 > .../gcc.target/i386/part-vect-vec_cmpbf.c | 26 +++ > 3 files changed, 45 insertions(+), 1 deletion(-) > create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-vec_cmpbf.c > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index 45320124b91..7dbae1d72e3 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -24682,7 +24682,8 @@ ix86_get_mask_mode (machine_mode data_mode) >/* AVX512FP16 only supports vector comparison > to kmask for _Float16. */ >|| (TARGET_AVX512VL && TARGET_AVX512FP16 > - && GET_MODE_INNER (data_mode) == E_HFmode)) > + && GET_MODE_INNER (data_mode) == E_HFmode) > + || (TARGET_AVX10_2_256 && GET_MODE_INNER (data_mode) == E_BFmode)) > { >if (elem_size == 4 > || elem_size == 8 > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md > index 4bc191b874b..95d9356694a 100644 > --- a/gcc/config/i386/mmx.md > +++ b/gcc/config/i386/mmx.md > @@ -2290,6 +2290,23 @@ >DONE; > }) > > +;;This instruction does not generate floating point exceptions > +(define_expand "vec_cmpqi" > + [(set (match_operand:QI 0 "register_operand") > + (match_operator:QI 1 "" > + [(match_operand:VBF_32_64 2 "register_operand") > + (match_operand:VBF_32_64 3 "nonimmediate_operand")]))] > + "TARGET_AVX10_2_256" > +{ > + rtx op2 = lowpart_subreg (V8BFmode, > + force_reg (mode, operands[2]), mode); > + rtx op3 = lowpart_subreg (V8BFmode, > + force_reg (mode, operands[3]), mode); > + > + emit_insn (gen_vec_cmpv8bfqi (operands[0], operands[1], op2, 
op3)); > + DONE; > +}) > + > ; > ;; > ;; Parallel half-precision floating point rounding operations. > diff --git a/gcc/testsuite/gcc.target/i386/part-vect-vec_cmpbf.c > b/gcc/testsuite/gcc.target/i386/part-vect-vec_cmpbf.c > new file mode 100644 > index 000..0bb720b6432 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/part-vect-vec_cmpbf.c > @@ -0,0 +1,26 @@ > +/* { dg-do compile { target { ! ia32 } } } */ > +/* { dg-options "-O2 -mavx10.2" } */ > +/* { dg-final { scan-assembler-times "vcmppbf16" 10 } } */ > + > +typedef __bf16 __attribute__((__vector_size__ (4))) v2bf; > +typedef __bf16 __attribute__((__vector_size__ (8))) v4bf; > + > + > +#define VCMPMN(type, op, name) \ > +type \ > +__attribute__ ((noinline, noclone)) \ > +vec_cmp_##type##type##name (type a, type b) \ > +{ \ > + return a op b; \ > +} > + > +VCMPMN (v4bf, <, lt) > +VCMPMN (v2bf, <, lt) > +VCMPMN (v4bf, <=, le) > +VCMPMN (v2bf, <=, le) > +VCMPMN (v4bf, >, gt) > +VCMPMN (v2bf, >, gt) > +VCMPMN (v4bf, >=, ge) > +VCMPMN (v2bf, >=, ge) > +VCMPMN (v4bf, ==, eq) > +VCMPMN (v2bf, ==, eq) > -- > 2.31.1 > -- BR, Hongtao
Re: [RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.
On Wed, Sep 11, 2024 at 4:04 PM Richard Biener wrote: > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt wrote: > > > > GCC12 enables vectorization for O2 with very cheap cost model which is > > restricted > > to constant tripcount. The vectorization capacity is very limited w/ > > consideration > > of codesize impact. > > > > The patch extends the very cheap cost model a little bit to support > > variable tripcount. > > But still disable peeling for gaps/alignment, runtime aliasing checking and > > epilogue > > vectorization with the consideration of codesize. > > > > So there're at most 2 versions of loop for O2 vectorization, one vectorized > > main loop, one scalar/remainder loop. > > > > i.e. > > > > void > > foo1 (int* __restrict a, int* b, int* c, int n) > > { > > for (int i = 0; i != n; i++) > > a[i] = b[i] + c[i]; > > } > > > > with -O2 -march=x86-64-v3, will be vectorized to > > > > .L10: > > vmovdqu (%r8,%rax), %ymm0 > > vpaddd (%rsi,%rax), %ymm0, %ymm0 > > vmovdqu %ymm0, (%rdi,%rax) > > addq $32, %rax > > cmpq %rdx, %rax > > jne .L10 > > movl %ecx, %eax > > andl $-8, %eax > > cmpl %eax, %ecx > > je .L21 > > vzeroupper > > .L12: > > movl (%r8,%rax,4), %edx > > addl (%rsi,%rax,4), %edx > > movl %edx, (%rdi,%rax,4) > > addq $1, %rax > > cmpl %eax, %ecx > > jne .L12 > > > > As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance by > > 4.11% > > with extra 2.8% codesize, and cheap cost model improves performance by 5.74% > > with > > extra 8.88% codesize. The details are as below > > I'm confused by this, are the N-Iter numbers on top of the cheap cost > model numbers? No, it's N-iter vs base (very cheap cost model), and cheap vs base. 
> > Performance measured with -march=x86-64-v3 -O2 on EMR > > > > N-Iter cheap cost model > > 500.perlbench_r -0.12% -0.12% > > 502.gcc_r 0.44% -0.11% > > 505.mcf_r 0.17% 4.46% > > 520.omnetpp_r 0.28% -0.27% > > 523.xalancbmk_r 0.00% 5.93% > > 525.x264_r -0.09% 23.53% > > 531.deepsjeng_r 0.19% 0.00% > > 541.leela_r 0.22% 0.00% > > 548.exchange2_r -11.54% -22.34% > > 557.xz_r 0.74% 0.49% > > GEOMEAN INT -1.04% 0.60% > > > > 503.bwaves_r 3.13% 4.72% > > 507.cactuBSSN_r 1.17% 0.29% > > 508.namd_r 0.39% 6.87% > > 510.parest_r 3.14% 8.52% > > 511.povray_r 0.10% -0.20% > > 519.lbm_r -0.68% 10.14% > > 521.wrf_r 68.20% 76.73% > > So this seems to regress as well? Niter increases performance less than the cheap cost model, that's expected, it is not a regression. > > 526.blender_r 0.12% 0.12% > > 527.cam4_r 19.67% 23.21% > > 538.imagick_r 0.12% 0.24% > > 544.nab_r 0.63% 0.53% > > 549.fotonik3d_r 14.44% 9.43% > > 554.roms_r 12.39% 0.00% > > GEOMEAN FP 8.26% 9.41% > > GEOMEAN ALL 4.11% 5.74% > > > > Code size impact > > N-Iter cheap cost model > > 500.perlbench_r 0.22% 1.03% > > 502.gcc_r 0.25% 0.60% > > 505.mcf_r 0.00% 32.07% > > 520.omnetpp_r 0.09% 0.31% > > 523.xalancbmk_r 0.08% 1.86% > > 525.x264_r 0.75% 7.96% > > 531.deepsjeng_r 0.72% 3.28% > > 541.leela_r 0.18% 0.75% > > 548.exchange2_r 8.29% 12.19% > > 557.xz_r 0.40% 0.60% > > GEOMEAN INT 1.07% 5.71% > > > > 503.bwaves_r 12.89% 21.59% > > 507.cactuBSSN_r 0.90% 20.19% > > 508.namd_r 0.77% 14.75% > > 510.parest_r 0.91% 3.91% > > 511.povray_r 0.45% 4.08% > > 519.lbm_r 0.00% 0.00% > > 521.wrf_r 5.97% 12.79% > > 526.blender_r 0.49% 3.84% > > 527.cam4_r 1.39% 3.28% > > 538.imagick_r 1.86% 7.78% > > 544.nab_r 0.41% 3.00% > > 549.fotonik3d_r 25.50% 47.47% > > 554.roms_r 5.17% 13.01% > > GEOMEAN FP 4.14% 11.38% > > GEOMEAN ALL 2.80% 8.88% > > > > > > The only regression is from 548.exchange2_r, the vectorization for inner > > loop in each layer > > of the 9-layer loops increases register pressure and causes more spills. 
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10 > > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10 > > . > > - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10 > > ... > > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10 > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10 > > > > Looks like aarch64 doesn't have the issue because aarch64 has
Re: [PATCH] i386: Fix incorrect avx512f-mask-type.h include
On Thu, Sep 5, 2024 at 10:05 AM Haochen Jiang wrote: > > Hi all, > > In avx512f-mask-type.h, we need SIZE to be defined to get > MASK_TYPE defined correctly. Fix those testcases where > SIZE is not defined before the include for avx512f-mask-type.h. > > Note that for convert intrins in AVX10.2, they will need more > modifications because the current tests did not include mask ones. > They will be in a separate patch. > > Tested on x86-64-pc-linux-gnu. Ok for trunk? Ok. > > Thx, > Haochen > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10-helper.h: Do not include > avx512f-mask-type.h. > * gcc.target/i386/avx10_2-512-vaddnepbf16-2.c: > Define SIZE and include avx512f-mask-type.h. > * gcc.target/i386/avx10_2-512-vcmppbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtnebf162ibs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtnebf162iubs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtph2ibs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtph2iubs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtps2ibs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvtps2iubs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttnebf162ibs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttnebf162iubs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttpd2dqs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttpd2qqs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttpd2udqs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttpd2uqqs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttph2ibs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttph2iubs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttps2dqs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttps2ibs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttps2iubs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttps2qqs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttps2udqs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vcvttps2uqqs-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vdivnepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vdpphps-2.c: Ditto. 
> * gcc.target/i386/avx10_2-512-vfmaddXXXnepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vfmsubXXXnepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vfnmaddXXXnepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vfnmsubXXXnepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vfpclasspbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vgetexppbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vgetmantpbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vmaxpbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vminmaxnepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vminmaxpd-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vminmaxph-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vminmaxps-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vminpbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vmpsadbw-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vmulnepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpbssd-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpbssds-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpbsud-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpbsuds-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpbuud-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpbuuds-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpwsud-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpwsuds-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpwusd-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpwusds-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpwuud-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vpdpwuuds-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vrcppbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vreducenepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vrndscalenepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vrsqrtpbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vscalefpbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vsqrtnepbf16-2.c: Ditto. > * gcc.target/i386/avx10_2-512-vsubnepbf16-2.c: Ditto. > * gcc.target/i386/avx512fp16-vfpclassph-1b.c: Ditto. 
> --- > gcc/testsuite/gcc.target/i386/avx10-helper.h | 1 - > .../i386/avx10_2-512-vaddnepbf16-2.c | 11 +- > .../gcc.target/i386/avx10_2-512-vcmppbf16-2.c | 5 +++-- > .../i386/avx10_2-512-vcvtnebf162ibs-2.c | 16 +++--- > .../i386/avx10_2-512-vcvtnebf162iubs-2.c | 16 +++--- > .../i386/avx10_2-512-vcvtph2ibs-2.c | 16 +++--- > .../i386/avx10_2-512-vcvtph2iubs-2.c | 16 +++--- > .../i386/avx10_2-512-vcvtps2ibs-2.c | 16 +++--- > .../i386/avx10_2-
Re: [PATCH] x86: Refine V4BF/V2BF FMA Testcase
On Tue, Sep 10, 2024 at 3:35 PM Levy Hsu wrote: > > Simple testcase fix, ok for trunk? Ok. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c: Separated 32-bit > scan > and removed register checks in spill situations. > --- > .../i386/avx10_2-partial-bf-vector-fma-1.c | 12 > 1 file changed, 8 insertions(+), 4 deletions(-) > > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > index 72e17e99603..8a9096a300a 100644 > --- a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > @@ -1,9 +1,13 @@ > /* { dg-do compile } */ > /* { dg-options "-mavx10.2 -O2" } */ > -/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[ > \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ > \\t\]+#)" 2 } } */ > -/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[ > \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ > \\t\]+#)" 2 } } */ > -/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[ > \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ > \\t\]+#)" 2 } } */ > -/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[ > \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ > \\t\]+#)" 2 } } */ > +/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" > 3 { target ia32 } } } */ > +/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" > 3 { target ia32 } } } */ > +/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" > 3 { target ia32 } } } */ > +/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" > 3 { target ia32 } } } */ > +/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" > 2 { target { ! 
ia32 } } } } */ > +/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" > 2 { target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" > 2 { target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" > 2 { target { ! ia32 } } } } */ > > typedef __bf16 v4bf __attribute__ ((__vector_size__ (8))); > typedef __bf16 v2bf __attribute__ ((__vector_size__ (4))); > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH] x86: Refine V4BF/V2BF FMA testcase
On Fri, Sep 6, 2024 at 10:34 AM Jiang, Haochen wrote: > > > From: Levy Hsu > > Sent: Thursday, September 5, 2024 4:55 PM > > To: gcc-patches@gcc.gnu.org > > > > Simple testcase fix, ok for trunk? > > > > This patch removes specific register checks to account for possible > > register spills and disables tests in 32-bit mode. This adjustment > > is necessary because V4BF operations in 32-bit mode require duplicating > > instructions, which leads to unintended test failures. It fixed the > > case when testing with --target_board='unix{-m32\ -march=cascadelake}' > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c: Remove specific > > register checks to account for potential register spills. Exclude > > tests > > in 32-bit mode to prevent incorrect failure reports due to the need > > for > > multiple instruction executions in handling V4BF operations. > > --- > .../gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c | 8 > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > > b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > > index 72e17e99603..17c32c1d36b 100644 > > --- a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > > @@ -1,9 +1,9 @@ > > /* { dg-do compile } */ > > You could simply add { target { ! ia32 } } here, but not each line of > scan-assembler-times. It can be compiled at target ia32; I guess for ia32, fma instructions can be scanned 3 times (1 for the original 32-bit vector fma, 2 from splitting the 64-bit vector fma into two 32-bit vector fmas). So better to scan for 2 fmas on ! ia32 and 3 fmas on ia32? > > I don't think we need this test to be run for -m32 due to V4BF. Actually > the better choice is to split the testcase into two parts; for V2BF, I suppose > it could be run under -m32. 
> > Thx, > Haochen > > > /* { dg-options "-mavx10.2 -O2" } */ > > -/* { dg-final { scan-assembler-times > > "vfmadd132nepbf16\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0- > > 9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */ > > -/* { dg-final { scan-assembler-times > > "vfmsub132nepbf16\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0- > > 9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */ > > -/* { dg-final { scan-assembler-times > > "vfnmadd132nepbf16\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0- > > 9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */ > > -/* { dg-final { scan-assembler-times > > "vfnmsub132nepbf16\[ \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0- > > 9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */ > > +/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[^\n\r\]*xmm\[0- > > 9\]" 2 { target { ! ia32 } } } } */ > > +/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[^\n\r\]*xmm\[0- > > 9\]" 2 { target { ! ia32 } } } } */ > > +/* { dg-final { scan-assembler-times > > "vfnmadd132nepbf16\[^\n\r\]*xmm\[0-9\]" 2 { target { ! ia32 } } } } */ > > +/* { dg-final { scan-assembler-times > > "vfnmsub132nepbf16\[^\n\r\]*xmm\[0-9\]" 2 { target { ! ia32 } } } } */ > > > > typedef __bf16 v4bf __attribute__ ((__vector_size__ (8))); > > typedef __bf16 v2bf __attribute__ ((__vector_size__ (4))); > > -- > > 2.31.1 > -- BR, Hongtao
Re: [PATCH] i386: Integrate BFmode for Enhanced Vectorization in ix86_preferred_simd_mode
On Wed, Sep 4, 2024 at 9:32 AM Levy Hsu wrote: > > Hi > > This change adds BFmode support to the ix86_preferred_simd_mode function > enhancing SIMD vectorization for BF16 operations. The update ensures > optimized usage of SIMD capabilities improving performance and aligning > vector sizes with processor capabilities. > > Bootstrapped and tested on x86-64-pc-linux-gnu. > Ok for trunk? Ok. > > gcc/ChangeLog: > > * config/i386/i386.cc (ix86_preferred_simd_mode): Add BFmode Support. > --- > gcc/config/i386/i386.cc | 8 > 1 file changed, 8 insertions(+) > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index 7af9ceca429..aea138c85ad 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -24570,6 +24570,14 @@ ix86_preferred_simd_mode (scalar_mode mode) > } >return word_mode; > > +case E_BFmode: > + if (TARGET_AVX512F && TARGET_EVEX512 && !TARGET_PREFER_AVX256) > + return V32BFmode; > + else if (TARGET_AVX && !TARGET_PREFER_AVX128) > + return V16BFmode; > + else > + return V8BFmode; > + > case E_SFmode: >if (TARGET_AVX512F && TARGET_EVEX512 && !TARGET_PREFER_AVX256) > return V16SFmode; > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH] i386: Support partial signbit/xorsign/copysign/abs/neg/and/xor/ior/andn for V2BF/V4BF
On Wed, Sep 4, 2024 at 10:53 AM Levy Hsu wrote: > > Hi > > This patch adds support for bf16 operations in V2BF and V4BF modes on i386, > handling signbit, xorsign, copysign, abs, neg, and various logical operations. > > Bootstrapped and tested on x86-64-pc-linux-gnu. > Ok for trunk? Ok. > > gcc/ChangeLog: > > * config/i386/i386.cc (ix86_build_const_vector): Add V2BF/V4BF. > (ix86_build_signbit_mask): Add V2BF/V4BF. > * config/i386/mmx.md: Modified supported logic op to use VHBF_32_64. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/part-vect-absnegbf.c: New test. > --- > gcc/config/i386/i386.cc | 4 + > gcc/config/i386/mmx.md| 74 + > .../gcc.target/i386/part-vect-absnegbf.c | 81 +++ > 3 files changed, 124 insertions(+), 35 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-absnegbf.c > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index 78bf890f14b..2bbfb1bf5fc 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -16176,6 +16176,8 @@ ix86_build_const_vector (machine_mode mode, bool > vect, rtx value) > case E_V32BFmode: > case E_V16BFmode: > case E_V8BFmode: > +case E_V4BFmode: > +case E_V2BFmode: >n_elt = GET_MODE_NUNITS (mode); >v = rtvec_alloc (n_elt); >scalar_mode = GET_MODE_INNER (mode); > @@ -16215,6 +16217,8 @@ ix86_build_signbit_mask (machine_mode mode, bool > vect, bool invert) > case E_V32BFmode: > case E_V16BFmode: > case E_V8BFmode: > +case E_V4BFmode: > +case E_V2BFmode: >vec_mode = mode; >imode = HImode; >break; > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md > index cb2697537a8..44adcd8d8e0 100644 > --- a/gcc/config/i386/mmx.md > +++ b/gcc/config/i386/mmx.md > @@ -121,7 +121,7 @@ > ;; Mapping of vector float modes to an integer mode of the same size > (define_mode_attr mmxintvecmode >[(V2SF "V2SI") (V2SI "V2SI") (V4HI "V4HI") (V8QI "V8QI") > - (V4HF "V4HI") (V2HF "V2HI")]) > + (V4HF "V4HI") (V2HF "V2HI") (V4BF "V4HI") (V2BF "V2HI")]) > > (define_mode_attr 
mmxintvecmodelower >[(V2SF "v2si") (V2SI "v2si") (V4HI "v4hi") (V8QI "v8qi") > @@ -2091,18 +2091,22 @@ >DONE; > }) > > +(define_mode_iterator VHBF_32_64 > + [V2BF (V4BF "TARGET_MMX_WITH_SSE") > + V2HF (V4HF "TARGET_MMX_WITH_SSE")]) > + > (define_expand "2" > - [(set (match_operand:VHF_32_64 0 "register_operand") > - (absneg:VHF_32_64 > - (match_operand:VHF_32_64 1 "register_operand")))] > + [(set (match_operand:VHBF_32_64 0 "register_operand") > + (absneg:VHBF_32_64 > + (match_operand:VHBF_32_64 1 "register_operand")))] >"TARGET_SSE" >"ix86_expand_fp_absneg_operator (, mode, operands); DONE;") > > (define_insn_and_split "*mmx_" > - [(set (match_operand:VHF_32_64 0 "register_operand" "=x,x,x") > - (absneg:VHF_32_64 > - (match_operand:VHF_32_64 1 "register_operand" "0,x,x"))) > - (use (match_operand:VHF_32_64 2 "register_operand" "x,0,x"))] > + [(set (match_operand:VHBF_32_64 0 "register_operand" "=x,x,x") > + (absneg:VHBF_32_64 > + (match_operand:VHBF_32_64 1 "register_operand" "0,x,x"))) > + (use (match_operand:VHBF_32_64 2 "register_operand" "x,0,x"))] >"TARGET_SSE" >"#" >"&& reload_completed" > @@ -2115,11 +2119,11 @@ >[(set_attr "isa" "noavx,noavx,avx")]) > > (define_insn_and_split "*mmx_nabs2" > - [(set (match_operand:VHF_32_64 0 "register_operand" "=x,x,x") > - (neg:VHF_32_64 > - (abs:VHF_32_64 > - (match_operand:VHF_32_64 1 "register_operand" "0,x,x" > - (use (match_operand:VHF_32_64 2 "register_operand" "x,0,x"))] > + [(set (match_operand:VHBF_32_64 0 "register_operand" "=x,x,x") > + (neg:VHBF_32_64 > + (abs:VHBF_32_64 > + (match_operand:VHBF_32_64 1 "register_operand" "0,x,x" > + (use (match_operand:VHBF_32_64 2 "register_operand" "x,0,x"))] >"TARGET_SSE" >"#" >"&& reload_completed" > @@ -2410,11 +2414,11 @@ > ; > > (define_insn "*mmx_andnot3" > - [(set (match_operand:VHF_32_64 0 "register_operand""=x,x") > - (and:VHF_32_64 > - (not:VHF_32_64 > - (match_operand:VHF_32_64 1 "register_operand" "0,x")) > - (match_operand:VHF_32_64 2 "register_operand" 
"x,x")))] > + [(set (match_operand:VHBF_32_64 0 "register_operand""=x,x") > + (and:VHBF_32_64 > + (not:VHBF_32_64 > + (match_operand:VHBF_32_64 1 "register_operand" "0,x")) > + (match_operand:VHBF_32_64 2 "register_operand" "x,x")))] >"TARGET_SSE" >"@ > andnps\t{%2, %0|%0, %2} > @@ -2425,10 +2429,10 @@ > (set_attr "mode" "V4SF")]) > > (define_insn "3" > - [(set (match_operand:VHF_32_64 0 "register_operand" "=x,x") > - (any_logic:VHF_32_64
Re: [PATCH] i386: Support partial vectorized FMA for V2BF/V4BF
On Wed, Sep 4, 2024 at 11:31 AM Levy Hsu wrote: > > Hi > > Bootstrapped and tested on x86-64-pc-linux-gnu. > Ok for trunk? Ok. > > This patch introduces support for vectorized FMA operations for bf16 types in > V2BF and V4BF modes on the i386 architecture. New mode iterators and > define_expand entries for fma, fnma, fms, and fnms operations are added in > mmx.md, enhancing the i386 backend to handle these complex arithmetic > operations. > > gcc/ChangeLog: > > * config/i386/mmx.md (TARGET_MMX_WITH_SSE): New mode iterator > VBF_32_64 > (fma4): define_expand for V2BF/V4BF fma4. > (fnma4): define_expand for V2BF/V4BF fnma4. > (fms4): define_expand for V2BF/V4BF fms4. > (fnms4): define_expand for V2BF/V4BF fnms4. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c: New test. > --- > gcc/config/i386/mmx.md| 84 ++- > .../i386/avx10_2-partial-bf-vector-fma-1.c| 57 + > 2 files changed, 139 insertions(+), 2 deletions(-) > create mode 100644 > gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fma-1.c > > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md > index 10fcd2beda6..22aeb43f436 100644 > --- a/gcc/config/i386/mmx.md > +++ b/gcc/config/i386/mmx.md > @@ -2636,6 +2636,88 @@ >DONE; > }) > > +(define_mode_iterator VBF_32_64 [V2BF (V4BF "TARGET_MMX_WITH_SSE")]) > + > +(define_expand "fma4" > + [(set (match_operand:VBF_32_64 0 "register_operand") > + (fma:VBF_32_64 > + (match_operand:VBF_32_64 1 "nonimmediate_operand") > + (match_operand:VBF_32_64 2 "nonimmediate_operand") > + (match_operand:VBF_32_64 3 "nonimmediate_operand")))] > + "TARGET_AVX10_2_256" > +{ > + rtx op0 = gen_reg_rtx (V8BFmode); > + rtx op1 = lowpart_subreg (V8BFmode, force_reg (mode, operands[1]), > mode); > + rtx op2 = lowpart_subreg (V8BFmode, force_reg (mode, operands[2]), > mode); > + rtx op3 = lowpart_subreg (V8BFmode, force_reg (mode, operands[3]), > mode); > + > + emit_insn (gen_fmav8bf4 (op0, op1, op2, op3)); > + > + emit_move_insn (operands[0], 
lowpart_subreg (mode, op0, V8BFmode)); > + DONE; > +}) > + > +(define_expand "fms4" > + [(set (match_operand:VBF_32_64 0 "register_operand") > + (fma:VBF_32_64 > + (match_operand:VBF_32_64 1 "nonimmediate_operand") > + (match_operand:VBF_32_64 2 "nonimmediate_operand") > + (neg:VBF_32_64 > + (match_operand:VBF_32_64 3 "nonimmediate_operand"] > + "TARGET_AVX10_2_256" > +{ > + rtx op0 = gen_reg_rtx (V8BFmode); > + rtx op1 = lowpart_subreg (V8BFmode, force_reg (mode, operands[1]), > mode); > + rtx op2 = lowpart_subreg (V8BFmode, force_reg (mode, operands[2]), > mode); > + rtx op3 = lowpart_subreg (V8BFmode, force_reg (mode, operands[3]), > mode); > + > + emit_insn (gen_fmsv8bf4 (op0, op1, op2, op3)); > + > + emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode)); > + DONE; > +}) > + > +(define_expand "fnma4" > + [(set (match_operand:VBF_32_64 0 "register_operand") > + (fma:VBF_32_64 > + (neg:VBF_32_64 > + (match_operand:VBF_32_64 1 "nonimmediate_operand")) > + (match_operand:VBF_32_64 2 "nonimmediate_operand") > + (match_operand:VBF_32_64 3 "nonimmediate_operand")))] > + "TARGET_AVX10_2_256" > +{ > + rtx op0 = gen_reg_rtx (V8BFmode); > + rtx op1 = lowpart_subreg (V8BFmode, force_reg (mode, operands[1]), > mode); > + rtx op2 = lowpart_subreg (V8BFmode, force_reg (mode, operands[2]), > mode); > + rtx op3 = lowpart_subreg (V8BFmode, force_reg (mode, operands[3]), > mode); > + > + emit_insn (gen_fnmav8bf4 (op0, op1, op2, op3)); > + > + emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode)); > + DONE; > +}) > + > +(define_expand "fnms4" > + [(set (match_operand:VBF_32_64 0 "register_operand") > + (fma:VBF_32_64 > + (neg:VBF_32_64 > + (match_operand:VBF_32_64 1 "nonimmediate_operand")) > + (match_operand:VBF_32_64 2 "nonimmediate_operand") > + (neg:VBF_32_64 > + (match_operand:VBF_32_64 3 "nonimmediate_operand"] > + "TARGET_AVX10_2_256" > +{ > + rtx op0 = gen_reg_rtx (V8BFmode); > + rtx op1 = lowpart_subreg (V8BFmode, force_reg (mode, operands[1]), > 
mode); > + rtx op2 = lowpart_subreg (V8BFmode, force_reg (mode, operands[2]), > mode); > + rtx op3 = lowpart_subreg (V8BFmode, force_reg (mode, operands[3]), > mode); > + > + emit_insn (gen_fnmsv8bf4 (op0, op1, op2, op3)); > + > + emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode)); > + DONE; > +}) > + > > ;; > ;; Parallel half-precision floating point complex type operations > @@ -6670,8 +6752,6 @@ > (set_attr "modrm" "0") > (set_attr "memory" "none")]) > > -(define_mode_iterator VBF_32_64 [V2BF (V4BF "TARGET_MMX_WITH_SSE")]) > - > ;; VDIVNEPBF16 does not
Re: [PATCH] i386: Fix vfpclassph non-optimizied intrin
On Tue, Sep 3, 2024 at 2:24 PM Haochen Jiang wrote: > > Hi all, > > The non-optimized intrin has a typo in the mask type, which causes > the high bits of the __mmask32 to be unexpectedly zeroed. > > The test does not fail under O0 with current 1b since the testcase is > wrong. We need to include avx512-mask-type.h after SIZE is defined, or > it will always be __mmask8. That problem also happened in AVX10.2 testcases. > I will write a separate patch to fix that. > > Bootstrapped and tested on x86-64-pc-linux-gnu. Ok for trunk? Ok, please backport. > > Thx, > Haochen > > gcc/ChangeLog: > > * config/i386/avx512fp16intrin.h > (_mm512_mask_fpclass_ph_mask): Correct mask type to __mmask32. > (_mm512_fpclass_ph_mask): Ditto. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx512fp16-vfpclassph-1c.c: New test. > --- > gcc/config/i386/avx512fp16intrin.h| 4 +- > .../i386/avx512fp16-vfpclassph-1c.c | 77 +++ > 2 files changed, 79 insertions(+), 2 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/avx512fp16-vfpclassph-1c.c > > diff --git a/gcc/config/i386/avx512fp16intrin.h > b/gcc/config/i386/avx512fp16intrin.h > index 1869a920dd3..c3096b74ad2 100644 > --- a/gcc/config/i386/avx512fp16intrin.h > +++ b/gcc/config/i386/avx512fp16intrin.h > @@ -3961,11 +3961,11 @@ _mm512_fpclass_ph_mask (__m512h __A, const int __imm) > #else > #define _mm512_mask_fpclass_ph_mask(u, x, c) \ >((__mmask32) __builtin_ia32_fpclassph512_mask ((__v32hf) (__m512h) (x), \ > -(int) (c),(__mmask8)(u))) > +(int) (c),(__mmask32)(u))) > > #define _mm512_fpclass_ph_mask(x, c)\ >((__mmask32) __builtin_ia32_fpclassph512_mask ((__v32hf) (__m512h) (x), \ > -(int) (c),(__mmask8)-1)) > +(int) (c),(__mmask32)-1)) > #endif /* __OPIMTIZE__ */ > > /* Intrinsics vgetexpph.
*/ > diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-vfpclassph-1c.c > b/gcc/testsuite/gcc.target/i386/avx512fp16-vfpclassph-1c.c > new file mode 100644 > index 000..4739f1228e3 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/avx512fp16-vfpclassph-1c.c > @@ -0,0 +1,77 @@ > +/* { dg-do run } */ > +/* { dg-options "-O0 -mavx512fp16" } */ > +/* { dg-require-effective-target avx512fp16 } */ > + > +#define AVX512FP16 > +#include "avx512f-helper.h" > + > +#include > +#include > +#include > +#define SIZE (AVX512F_LEN / 16) > +#include "avx512f-mask-type.h" > + > +#ifndef __FPCLASSPH__ > +#define __FPCLASSPH__ > +int check_fp_class_hp (_Float16 src, int imm) > +{ > + int qNaN_res = isnan (src); > + int sNaN_res = isnan (src); > + int Pzero_res = (src == 0.0); > + int Nzero_res = (src == -0.0); > + int PInf_res = (isinf (src) == 1); > + int NInf_res = (isinf (src) == -1); > + int Denorm_res = (fpclassify (src) == FP_SUBNORMAL); > + int FinNeg_res = __builtin_finite (src) && (src < 0); > + > + int result = (((imm & 1) && qNaN_res) > + || (((imm >> 1) & 1) && Pzero_res) > + || (((imm >> 2) & 1) && Nzero_res) > + || (((imm >> 3) & 1) && PInf_res) > + || (((imm >> 4) & 1) && NInf_res) > + || (((imm >> 5) & 1) && Denorm_res) > + || (((imm >> 6) & 1) && FinNeg_res) > + || (((imm >> 7) & 1) && sNaN_res)); > + return result; > +} > +#endif > + > +MASK_TYPE > +CALC (_Float16 *s1, int imm) > +{ > + int i; > + MASK_TYPE res = 0; > + > + for (i = 0; i < SIZE; i++) > +if (check_fp_class_hp(s1[i], imm)) > + res = res | (1 << i); > + > + return res; > +} > + > +void > +TEST (void) > +{ > + int i; > + UNION_TYPE (AVX512F_LEN, h) src; > + MASK_TYPE res1, res2, res_ref = 0; > + MASK_TYPE mask = MASK_VALUE; > + > + src.a[SIZE - 1] = NAN; > + src.a[SIZE - 2] = 1.0 / 0.0; > + for (i = 0; i < SIZE - 2; i++) > +{ > + src.a[i] = -24.43 + 0.6 * i; > +} > + > + res1 = INTRINSIC (_fpclass_ph_mask) (src.x, 0xFF); > + res2 = INTRINSIC (_mask_fpclass_ph_mask) (mask, src.x, 0xFF); > + > + 
res_ref = CALC (src.a, 0xFF); > + > + if (res_ref != res1) > +abort (); > + > + if ((mask & res_ref) != res2) > +abort (); > +} > -- > 2.31.1 > -- BR, Hongtao
Re: [r15-3359 Regression] FAIL: gcc.target/i386/avx10_2-bf-vector-cmpp-1.c (test for excess errors) on Linux/x86_64
On Tue, Sep 3, 2024 at 9:45 AM Jiang, Haochen via Gcc-regression wrote: > > As with previous AVX10.2 testcases, this is caused by an option combination > warning, > which is expected. > Can we put the warning for mixed usage of -mavx10 and -mavx512f under -Wpsabi, and add -Wno-psabi in addition to -march=cascadelake to avoid the false positive? -- BR, Hongtao
Re: [PATCH] i386: Support partial vectorized V2BF/V4BF smaxmin
On Mon, Sep 2, 2024 at 4:42 PM Levy Hsu wrote: > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? Ok. > > This patch supports sminmax for partial vectorized V2BF/V4BF. > > gcc/ChangeLog: > > * config/i386/mmx.md (3): New define_expand for > V2BF/V4BFsmaxmin > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c: New test. > --- > gcc/config/i386/mmx.md| 19 ++ > .../avx10_2-partial-bf-vector-smaxmin-1.c | 36 +++ > 2 files changed, 55 insertions(+) > create mode 100644 > gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c > > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md > index 9116ddb5321..3f12a1349ab 100644 > --- a/gcc/config/i386/mmx.md > +++ b/gcc/config/i386/mmx.md > @@ -2098,6 +2098,25 @@ >DONE; > }) > > +(define_expand "3" > + [(set (match_operand:VBF_32_64 0 "register_operand") > +(smaxmin:VBF_32_64 > + (match_operand:VBF_32_64 1 "nonimmediate_operand") > + (match_operand:VBF_32_64 2 "nonimmediate_operand")))] > + "TARGET_AVX10_2_256" > +{ > + rtx op0 = gen_reg_rtx (V8BFmode); > + rtx op1 = lowpart_subreg (V8BFmode, > + force_reg (mode, operands[1]), mode); > + rtx op2 = lowpart_subreg (V8BFmode, > + force_reg (mode, operands[2]), mode); > + > + emit_insn (gen_v8bf3 (op0, op1, op2)); > + > + emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode)); > + DONE; > +}) > + > (define_expand "sqrt2" >[(set (match_operand:VHF_32_64 0 "register_operand") > (sqrt:VHF_32_64 > diff --git > a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c > b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c > new file mode 100644 > index 000..0a7cc58e29d > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-smaxmin-1.c > @@ -0,0 +1,36 @@ > +/* { dg-do compile { target { ! 
ia32 } } } */ > +/* { dg-options "-mavx10.2 -Ofast" } */ > +/* /* { dg-final { scan-assembler-times "vmaxpbf16" 2 } } */ > +/* /* { dg-final { scan-assembler-times "vminpbf16" 2 } } */ > + > +void > +maxpbf16_64 (__bf16* restrict dest, __bf16* restrict src1, __bf16* restrict > src2) > +{ > + int i; > + for (i = 0; i < 4; i++) > +dest[i] = src1[i] > src2[i] ? src1[i] : src2[i]; > +} > + > +void > +maxpbf16_32 (__bf16* restrict dest, __bf16* restrict src1, __bf16* restrict > src2) > +{ > + int i; > + for (i = 0; i < 2; i++) > +dest[i] = src1[i] > src2[i] ? src1[i] : src2[i]; > +} > + > +void > +minpbf16_64 (__bf16* restrict dest, __bf16* restrict src1, __bf16* restrict > src2) > +{ > + int i; > + for (i = 0; i < 4; i++) > +dest[i] = src1[i] < src2[i] ? src1[i] : src2[i]; > +} > + > +void > +minpbf16_32 (__bf16* restrict dest, __bf16* restrict src1, __bf16* restrict > src2) > +{ > + int i; > + for (i = 0; i < 2; i++) > +dest[i] = src1[i] < src2[i] ? src1[i] : src2[i]; > +} > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH] i386: Support partial vectorized V2BF/V4BF plus/minus/mult/div/sqrt
On Mon, Sep 2, 2024 at 4:33 PM Levy Hsu wrote: > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? > > This patch introduces new mode iterators and expands for the i386 > architecture to support partial vectorization of bf16 operations using > AVX10.2 instructions. These operations include addition, subtraction, > multiplication, division, and square root calculations for V2BF and V4BF data > types. Ok. > > gcc/ChangeLog: > > * config/i386/mmx.md (VBF_32_64): New mode iterator for partial > vectorized V2BF/V4BF. > (3): New define_expand for plusminusmultdiv. > (sqrt2): New define_expand for sqrt. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c: New test. > * gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c: New test. > --- > gcc/config/i386/mmx.md| 37 > .../avx10_2-partial-bf-vector-fast-math-1.c | 22 +++ > .../avx10_2-partial-bf-vector-operations-1.c | 57 +++ > 3 files changed, 116 insertions(+) > create mode 100644 > gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c > create mode 100644 > gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c > > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md > index e0065ed4d48..9116ddb5321 100644 > --- a/gcc/config/i386/mmx.md > +++ b/gcc/config/i386/mmx.md > @@ -94,6 +94,8 @@ > > (define_mode_iterator VHF_32_64 [V2HF (V4HF "TARGET_MMX_WITH_SSE")]) > > +(define_mode_iterator VBF_32_64 [V2BF (V4BF "TARGET_MMX_WITH_SSE")]) > + > ;; Mapping from integer vector mode to mnemonic suffix > (define_mode_attr mmxvecsize >[(V8QI "b") (V4QI "b") (V2QI "b") > @@ -2036,6 +2038,26 @@ >DONE; > }) > > +;; VDIVNEPBF16 does not generate floating point exceptions. 
> +(define_expand "3" > + [(set (match_operand:VBF_32_64 0 "register_operand") > +(plusminusmultdiv:VBF_32_64 > + (match_operand:VBF_32_64 1 "nonimmediate_operand") > + (match_operand:VBF_32_64 2 "nonimmediate_operand")))] > + "TARGET_AVX10_2_256" > +{ > + rtx op0 = gen_reg_rtx (V8BFmode); > + rtx op1 = lowpart_subreg (V8BFmode, > + force_reg (mode, operands[1]), mode); > + rtx op2 = lowpart_subreg (V8BFmode, > + force_reg (mode, operands[2]), mode); > + > + emit_insn (gen_v8bf3 (op0, op1, op2)); > + > + emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode)); > + DONE; > +}) > + > (define_expand "divv2hf3" >[(set (match_operand:V2HF 0 "register_operand") > (div:V2HF > @@ -2091,6 +2113,21 @@ >DONE; > }) > > +(define_expand "sqrt2" > + [(set (match_operand:VBF_32_64 0 "register_operand") > + (sqrt:VBF_32_64 (match_operand:VBF_32_64 1 "vector_operand")))] > + "TARGET_AVX10_2_256" > +{ > + rtx op0 = gen_reg_rtx (V8BFmode); > + rtx op1 = lowpart_subreg (V8BFmode, > + force_reg (mode, operands[1]), mode); > + > + emit_insn (gen_sqrtv8bf2 (op0, op1)); > + > + emit_move_insn (operands[0], lowpart_subreg (mode, op0, V8BFmode)); > + DONE; > +}) > + > (define_expand "2" >[(set (match_operand:VHF_32_64 0 "register_operand") > (absneg:VHF_32_64 > diff --git > a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c > b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c > new file mode 100644 > index 000..fd064f17445 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-fast-math-1.c > @@ -0,0 +1,22 @@ > +/* { dg-do compile { target { ! 
ia32 } } } */ > +/* { dg-options "-mavx10.2 -O2" } */ > +/* { dg-final { scan-assembler-times "vmulnepbf16\[ > \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ > \\t\]+#)" 2 } } */ > +/* { dg-final { scan-assembler-times "vrcppbf16\[ > \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ \\t\]+#)" 2 } } */ > + > +typedef __bf16 v4bf __attribute__ ((__vector_size__ (8))); > +typedef __bf16 v2bf __attribute__ ((__vector_size__ (4))); > + > + > +__attribute__((optimize("fast-math"))) > +v4bf > +foo_div_fast_math_4 (v4bf a, v4bf b) > +{ > + return a / b; > +} > + > +__attribute__((optimize("fast-math"))) > +v2bf > +foo_div_fast_math_2 (v2bf a, v2bf b) > +{ > + return a / b; > +} > diff --git > a/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c > b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c > new file mode 100644 > index 000..e7ee08a20a9 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/avx10_2-partial-bf-vector-operations-1.c > @@ -0,0 +1,57 @@ > +/* { dg-do compile { target { ! ia32 } } } */ > +/* { dg-options "-mavx10.2 -O2" } */ > +/* { dg-final { scan-assembler-times "vmulnepbf16\[ > \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ > \\t\]+#)" 2 } } */ > +/* { dg-final { scan-assembler-times "vaddnepbf16\[
Re: [PATCH 0/8] i386: Opmitize code with AVX10.2 new instructions
On Mon, Aug 26, 2024 at 2:43 PM Haochen Jiang wrote: > > Hi all, > > I have just commited AVX10.2 new instructions patches into trunk hours > ago. The next and final part for AVX10.2 upstream is to optimize code > with AVX10.2 new instructions. > > In this patch series, it will contain the following optimizations: > > - VNNI instruction auto vectorize (PATCH 1). > - Codegen optimization with new scalar comparison instructions to > eliminate redundant code (PATCH 2-3). > - BF16 instruction auto vectorize (PATCH 4-8). > > This will finish the upstream for AVX10.2 series. > > Afterwards, we may add V2BF/V4BF in another thread just like what we > have done for V2HF/V4HF when AVX512FP16 upstreamed. > > Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk? Ok for all 8 patches. > > Thx, > Haochen > > -- BR, Hongtao
Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs
On Fri, Aug 23, 2024 at 5:46 PM HAO CHEN GUI wrote: > > Hi Hongtao, > > 在 2024/8/23 11:47, Hongtao Liu 写道: > > On Fri, Aug 23, 2024 at 11:03 AM HAO CHEN GUI wrote: > >> > >> Hi Hongtao, > >> > >> 在 2024/8/23 9:47, Hongtao Liu 写道: > >>> On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI > >>> wrote: > >>>> > >>>> Hi Hongtao, > >>>> > >>>> 在 2024/8/21 11:21, Hongtao Liu 写道: > >>>>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please > >>>>> rebase your patch and see if there's still the regressions. > >>>> > >>>> There's still regressions. The patch enables V16QI const0 store, but > >>>> it also enables V8QI const0 store. The vector mode is preferable than > >>>> scalar mode so that V8QI is used for 8-byte memory clear instead of > >>>> DI. It's sub-optimal. > >>> Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT? > >> Not sure if all targets prefer it. Richard & Jeff, what's your opinion? > >> > >> IMHO, could we disable it from predicate or convert it to DI mode store > >> if V8QI const0 store is sub-optimal on i386? > >> > >> > >>>> > >>>> Another issue is it takes lots of subreg to generate an all-zero > >>>> V16QI register sometime. As PR92080 has been fixed, it can't reuse > >>>> existing all-zero V16QI register. > > Backend rtx_cost needs to be adjusted to prevent const0 propagation. > > The current rtx_cost for const0 for i386 is 0, which will enable > > propagation of const0. > > > >/* If MODE2 is appropriate for an MMX register, then tie > > @@ -21588,10 +21590,12 @@ ix86_rtx_costs (rtx x, machine_mode mode, > > int outer_code_i, int opno, > > case 0: > > break; > > case 1: /* 0: xor eliminates false dependency */ > > - *total = 0; > > + /* Add extra cost 1 to prevent propagation of CONST_VECTOR > > +for SET, which will enable more CSE optimization. 
*/ > > + *total = 0 + (outer_code == SET); > > return true; > > default: /* -1: cmp contains false dependency */ > > - *total = 1; > > + *total = 1 + (outer_code == SET); > > return true; > > } > > > > the upper hunk should help for that. > Sorry, I didn't get your point. Which problem it will fix? I tested > upper code. Nothing changed. Which kind of const0 propagation you want > to prevent? The patch itself doesn't enable CSE for const0_rtx, but it's needed after cse_insn recognizes CONST0_RTX with a different mode and replaces them with subreg. I thought you had changed the cse_insn part. On the other hand, pxor is cheap, what matters more is the CSE of broadcasting the same value to different modes. i.e. __m512i sinkz; __m256i sinky; void foo(char c) { sinkz = _mm512_set1_epi8(c); sinky = _mm256_set1_epi8(c); } > > Thanks > Gui Haochen > > >>>> > >>>> (insn 16 15 17 (set (reg:V4SI 118) > >>>> (const_vector:V4SI [ > >>>> (const_int 0 [0]) repeated x4 > >>>> ])) "auto-init-7.c":25:12 -1 > >>>> (nil)) > >>>> > >>>> (insn 17 16 18 (set (reg:V8HI 117) > >>>> (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1 > >>>> (nil)) > >>>> > >>>> (insn 18 17 19 (set (reg:V16QI 116) > >>>> (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1 > >>>> (nil)) > >>>> > >>>> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114) > >>>> (const_int 12 [0xc])) [0 MEM [(void > >>>> *)&temp3]+12 S16 A32]) > >>>> (reg:V16QI 116)) "auto-init-7.c":25:12 -1 > >>>> (nil)) > >>> I think those subregs can be simplified by later rtl passes? > >> > >> Here is the final dump. There are two all-zero 16-byte vector > >> registers. It can't figure out V4SI could be a subreg of V16QI. 
> >> > >> (insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115]) > >> (const_vector:V16QI [ > >> (const_int 0 [0]) repeated x16 > >> ])) "auto-init-7.c":25:12 2154 {movv16qi_internal} > >> (nil)) > >> (insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM > >> [(void *)&temp3]+0 S16 A128]) > >> (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 > >> {movv16qi_internal} > >> (nil)) > >> (insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118]) > >> (const_vector:V4SI [ > >> (const_int 0 [0]) repeated x4 > >> ])) "auto-init-7.c":25:12 2160 {movv4si_internal} > >> (nil)) > >> (insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114]) > >> (const_int 12 [0xc])) [0 MEM [(void > >> *)&temp3]+12 S16 A32]) > >> (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 > >> {movv16qi_internal} > >> > >> Thanks > >> Gui Haochen > >> > >>>> > >>>> Thanks > >>>> Gui Haochen > >>> > >>> > >>> > > > > > > -- BR, Hongtao
Re: [PATCH 00/12] AVX10.2: Support new instructions
On Mon, Aug 19, 2024 at 4:57 PM Haochen Jiang wrote: > > Hi all, > > The AVX10.2 ymm rounding patches has been merged to trunk around > 6 hours ago. As mentioned before, next step will be AVX10.2 new > instruction support. > > This patch series could be divided into three part. > > The first patch will refactor m512-check.h under testsuite to reuse > AVX-512 helper functions and unions and avoid ABI warnings when using > AVX10. > > The following ten patches will support all AVX10.2 new instrctions, > including: > > - AI Datatypes, Conversions, and post-Convolution Instructions. > - Media Acceleration. > - IEEE-754-2019 Minimum and Maximum Support. > - Saturating Conversions. > - Zero-extending Partial Vector Copies. > - FP Scalar Comparison. > > For FP Scalar Comparison part (a.k.a comx instructions), we will only > provide pattern support but not intrin support since it is redundant > with comi ones for common usage. We will also add some optimizations > afterwards for common usage with comx instructions. If there are some > strong requests, we will add intrin support in the future. > > The final patch will add bf8 -> fp16 intrin for convenience. Since the > conversion from bf8 to fp16 is only casting for fraction part due to > same bits for exponent part, we will use a sequence of instructions > instead of new instructions. It is just like the scenario for bf16 -> > fp32 conversion. > > After all these patch merged, the next step would be optimizations based > on AVX10.2 new instructions, including vnni vectorization, bf16 > vectorization, comx optmization, etc. > > Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk? Ok for all 12 patches. > > Thx, > Haochen > -- BR, Hongtao
Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs
On Fri, Aug 23, 2024 at 11:03 AM HAO CHEN GUI wrote: > > Hi Hongtao, > > 在 2024/8/23 9:47, Hongtao Liu 写道: > > On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI wrote: > >> > >> Hi Hongtao, > >> > >> 在 2024/8/21 11:21, Hongtao Liu 写道: > >>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please > >>> rebase your patch and see if there's still the regressions. > >> > >> There's still regressions. The patch enables V16QI const0 store, but > >> it also enables V8QI const0 store. The vector mode is preferable than > >> scalar mode so that V8QI is used for 8-byte memory clear instead of > >> DI. It's sub-optimal. > > Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT? > Not sure if all targets prefer it. Richard & Jeff, what's your opinion? > > IMHO, could we disable it from predicate or convert it to DI mode store > if V8QI const0 store is sub-optimal on i386? > > > >> > >> Another issue is it takes lots of subreg to generate an all-zero > >> V16QI register sometime. As PR92080 has been fixed, it can't reuse > >> existing all-zero V16QI register. Backend rtx_cost needs to be adjusted to prevent const0 propagation. The current rtx_cost for const0 for i386 is 0, which will enable propagation of const0. /* If MODE2 is appropriate for an MMX register, then tie @@ -21588,10 +21590,12 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno, case 0: break; case 1: /* 0: xor eliminates false dependency */ - *total = 0; + /* Add extra cost 1 to prevent propagation of CONST_VECTOR +for SET, which will enable more CSE optimization. */ + *total = 0 + (outer_code == SET); return true; default: /* -1: cmp contains false dependency */ - *total = 1; + *total = 1 + (outer_code == SET); return true; } the upper hunk should help for that. 
> >> > >> (insn 16 15 17 (set (reg:V4SI 118) > >> (const_vector:V4SI [ > >> (const_int 0 [0]) repeated x4 > >> ])) "auto-init-7.c":25:12 -1 > >> (nil)) > >> > >> (insn 17 16 18 (set (reg:V8HI 117) > >> (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1 > >> (nil)) > >> > >> (insn 18 17 19 (set (reg:V16QI 116) > >> (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1 > >> (nil)) > >> > >> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114) > >> (const_int 12 [0xc])) [0 MEM [(void > >> *)&temp3]+12 S16 A32]) > >> (reg:V16QI 116)) "auto-init-7.c":25:12 -1 > >> (nil)) > > I think those subregs can be simplified by later rtl passes? > > Here is the final dump. There are two all-zero 16-byte vector > registers. It can't figure out V4SI could be a subreg of V16QI. > > (insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115]) > (const_vector:V16QI [ > (const_int 0 [0]) repeated x16 > ])) "auto-init-7.c":25:12 2154 {movv16qi_internal} > (nil)) > (insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM > [(void *)&temp3]+0 S16 A128]) > (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 > {movv16qi_internal} > (nil)) > (insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118]) > (const_vector:V4SI [ > (const_int 0 [0]) repeated x4 > ])) "auto-init-7.c":25:12 2160 {movv4si_internal} > (nil)) > (insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114]) > (const_int 12 [0xc])) [0 MEM [(void *)&temp3]+12 > S16 A32]) > (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 > {movv16qi_internal} > > Thanks > Gui Haochen > > >> > >> Thanks > >> Gui Haochen > > > > > > -- BR, Hongtao
Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs
On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI wrote: > > Hi Hongtao, > > 在 2024/8/21 11:21, Hongtao Liu 写道: > > r15-3058-gbb42c551905024 support const0 operand for movv16qi, please > > rebase your patch and see if there's still the regressions. > > There's still regressions. The patch enables V16QI const0 store, but > it also enables V8QI const0 store. The vector mode is preferable than > scalar mode so that V8QI is used for 8-byte memory clear instead of > DI. It's sub-optimal. Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT? > > Another issue is it takes lots of subreg to generate an all-zero > V16QI register sometime. As PR92080 has been fixed, it can't reuse > existing all-zero V16QI register. > > (insn 16 15 17 (set (reg:V4SI 118) > (const_vector:V4SI [ > (const_int 0 [0]) repeated x4 > ])) "auto-init-7.c":25:12 -1 > (nil)) > > (insn 17 16 18 (set (reg:V8HI 117) > (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1 > (nil)) > > (insn 18 17 19 (set (reg:V16QI 116) > (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1 > (nil)) > > (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114) > (const_int 12 [0xc])) [0 MEM [(void *)&temp3]+12 > S16 A32]) > (reg:V16QI 116)) "auto-init-7.c":25:12 -1 > (nil)) I think those subregs can be simplified by later rtl passes? > > Thanks > Gui Haochen -- BR, Hongtao
Re: [PATCH] Align ix86_{move_max,store_max} with vectorizer.
On Wed, Aug 21, 2024 at 4:49 PM Richard Biener wrote: > > On Wed, Aug 21, 2024 at 7:40 AM liuhongt wrote: > > > > When none of mprefer-vector-width, avx256_optimal/avx128_optimal, > > avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC will > > set ix86_{move_max,store_max} as max available vector length except > > for AVX part. > > > > if (TARGET_AVX512F_P (opts->x_ix86_isa_flags) > > && TARGET_EVEX512_P (opts->x_ix86_isa_flags2)) > > opts->x_ix86_move_max = PVW_AVX512; > > else > > opts->x_ix86_move_max = PVW_AVX128; > > > > So for -mavx2, vectorizer will choose 256-bit for vectorization, but > > 128-bit is used for struct copy, there could be a potential STLF issue > > due to this "misalign". > > > > The patch fixes that and improved 538.imagick_r by ~30% for > > -march=x86-64-v3 -O2. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > Any comments? > > Should we look at the avx128_optimal tune and/or avx256_split_regs and > avx256_optimal > also for 512? Because IIRC the vectorizers default looks at that as > well (OTOH larger > stores should be fine for STLF). For Double Pumped processors, i.e. SRF, there's no STLF issue for a 128-bit store and a 256-bit load, since the 256-bit load is split into two 128-bit loads. I guess it should be similar for Znver1/Znver4, so it should be fine with the mismatch between the struct copy and the vectorizer size. One exception is that we use 256-bit for vectorization and 512-bit for struct copy on SPR, which could be an issue when the struct copy is after the vectorization. But I didn't observe any such cases yet, and for the non-STLF-stall case, a 512-bit copy should be better than a 256-bit copy on SPR, so I'll leave it there. (There's a plan to enable 512-bit vectorization for SPR by default; it's ongoing.) > > > gcc/ChangeLog: > > > > * config/i386/i386-options.cc (ix86_option_override_internal): > > set ix86_{move_max,store_max} to PVW_AVX256 when TARGET_AVX > > instead of PVW_AVX128. 
> > > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pieces-memcpy-10.c: Add -mprefer-vector-width=128. > > * gcc.target/i386/pieces-memcpy-6.c: Ditto. > > * gcc.target/i386/pieces-memset-38.c: Ditto. > > * gcc.target/i386/pieces-memset-40.c: Ditto. > > * gcc.target/i386/pieces-memset-41.c: Ditto. > > * gcc.target/i386/pieces-memset-42.c: Ditto. > > * gcc.target/i386/pieces-memset-43.c: Ditto. > > * gcc.target/i386/pieces-strcpy-2.c: Ditto. > > * gcc.target/i386/pieces-memcpy-22.c: New test. > > * gcc.target/i386/pieces-memset-51.c: New test. > > * gcc.target/i386/pieces-strcpy-3.c: New test. > > --- > > gcc/config/i386/i386-options.cc | 6 ++ > > gcc/testsuite/gcc.target/i386/pieces-memcpy-10.c | 2 +- > > gcc/testsuite/gcc.target/i386/pieces-memcpy-22.c | 12 > > gcc/testsuite/gcc.target/i386/pieces-memcpy-6.c | 2 +- > > gcc/testsuite/gcc.target/i386/pieces-memset-38.c | 2 +- > > gcc/testsuite/gcc.target/i386/pieces-memset-40.c | 2 +- > > gcc/testsuite/gcc.target/i386/pieces-memset-41.c | 2 +- > > gcc/testsuite/gcc.target/i386/pieces-memset-42.c | 2 +- > > gcc/testsuite/gcc.target/i386/pieces-memset-43.c | 2 +- > > gcc/testsuite/gcc.target/i386/pieces-memset-51.c | 12 > > gcc/testsuite/gcc.target/i386/pieces-strcpy-2.c | 2 +- > > gcc/testsuite/gcc.target/i386/pieces-strcpy-3.c | 15 +++ > > 12 files changed, 53 insertions(+), 8 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memcpy-22.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-memset-51.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pieces-strcpy-3.c > > > > diff --git a/gcc/config/i386/i386-options.cc > > b/gcc/config/i386/i386-options.cc > > index f423455b363..f79257cc764 100644 > > --- a/gcc/config/i386/i386-options.cc > > +++ b/gcc/config/i386/i386-options.cc > > @@ -3023,6 +3023,9 @@ ix86_option_override_internal (bool main_args_p, > > if (TARGET_AVX512F_P (opts->x_ix86_isa_flags) > > && TARGET_EVEX512_P (opts->x_ix86_isa_flags2)) > > opts->x_ix86_move_max = 
PVW_AVX512; > > + /* Align with vectorizer to avoid potential STLF issue. */ > > + else if (TARGET_AVX_P (opts->x_ix86_isa_flags)) > > + opts->x_ix86_move_max = PVW_AVX256; > > else > > opts->x_ix86_move_max = PVW_AVX128; > > } > > @@ -3047,6 +3050,9 @@ ix86_option_override_internal (bool main_args_p, > > if (TARGET_AVX512F_P (opts->x_ix86_isa_flags) > > && TARGET_EVEX512_P (opts->x_ix86_isa_flags2)) > > opts->x_ix86_store_max = PVW_AVX512; > > + /* Align with vectorizer to avoid potential STLF issue. */ > > + else if (TARGET_AVX_P (opts->x_ix86_isa_
Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs
On Tue, Aug 20, 2024 at 2:50 PM Hongtao Liu wrote: > > On Tue, Aug 20, 2024 at 2:12 PM HAO CHEN GUI wrote: > > > > Hi, > > Add Hongtao Liu as the patch affects x86. > > > > 在 2024/8/20 6:32, Richard Sandiford 写道: > > > HAO CHEN GUI writes: > > >> Hi, > > >> This patch adds const0 move checking for CLEAR_BY_PIECES. The original > > >> vec_duplicate handles duplicates of non-constant inputs. But 0 is a > > >> constant. So even a platform doesn't support vec_duplicate, it could > > >> still do clear by pieces if it supports const0 move by that mode. > > >> > > >> Compared to the previous version, the main change is to set up a > > >> new function to generate const0 for certain modes and use the function > > >> as by_pieces_constfn for CLEAR_BY_PIECES. > > >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html > > >> > > >> Bootstrapped and tested on powerpc64-linux BE and LE with no > > >> regressions. > > >> > > >> On i386, it got several regressions. One issue is the predicate of > > >> V16QI move expand doesn't include const0. Thus V16QI mode can't be used > > >> for clear by pieces with the patch. The second issue is the const0 is > > >> passed directly to the move expand with the patch. Originally it is > > >> forced to a pseudo and i386 can leverage the previous data to do > > >> optimization. > > > > > > The patch looks good to me, but I suppose we'll need to decide what > > > to do about x86. > > > > > > It's not obvious to me why movv16qi requires a nonimmediate_operand > > > source, especially since ix86_expand_vector_mode does have code to > > > cope with constant operand[1]s. emit_move_insn_1 doesn't check the > > > predicates anyway, so the predicate will have little effect. > > > > > > A workaround would be to check legitimate_constant_p instead of the > > > predicate, but I'm not sure that that should be necessary. > > > > > > Has this already been discussed? 
If not, we should loop in the x86 > > > maintainers (but I didn't do that here in case it would be a repeat). > > > > I also noticed it. Not sure why movv16qi requires a > > nonimmediate_operand, while ix86_expand_vector_mode could deal with > > constant op. Looking forward to Hongtao's comments. > The code has been there since 2005 before I'm involved. > It looks to me at the beginning both mov and > *mov_internal only support nonimmediate_operand for the > operands[1]. > And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to > nonimmediate_or_sse_const_operand for *mov_internal, but not for > mov. > I think we can align the predicate between mov and *mov_internal. > I'll do some tests and reach back to you. r15-3058-gbb42c551905024 support const0 operand for movv16qi, please rebase your patch and see if there's still the regressions. > > > > > > > > As far as the second issue goes, I suppose there are at least three > > > ways of handling shared constants: > > > > > > (1) Force the zero into a register and leave later optimisations to > > > propagate the zero where profitable. > > The zero can be propagated into the store, but the address adjustment > > may not be combined into insn properly. For instance, if zero is > > forced to a register, "movv2x8qi" insn is generated. The address > > adjustment becomes a separate insn as "movv2x8qi" insn doesn't support > > d-from address. When zero is propagated, it converts "movv2x8qi" to > > "movti". "movti" supports d-from as well as post/inc address. Probably, > > the auto_inc_dec pass combines address adjustment insn into previous > > "movti" to generate a post inc "movti". The expected optimization might > > be to combine address adjustment insn into second "movit" and generate a > > d-form "movti". It's a regression issue I found in aarch64. > > > > Also we checks if const0 is supported for mov optab. But finally we > > force the const0 to a register and generate a store with the register. 
> > Seems it's not reasonable. > > > > > > > > (2) Emit stores of zero and expect a later pass to share constants > > > where beneficial. > > Not sure which pass can optimize it. > > > > > > > > (3) Generate stores of zero and leave t
Re: [PATCH] Align predicates for operands[1] between mov and *mov_internal.
On Tue, Aug 20, 2024 at 6:25 PM liuhongt wrote: > > From [1] [1] https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660575.html > > > It's not obvious to me why movv16qi requires a nonimmediate_operand > > > source, especially since ix86_expand_vector_mode does have code to > > > cope with constant operand[1]s. emit_move_insn_1 doesn't check the > > > predicates anyway, so the predicate will have little effect. > > > > > > A workaround would be to check legitimate_constant_p instead of the > > > predicate, but I'm not sure that that should be necessary. > > > > > > Has this already been discussed? If not, we should loop in the x86 > > > maintainers (but I didn't do that here in case it would be a repeat). > > > > I also noticed it. Not sure why movv16qi requires a > > nonimmediate_operand, while ix86_expand_vector_mode could deal with > > constant op. Looking forward to Hongtao's comments. > The code has been there since 2005 before I'm involved. > It looks to me at the beginning both mov and > *mov_internal only support nonimmediate_operand for the > operands[1]. > And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to > nonimmediate_or_sse_const_operand for *mov_internal, but not for > mov. I think we can align the predicate between mov > and *mov_internal. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? > > gcc/ChangeLog: > > * config/i386/sse.md (mov): Align predicates for > operands[1] between mov and *mov_internal. 
> --- > gcc/config/i386/sse.md | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index d1010bc5682..7ecfbd55809 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -1387,7 +1387,7 @@ (define_mode_attr DOUBLEMASKMODE > > (define_expand "mov" >[(set (match_operand:VMOVE 0 "nonimmediate_operand") > - (match_operand:VMOVE 1 "nonimmediate_operand"))] > + (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"))] >"TARGET_SSE" > { >ix86_expand_vector_move (mode, operands); > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs
On Tue, Aug 20, 2024 at 2:12 PM HAO CHEN GUI wrote: > > Hi, > Add Hongtao Liu as the patch affects x86. > > 在 2024/8/20 6:32, Richard Sandiford 写道: > > HAO CHEN GUI writes: > >> Hi, > >> This patch adds const0 move checking for CLEAR_BY_PIECES. The original > >> vec_duplicate handles duplicates of non-constant inputs. But 0 is a > >> constant. So even a platform doesn't support vec_duplicate, it could > >> still do clear by pieces if it supports const0 move by that mode. > >> > >> Compared to the previous version, the main change is to set up a > >> new function to generate const0 for certain modes and use the function > >> as by_pieces_constfn for CLEAR_BY_PIECES. > >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html > >> > >> Bootstrapped and tested on powerpc64-linux BE and LE with no > >> regressions. > >> > >> On i386, it got several regressions. One issue is the predicate of > >> V16QI move expand doesn't include const0. Thus V16QI mode can't be used > >> for clear by pieces with the patch. The second issue is the const0 is > >> passed directly to the move expand with the patch. Originally it is > >> forced to a pseudo and i386 can leverage the previous data to do > >> optimization. > > > > The patch looks good to me, but I suppose we'll need to decide what > > to do about x86. > > > > It's not obvious to me why movv16qi requires a nonimmediate_operand > > source, especially since ix86_expand_vector_mode does have code to > > cope with constant operand[1]s. emit_move_insn_1 doesn't check the > > predicates anyway, so the predicate will have little effect. > > > > A workaround would be to check legitimate_constant_p instead of the > > predicate, but I'm not sure that that should be necessary. > > > > Has this already been discussed? If not, we should loop in the x86 > > maintainers (but I didn't do that here in case it would be a repeat). > > I also noticed it. 
Not sure why movv16qi requires a > nonimmediate_operand, while ix86_expand_vector_mode could deal with > constant op. Looking forward to Hongtao's comments. The code has been there since 2005, before I was involved. It looks to me that at the beginning both mov and *mov_internal only supported nonimmediate_operand for operands[1]. And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to nonimmediate_or_sse_const_operand for *mov_internal, but not for mov. I think we can align the predicate between mov and *mov_internal. I'll do some tests and get back to you. > > > > > As far as the second issue goes, I suppose there are at least three > > ways of handling shared constants: > > > > (1) Force the zero into a register and leave later optimisations to > > propagate the zero where profitable. > The zero can be propagated into the store, but the address adjustment > may not be combined into the insn properly. For instance, if zero is > forced to a register, a "movv2x8qi" insn is generated. The address > adjustment becomes a separate insn as the "movv2x8qi" insn doesn't support > a d-form address. When zero is propagated, it converts "movv2x8qi" to > "movti". "movti" supports d-form as well as post/inc addresses. Probably, > the auto_inc_dec pass combines the address adjustment insn into the previous > "movti" to generate a post-inc "movti". The expected optimization would > be to combine the address adjustment insn into the second "movti" and generate a > d-form "movti". It's a regression issue I found on aarch64. > > Also, we check if const0 is supported for the mov optab, but finally we > force the const0 to a register and generate a store with the register. > That doesn't seem reasonable. > > > > > (2) Emit stores of zero and expect a later pass to share constants > > where beneficial. > Not sure which pass can optimize it. > > > > > (3) Generate stores of zero and leave the target expanders to force > > constants into registers on the fly if reuse seems plausibly > > beneficial. 
> > > The constant zero with different modes are not relevant. Not sure > which pass can optimize it. The compiler should be taught that > reg 102 can be expressed as a subreg of reg 100. > > (insn 6 5 7 2 (set (reg:V32QI 100) > (const_vector:V32QI [ > (const_int 0 [0]) repeated x32 > ])) > > (insn 8 7 0 2 (set (reg:V16QI 102) > (const_vector:V16QI [ > (const_int 0 [0]) r
Re: [PATCH 00/22] Support AVX10.2 ymm rounding
On Wed, Aug 14, 2024 at 5:07 PM Haochen Jiang wrote: > > Hi all, > > The initial patch for AVX10.2 has been merged this week. > > For the upcoming patches, we will first upstream ymm rounding control part. > > In ymm rounding part, ALL the instructions in AVX512 with 512-bit rounding > control will also have 256-bit rounding control in AVX10.2. > > For clearness, the patch order is based on alphabetical order. Each patch > will include its intrin definition and related tests. Sometimes pattern is > not changed in the patch because the previous change in the patch series > has already enabled the 256 bit rounding in the pattern. > > Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk? Ok for all 22 patches in the thread. > > Thx, > Haochen > > Ref: Intel Advanced Vector Extensions 10.2 Architecture Specification > https://cdrdv2.intel.com/v1/dl/getContent/828965 > > -- BR, Hongtao
Re: [PATCH v2] [x86] Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need secondary reload.
On Thu, Aug 15, 2024 at 3:27 PM liuhongt wrote: > > It results in 2 failures for x86_64-pc-linux-gnu{\ > -march=cascadelake}; > > gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o > gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1 > > For pr113560.c, now GCC generates mulx instead of mulq with > -march=cascadelake, which should be optimal, so adjust testcase for > that. > For gcc.target/i386/extendditi2-1.c, RA happens to choose another > register instead of rax and result in > > movq%rdi, %rbp > movq%rdi, %rax > sarq$63, %rbp > movq%rbp, %rdx > > The patch adds a new define_peephole2 for that. > > gcc/ChangeLog: > > PR target/116274 > * config/i386/i386-expand.cc (ix86_expand_vector_move): > Restrict special case TImode to 128-bit vector conversions via > V2DI under ix86_pre_reload_split (). > * config/i386/i386.cc (inline_secondary_memory_needed): > Movement between GENERAL_REGS and SSE_REGS for TImode doesn't > need secondary reload. > * config/i386/i386.md (*extendsidi2_rex64): Add a > define_peephole2 after it. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr116274.c: New test. > * gcc.target/i386/pr113560.c: Scan either mulq or mulx. 
> --- > gcc/config/i386/i386-expand.cc | 2 +- > gcc/config/i386/i386.cc | 18 -- > gcc/config/i386/i386.md | 19 +++ > gcc/testsuite/gcc.target/i386/pr113560.c | 2 +- > gcc/testsuite/gcc.target/i386/pr116274.c | 12 > 5 files changed, 45 insertions(+), 8 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > index bdbc1423267..ed546eeed6b 100644 > --- a/gcc/config/i386/i386-expand.cc > +++ b/gcc/config/i386/i386-expand.cc > @@ -751,7 +751,7 @@ ix86_expand_vector_move (machine_mode mode, rtx > operands[]) >&& SUBREG_P (op1) >&& GET_MODE (SUBREG_REG (op1)) == TImode >&& TARGET_64BIT && TARGET_SSE > - && can_create_pseudo_p ()) > + && ix86_pre_reload_split ()) > { >rtx tmp = gen_reg_rtx (V2DImode); >rtx lo = gen_reg_rtx (DImode); > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index f044826269c..4821892d1e0 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -20292,6 +20292,18 @@ inline_secondary_memory_needed (machine_mode mode, > reg_class_t class1, >if (!(INTEGER_CLASS_P (class1) || INTEGER_CLASS_P (class2))) > return true; > > + /* If the target says that inter-unit moves are more expensive > +than moving through memory, then don't generate them. */ > + if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC) > + || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC)) > + return true; > + > + /* Under SSE4.1, *movti_internal supports movement between > +SSE_REGS and GENERAL_REGS with pinsrq and pextrq. */ > + if (TARGET_SSE4_1 > + && (TARGET_64BIT ? mode == TImode : mode == DImode)) > + return false; > + >int msize = GET_MODE_SIZE (mode); > >/* Between SSE and general, we have moves no larger than word size. 
*/ > @@ -20304,12 +20316,6 @@ inline_secondary_memory_needed (machine_mode mode, > reg_class_t class1, > >if (msize < minsize) > return true; > - > - /* If the target says that inter-unit moves are more expensive > -than moving through memory, then don't generate them. */ > - if ((SSE_CLASS_P (class1) && !TARGET_INTER_UNIT_MOVES_FROM_VEC) > - || (SSE_CLASS_P (class2) && !TARGET_INTER_UNIT_MOVES_TO_VEC)) > - return true; > } > >return false; > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md > index db7789c17d2..1962a7ba5c9 100644 > --- a/gcc/config/i386/i386.md > +++ b/gcc/config/i386/i386.md > @@ -5041,6 +5041,25 @@ (define_split >DONE; > }) > > +(define_peephole2 > + [(set (match_operand:DI 0 "general_reg_operand") > + (match_operand:DI 1 "general_reg_operand")) > + (parallel [(set (match_dup 0) > + (ashiftrt:DI (match_dup 0) > + (const_int 63))) > + (clobber (reg:CC FLAGS_REG))]) > + (set (match_operand:DI 2 "general_reg_operand") (match_dup 1)) > + (set (match_operand:DI 3 "general_reg_operand") (match_dup 0))] > + "(optimize_function_for_size_p (cfun) || TARGET_USE_CLTD) > + && REGNO (operands[2]) == AX_REG > + && REGNO (operands[3]) == DX_REG > + && peep2_reg_dead_p (4, operands[0]) > + && !reg_mentioned_p (operands[0], operands[1]) > + && !reg_mentioned_p (operands[2], operands[0])" > + [(set (match_dup 2) (match_dup 1)) > + (parallel [(set (match_dup 3) (ashiftrt:DI (match_dup 2) (const_int 63))) > + (clobber (reg:CC FLAGS
Re: [PATCH v2] i386: Fix some vex insns that prohibit egpr
On Wed, Aug 14, 2024 at 4:23 PM Kong, Lingling wrote: > > > > -Original Message- > From: Kong, Lingling > Sent: Wednesday, August 14, 2024 4:20 PM > To: Kong, Lingling > Subject: [PATCH v2] i386: Fix some vex insns that prohibit egpr > > Although these vex insn have evex counterpart, but when it uses the displayed > vex prefix should not support APX EGPR. > Like TARGET_AVXVNNI, TARGET_IFMA and TARGET_AVXNECONVERT. > TARGET_AVXVNNIINT8 and TARGET_AVXVNNITINT16 are also vex insn should not > support egpr. Ok. > > gcc/ChangeLog: > > * config/i386/sse.md (vpmadd52): > Prohibit egpr for vex version. > (vpdpbusd_): Ditto. > (vpdpbusds_): Ditto. > (vpdpwssd_): Ditto. > (vpdpwssds_): Ditto. > (*vcvtneps2bf16_v4sf): Ditto. > (vcvtneps2bf16_v8sf): Ditto. > (vpdp_): Ditto. > (vbcstnebf162ps_): Ditto. > (vbcstnesh2ps_): Ditto. > (vcvtnee2ps_): Ditto. > (vcvtneo2ps_): Ditto. > (vpdp_): Ditto. > --- > gcc/config/i386/sse.md | 49 +++--- > 1 file changed, 32 insertions(+), 17 deletions(-) > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index > d1010bc5682..f0d94bba4e7 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -29886,7 +29886,7 @@ > (unspec:VI8_AVX2 > [(match_operand:VI8_AVX2 1 "register_operand" "0,0") >(match_operand:VI8_AVX2 2 "register_operand" "x,v") > - (match_operand:VI8_AVX2 3 "nonimmediate_operand" "xm,vm")] > + (match_operand:VI8_AVX2 3 "nonimmediate_operand" "xjm,vm")] > VPMADD52))] >"TARGET_AVXIFMA || (TARGET_AVX512IFMA && TARGET_AVX512VL)" >"@ > @@ -29894,6 +29894,7 @@ >vpmadd52\t{%3, %2, %0|%0, %2, %3}" >[(set_attr "isa" "avxifma,avx512ifmavl") > (set_attr "type" "ssemuladd") > + (set_attr "addr" "gpr16,*") > (set_attr "prefix" "vex,evex") > (set_attr "mode" "")]) > > @@ -30253,13 +30254,14 @@ > (unspec:VI4_AVX2 > [(match_operand:VI4_AVX2 1 "register_operand" "0,0") >(match_operand:VI4_AVX2 2 "register_operand" "x,v") > - (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xm,vm")] > + (match_operand:VI4_AVX2 3 
"nonimmediate_operand" "xjm,vm")] > UNSPEC_VPDPBUSD))] >"TARGET_AVXVNNI || (TARGET_AVX512VNNI && TARGET_AVX512VL)" >"@ >%{vex%} vpdpbusd\t{%3, %2, %0|%0, %2, %3} >vpdpbusd\t{%3, %2, %0|%0, %2, %3}" >[(set_attr ("prefix") ("vex,evex")) > + (set_attr "addr" "gpr16,*") > (set_attr ("isa") ("avxvnni,avx512vnnivl"))]) > > (define_insn "vpdpbusd__mask" > @@ -30321,13 +30323,14 @@ > (unspec:VI4_AVX2 > [(match_operand:VI4_AVX2 1 "register_operand" "0,0") >(match_operand:VI4_AVX2 2 "register_operand" "x,v") > - (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xm,vm")] > + (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xjm,vm")] > UNSPEC_VPDPBUSDS))] >"TARGET_AVXVNNI || (TARGET_AVX512VNNI && TARGET_AVX512VL)" >"@ > %{vex%} vpdpbusds\t{%3, %2, %0|%0, %2, %3} > vpdpbusds\t{%3, %2, %0|%0, %2, %3}" >[(set_attr ("prefix") ("vex,evex")) > + (set_attr "addr" "gpr16,*") > (set_attr ("isa") ("avxvnni,avx512vnnivl"))]) > > (define_insn "vpdpbusds__mask" > @@ -30389,13 +30392,14 @@ > (unspec:VI4_AVX2 > [(match_operand:VI4_AVX2 1 "register_operand" "0,0") >(match_operand:VI4_AVX2 2 "register_operand" "x,v") > - (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xm,vm")] > + (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xjm,vm")] > UNSPEC_VPDPWSSD))] >"TARGET_AVXVNNI || (TARGET_AVX512VNNI && TARGET_AVX512VL)" >"@ >%{vex%} vpdpwssd\t{%3, %2, %0|%0, %2, %3} >vpdpwssd\t{%3, %2, %0|%0, %2, %3}" >[(set_attr ("prefix") ("vex,evex")) > + (set_attr "addr" "gpr16,*") > (set_attr ("isa") ("avxvnni,avx512vnnivl"))]) > > (define_insn "vpdpwssd__mask" > @@ -30457,13 +30461,14 @@ > (unspec:VI4_AVX2 > [(match_operand:VI4_AVX2 1 "register_operand" "0,0") >(match_operand:VI4_AVX2 2 "register_operand" "x,v") > - (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xm,vm")] > + (match_operand:VI4_AVX2 3 "nonimmediate_operand" "xjm,vm")] > UNSPEC_VPDPWSSDS))] >"TARGET_AVXVNNI || (TARGET_AVX512VNNI && TARGET_AVX512VL)" >"@ >%{vex%} vpdpwssds\t{%3, %2, %0|%0, %2, %3} >vpdpwssds\t{%3, %2, %0|%0, %2, %3}" 
>[(set_attr ("prefix") ("vex,evex")) > + (set_attr "addr" "gpr16,*") > (set_attr ("isa") ("avxvnni,avx512vnnivl"))]) > > (define_insn "vpdpwssds__mask" > @@ -30681,13 +30686,14 @@ >[(set (match_operand:V8BF 0 "register_operand" "=x,v") > (vec_concat:V8BF > (float_truncate:V4BF > - (match_operand:V4SF 1 "nonimmediate_operand" "xm,vm")) > + (match_operand:V4SF 1 "nonimmediate_o
Re: [PATCH 4/4] i386: Optimization for APX NDD is always zero-uppered for shift
On Mon, Aug 12, 2024 at 3:12 PM kong lingling wrote: > > gcc/ChangeLog: > > > PR target/113729 > >* config/i386/i386.md (*ashlqi3_1_zext): > >New define_insn. > >(*ashlhi3_1_zext): Ditto. > >(*qi3_1_zext): Ditto. > >(*hi3_1_zext): Ditto. > >(*qi3_1_zext): Ditto. > >(*hi3_1_zext): Ditto. > > > > gcc/testsuite/ChangeLog: > > > >* gcc.target/i386/pr113729.c: Add testcase for shift and > rotate. Ok. -- BR, Hongtao
Re: [PATCH 3/4] i386: Optimization for APX NDD is always zero-uppered for logic
On Mon, Aug 12, 2024 at 3:12 PM kong lingling wrote: > > gcc/ChangeLog: > > >PR target/113729 > >* config/i386/i386.md (*andqi_1_zext): > >New define_insn. > >(*andhi_1_zext): Ditto. > >(*qi_1_zext): Ditto. > >(*hi_1_zext): Ditto. > >(*negqi_1_zext): Ditto. > >(*neghi_1_zext): Ditto. > >(*one_cmplqi2_1_zext): Ditto. > >(*one_cmplhi2_1_zext): Ditto. > > > > gcc/testsuite/ChangeLog: > > > >* gcc.target/i386/pr113729.c: Add new test for logic. Ok. -- BR, Hongtao
Re: [PATCH 2/4] i386: Optimization for APX NDD is always zero-uppered for sub/adc/sbb
On Mon, Aug 12, 2024 at 3:12 PM kong lingling wrote: > > gcc/ChangeLog: > > > >PR target/113729 > >* config/i386/i386.md (*subqi_1_zext): New > >define_insn. > >(*subhi_1_zext): Ditto. > >(*addqi3_carry_zext): Ditto. > >(*addhi3_carry_zext): Ditto. > >(*addqi3_carry_zext_0): Ditto. > >(*addhi3_carry_zext_0): Ditto. > >(*addqi3_carry_zext_0r): Ditto. > >(*addhi3_carry_zext_0r): Ditto. > >(*subqi3_carry_zext): Ditto. > >(*subhi3_carry_zext): Ditto. > >(*subqi3_carry_zext_0): Ditto. > >(*subhi3_carry_zext_0): Ditto. > >(*subqi3_carry_zext_0r): Ditto. > >(*subhi3_carry_zext_0r): Ditto. > > > > gcc/testsuite/ChangeLog: > > > >* gcc.target/i386/pr113729.c: Add test for sub. > >* gcc.target/i386/pr113729-adc-sbb.c: New test. > Ok. -- BR, Hongtao
Re: [PATCH 1/4] i386: Optimization for APX NDD is always zero-uppered for ADD
On Mon, Aug 12, 2024 at 3:10 PM kong lingling wrote: > > For APX instruction with an NDD, the destination GPR will get the > instruction’s result in bits [OSIZE-1:0] and, if OSIZE < 64b, have its upper > bits [63:OSIZE] zeroed. Now supporting other NDD instructions. > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > Ok for trunk? Ok. -- BR, Hongtao
Re: [PATCH] Move ix86_align_loops into a separate pass and insert the pass after pass_endbr_and_patchable_area.
On Mon, Aug 12, 2024 at 10:10 PM liuhongt wrote: > > > Are there any assumptions that BB_HEAD must be a note or label? > > Maybe we should move ix86_align_loops into a separate pass and insert > > the pass just before pass_final. > The patch inserts .p2align after endbr pass, it can also fix the issue. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Any comments? Committed > > gcc/ChangeLog: > > PR target/116174 > * config/i386/i386.cc (ix86_align_loops): Move this to .. > * config/i386/i386-features.cc (ix86_align_loops): .. here. > (class pass_align_tight_loops): New class. > (make_pass_align_tight_loops): New function. > * config/i386/i386-passes.def: Insert pass_align_tight_loops > after pass_insert_endbr_and_patchable_area. > * config/i386/i386-protos.h (make_pass_align_tight_loops): New > declare. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr116174.c: New test. > --- > gcc/config/i386/i386-features.cc | 190 +++ > gcc/config/i386/i386-passes.def | 3 + > gcc/config/i386/i386-protos.h| 1 + > gcc/config/i386/i386.cc | 146 - > gcc/testsuite/gcc.target/i386/pr116174.c | 12 ++ > 5 files changed, 206 insertions(+), 146 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr116174.c > > diff --git a/gcc/config/i386/i386-features.cc > b/gcc/config/i386/i386-features.cc > index c36d181f2d6..7e80e7b0103 100644 > --- a/gcc/config/i386/i386-features.cc > +++ b/gcc/config/i386/i386-features.cc > @@ -3417,6 +3417,196 @@ make_pass_apx_nf_convert (gcc::context *ctxt) >return new pass_apx_nf_convert (ctxt); > } > > +/* When a hot loop can be fit into one cacheline, > + force align the loop without considering the max skip. */ > +static void > +ix86_align_loops () > +{ > + basic_block bb; > + > + /* Don't do this when we don't know cache line size. 
*/ > + if (ix86_cost->prefetch_block == 0) > +return; > + > + loop_optimizer_init (AVOID_CFG_MODIFICATIONS); > + profile_count count_threshold = cfun->cfg->count_max / > param_align_threshold; > + FOR_EACH_BB_FN (bb, cfun) > +{ > + rtx_insn *label = BB_HEAD (bb); > + bool has_fallthru = 0; > + edge e; > + edge_iterator ei; > + > + if (!LABEL_P (label)) > + continue; > + > + profile_count fallthru_count = profile_count::zero (); > + profile_count branch_count = profile_count::zero (); > + > + FOR_EACH_EDGE (e, ei, bb->preds) > + { > + if (e->flags & EDGE_FALLTHRU) > + has_fallthru = 1, fallthru_count += e->count (); > + else > + branch_count += e->count (); > + } > + > + if (!fallthru_count.initialized_p () || !branch_count.initialized_p ()) > + continue; > + > + if (bb->loop_father > + && bb->loop_father->latch != EXIT_BLOCK_PTR_FOR_FN (cfun) > + && (has_fallthru > + ? (!(single_succ_p (bb) > + && single_succ (bb) == EXIT_BLOCK_PTR_FOR_FN (cfun)) > +&& optimize_bb_for_speed_p (bb) > +&& branch_count + fallthru_count > count_threshold > +&& (branch_count > fallthru_count * > param_align_loop_iterations)) > + /* In case there'no fallthru for the loop. > +Nops inserted won't be executed. */ > + : (branch_count > count_threshold > +|| (bb->count > bb->prev_bb->count * 10 > +&& (bb->prev_bb->count > +<= ENTRY_BLOCK_PTR_FOR_FN (cfun)->count / 2) > + { > + rtx_insn* insn, *end_insn; > + HOST_WIDE_INT size = 0; > + bool padding_p = true; > + basic_block tbb = bb; > + unsigned cond_branch_num = 0; > + bool detect_tight_loop_p = false; > + > + for (unsigned int i = 0; i != bb->loop_father->num_nodes; > + i++, tbb = tbb->next_bb) > + { > + /* Only handle continuous cfg layout. */ > + if (bb->loop_father != tbb->loop_father) > + { > + padding_p = false; > + break; > + } > + > + FOR_BB_INSNS (tbb, insn) > + { > + if (!NONDEBUG_INSN_P (insn)) > + continue; > + size += ix86_min_insn_size (insn); > + > + /* We don't know size of inline asm. > +Don't align loop for call. 
*/ > + if (asm_noperands (PATTERN (insn)) >= 0 > + || CALL_P (insn)) > + { > + size = -1; > + break; > + } > + } > + > + if (size == -1 || size > ix86_cost->prefetch_block) > + { > + padding_p = false; > + break; > + } >
Re: [PATCH 0/1] Initial support for AVX10.2
On Thu, Aug 1, 2024 at 3:50 PM Haochen Jiang wrote: > > Hi all, > > AVX10.2 technical details were published on July 31st at the > following link: > > https://cdrdv2.intel.com/v1/dl/getContent/828965 > > For new features and instructions, we could divide them into two parts. > One is ymm rounding control, the other is the new instructions. > > In the following weeks, we plan to upstream the ymm rounding part first, > followed by the new instructions. After all of them are upstreamed, we will > also upstream several patches optimizing codegen with the new AVX10.2 > instructions. > > The patch coming next is the initial support for AVX10.2. This patch > will be the foundation of all our patches. It adds the support for > cpuid, option, target attribute, etc. > > Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk? Ok. > > Thx, > Haochen > > -- BR, Hongtao
Re: PING: [PATCH] x86: Update BB_HEAD when aligning BB_HEAD
On Mon, Aug 12, 2024 at 6:59 AM H.J. Lu wrote: > > On Thu, Aug 8, 2024 at 6:53 PM H.J. Lu wrote: > > > > When we emit .p2align to align BB_HEAD, we must update BB_HEAD. Otherwise > > ENDBR will be inserted at the wrong place. > > > > gcc/ > > > > PR target/116174 > > * config/i386/i386.cc (ix86_align_loops): Update BB_HEAD when > > aligning BB_HEAD > > > > gcc/testsuite/ > > > > PR target/116174 > > * gcc.target/i386/pr116174.c: New test. > > > > Signed-off-by: H.J. Lu > > --- > > gcc/config/i386/i386.cc | 7 +-- > > gcc/testsuite/gcc.target/i386/pr116174.c | 12 > > 2 files changed, 17 insertions(+), 2 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr116174.c > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index 77c441893b4..ec6cc5e3548 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -23528,8 +23528,11 @@ ix86_align_loops () > > > > if (padding_p && detect_tight_loop_p) > > { > > - emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 > > (size)), > > - GEN_INT (0)), label); > > + rtx_insn *align = > > + emit_insn_before (gen_max_skip_align (GEN_INT (ceil_log2 > > (size)), > > + GEN_INT (0)), label); > > + if (BB_HEAD (bb) == label) > > + BB_HEAD (bb) = align; Are there any assumptions that BB_HEAD must be a note or label? Maybe we should move ix86_align_loops into a separate pass and insert the pass just before pass_final. > > /* End of function.
*/ > > if (!tbb || tbb == EXIT_BLOCK_PTR_FOR_FN (cfun)) > > break; > > diff --git a/gcc/testsuite/gcc.target/i386/pr116174.c > > b/gcc/testsuite/gcc.target/i386/pr116174.c > > new file mode 100644 > > index 000..8877d0b51af > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr116174.c > > @@ -0,0 +1,12 @@ > > +/* { dg-do compile { target *-*-linux* } } */ > > +/* { dg-options "-O2 -fcf-protection=branch" } */ > > + > > +char * > > +foo (char *dest, const char *src) > > +{ > > + while ((*dest++ = *src++) != '\0') > > +/* nothing */; > > + return --dest; > > +} > > + > > +/* { dg-final { scan-assembler "\t\.cfi_startproc\n\tendbr(32|64)\n" } } */ > > -- > > 2.45.2 > > > > PING. > > -- > H.J. -- BR, Hongtao
Re: [PATCH] Fix mismatch between constraint and predicate for ashl3_doubleword.
On Tue, Jul 30, 2024 at 11:04 AM liuhongt wrote: > > (insn 98 94 387 2 (parallel [ > (set (reg:TI 337 [ _32 ]) > (ashift:TI (reg:TI 329) > (reg:QI 521))) > (clobber (reg:CC 17 flags)) > ]) "test.c":11:13 953 {ashlti3_doubleword} > > is reloaded into > > (insn 98 452 387 2 (parallel [ > (set (reg:TI 0 ax [orig:337 _32 ] [337]) > (ashift:TI (const_int 1671291085 [0x639de0cd]) > (reg:QI 2 cx [521]))) > (clobber (reg:CC 17 flags)) > > since constraint n in the pattern accepts that. > (Not sure why reload doesn't check predicate) > > (define_insn "ashl3_doubleword" > [(set (match_operand:DWI 0 "register_operand" "=&r,&r") > (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r") > (match_operand:QI 2 "nonmemory_operand" "c,c"))) > > The patch fixes the mismatch between constraint and predicate. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? > > gcc/ChangeLog: > > PR target/116096 > * config/i386/constraints.md (Wc): New constraint for integer > 1 or -1. > * config/i386/i386.md (ashl3_doubleword): Refine > constraint with Wc. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr116096.c: New test. > --- > gcc/config/i386/constraints.md | 6 ++ > gcc/config/i386/i386.md | 2 +- > gcc/testsuite/gcc.target/i386/pr116096.c | 26 > 3 files changed, 33 insertions(+), 1 deletion(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr116096.c > > diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md > index 7508d7a58bd..154cbccd09e 100644 > --- a/gcc/config/i386/constraints.md > +++ b/gcc/config/i386/constraints.md > @@ -254,6 +254,12 @@ (define_constraint "Wb" >(and (match_code "const_int") > (match_test "IN_RANGE (ival, 0, 7)"))) > > +(define_constraint "Wc" > + "Integer constant -1 or 1." > + (and (match_code "const_int") > + (ior (match_test "op == constm1_rtx") > + (match_test "op == const1_rtx" > + > (define_constraint "Ww" >"Integer constant in the range 0 @dots{} 15, for 16-bit shifts." 
>(and (match_code "const_int") > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md > index 6207036a2a0..79d5de5b46a 100644 > --- a/gcc/config/i386/i386.md > +++ b/gcc/config/i386/i386.md > @@ -14774,7 +14774,7 @@ (define_insn_and_split "*ashl3_doubleword_mask_1" > > (define_insn "ashl3_doubleword" >[(set (match_operand:DWI 0 "register_operand" "=&r,&r") > - (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r") > + (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0Wc,r") > (match_operand:QI 2 "nonmemory_operand" "c,c"))) > (clobber (reg:CC FLAGS_REG))] >"" > diff --git a/gcc/testsuite/gcc.target/i386/pr116096.c > b/gcc/testsuite/gcc.target/i386/pr116096.c > new file mode 100644 > index 000..5ef39805f58 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr116096.c > @@ -0,0 +1,26 @@ > +/* { dg-do compile { target int128 } } */ > +/* { dg-options "-O2 -flive-range-shrinkage -fno-peephole2 -mstackrealign > -Wno-psabi" } */ > + > +typedef char U __attribute__((vector_size (32))); > +typedef unsigned V __attribute__((vector_size (32))); > +typedef __int128 W __attribute__((vector_size (32))); > +U g; > + > +W baz (); > + > +static inline U > +bar (V x, W y) > +{ > + y = y | y << (W) x; > + return (U)y; > +} > + > +void > +foo (W w) > +{ > + g = g << > +bar ((V){baz ()[1], 3, 3, 5, 7}, > +(W){w[0], ~(int) 2623676210}) >> > +bar ((V){baz ()[1]}, > +(W){-w[0], ~(int) 2623676210}); > +} > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH] i386: Fix memory constraint for APX NF
On Thu, Aug 1, 2024 at 10:03 AM Kong, Lingling wrote: > > > > > -Original Message- > > From: Liu, Hongtao > > Sent: Thursday, August 1, 2024 9:35 AM > > To: Kong, Lingling ; gcc-patches@gcc.gnu.org > > Cc: Wang, Hongyu > > Subject: RE: [PATCH] i386: Fix memory constraint for APX NF > > > > > > > > > -Original Message- > > > From: Kong, Lingling > > > Sent: Thursday, August 1, 2024 9:30 AM > > > To: gcc-patches@gcc.gnu.org > > > Cc: Liu, Hongtao ; Wang, Hongyu > > > > > > Subject: [PATCH] i386: Fix memory constraint for APX NF > > > > > > The je constraint should be used for APX NDD ADD with a register source > > > operand. The jM is for APX NDD patterns with an immediate operand. > > But these 2 alternatives are for non-NDD. > The jM constraint is for the size limit of 15 bytes with a non-default address > space, > and it also works for APX NF. The je is for TLS code with an EVEX prefix for ADD, and > APX NF > also has the EVEX prefix. I see. Could you also adjust apx_ndd_add_memory_operand and apx_ndd_memory_operand to apx_evex_add_memory_operand and apx_evex_memory_operand, and change the comments? That can be a separate patch. The patch LGTM. > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > > Ok for trunk? > > > > > > gcc/ChangeLog: > > > > > > * config/i386/i386.md (nf_mem_constraint): Fixed the constraint > > > for the define_subst_attr. > > > (nf_mem_constraint): Added new define_subst_attr. > > > (*add_1): Fixed the constraint.
> > > --- > > > gcc/config/i386/i386.md | 5 +++-- > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index > > > fb10fdc9f96..aa7220ee17c 100644 > > > --- a/gcc/config/i386/i386.md > > > +++ b/gcc/config/i386/i386.md > > > @@ -6500,7 +6500,8 @@ > > > (define_subst_attr "nf_name" "nf_subst" "_nf" "") (define_subst_attr > > > "nf_prefix" "nf_subst" "%{nf%} " "") (define_subst_attr "nf_condition" > > > "nf_subst" "TARGET_APX_NF" "true") -(define_subst_attr > > > "nf_mem_constraint" "nf_subst" "je" "m") > > > +(define_subst_attr "nf_add_mem_constraint" "nf_subst" "je" "m") > > > +(define_subst_attr "nf_mem_constraint" "nf_subst" "jM" "m") > > > (define_subst_attr "nf_applied" "nf_subst" "true" "false") > > > (define_subst_attr "nf_nonf_attr" "nf_subst" "noapx_nf" "*") > > > (define_subst_attr "nf_nonf_x64_attr" "nf_subst" "noapx_nf" "x64") @@ - > > 6514,7 +6515,7 @@ > > > (clobber (reg:CC FLAGS_REG))]) > > > > > > (define_insn "*add_1" > > > - [(set (match_operand:SWI48 0 "nonimmediate_operand" > > > "=rm,r,r,r,r,r,r,r") > > > + [(set (match_operand:SWI48 0 "nonimmediate_operand" > > > + "=r,r,r,r,r,r,r,r") > > > (plus:SWI48 > > > (match_operand:SWI48 1 "nonimmediate_operand" > > > "%0,0,0,r,r,rje,jM,r") > > > (match_operand:SWI48 2 "x86_64_general_operand" > > > "r,e,BM,0,le,r,e,BM")))] > > > -- > > > 2.31.1 -- BR, Hongtao
Re: [PATCH] i386: Mark target option with optimization when enabled with opt level [PR116065]
On Tue, Jul 30, 2024 at 1:05 PM Hongyu Wang wrote: > > Richard Biener wrote on Fri, Jul 26, 2024 at 19:45: > > > > On Fri, Jul 26, 2024 at 10:50 AM Hongyu Wang wrote: > > > > > > Hi, > > > > > > When introducing munroll-only-small-loops, the option was marked as > > > Target Save and added to the -O2 defaults, which makes attribute(optimize) > > > reset the target option, causing an error when the command line has -O1 and > > > the function attribute has -O2 and other target options. Mark this option > > > as Optimization to fix this. > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu. > > > > > > Ok for trunk and backport down to gcc-13? > > > > Note this requires bumping LTO_minor_version on branches. > > > > Yes, as the aarch64 fix was not backported I'd like to just fix it for trunk. Ok for trunk only. > > > > gcc/ChangeLog > > > > > > PR target/116065 > > > * config/i386/i386.opt (munroll-only-small-loops): Mark as > > > Optimization instead of Save. > > > > > > gcc/testsuite/ChangeLog > > > > > > PR target/116065 > > > * gcc.target/i386/pr116065.c: New test. > > > --- > > > gcc/config/i386/i386.opt | 2 +- > > > gcc/testsuite/gcc.target/i386/pr116065.c | 24 > > > 2 files changed, 25 insertions(+), 1 deletion(-) > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr116065.c > > > > > > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt > > > index 353fffb2343..52054bc018a 100644 > > > --- a/gcc/config/i386/i386.opt > > > +++ b/gcc/config/i386/i386.opt > > > @@ -1259,7 +1259,7 @@ Target Mask(ISA2_RAOINT) Var(ix86_isa_flags2) Save > > > Support RAOINT built-in functions and code generation. > > > > > > munroll-only-small-loops > > > -Target Var(ix86_unroll_only_small_loops) Init(0) Save > > > +Target Var(ix86_unroll_only_small_loops) Init(0) Optimization > > > Enable conservative small loop unrolling.
> > > > > > mlam= > > > diff --git a/gcc/testsuite/gcc.target/i386/pr116065.c > > > b/gcc/testsuite/gcc.target/i386/pr116065.c > > > new file mode 100644 > > > index 000..083e70f2413 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/i386/pr116065.c > > > @@ -0,0 +1,24 @@ > > > +/* PR target/116065 */ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O1 -mno-avx" } */ > > > + > > > +#ifndef __AVX__ > > > +#pragma GCC push_options > > > +#pragma GCC target("avx") > > > +#define __DISABLE_AVX__ > > > +#endif /* __AVX__ */ > > > + > > > +extern inline double __attribute__((__gnu_inline__,__always_inline__)) > > > + foo (double x) { return x; } > > > + > > > +#ifdef __DISABLE_AVX__ > > > +#undef __DISABLE_AVX__ > > > +#pragma GCC pop_options > > > +#endif /* __DISABLE_AVX__ */ > > > + > > > +void __attribute__((target ("avx"), optimize(3))) > > > +bar (double *p) > > > +{ > > > + *p = foo (*p); > > > +} > > > + > > > -- > > > 2.31.1 > > > -- BR, Hongtao
Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS
On Wed, Jul 31, 2024 at 3:17 PM Uros Bizjak wrote: > > On Wed, Jul 31, 2024 at 9:11 AM Hongtao Liu wrote: > > > > On Wed, Jul 31, 2024 at 1:06 AM Uros Bizjak wrote: > > > > > > On Tue, Jul 30, 2024 at 3:00 PM Richard Biener wrote: > > > > > > > > On Tue, 30 Jul 2024, Alexander Monakov wrote: > > > > > > > > > > > > > > On Tue, 30 Jul 2024, Richard Biener wrote: > > > > > > > > > > > > Oh, and please add a small comment why we don't use XFmode here. > > > > > > > > > > > > Will do. > > > > > > > > > > > > /* Do not enable XFmode, there is padding in it and it > > > > > > suffers > > > > > >from normalization upon load like SFmode and DFmode when > > > > > >not using SSE. */ > > > > > > > > > > Is it really true? I have no evidence of FLDT performing normalization > > > > > (as mentioned in PR 114659, if it did, there would be no way to > > > > > spill/reload > > > > > x87 registers). > > > > > > > > What mangling fld performs depends on the contents of the FP control > > > > word which is awkward. IIRC there's at least a bugreport that it > > > > turns sNaN into a qNaN, it seems I was wrong about denormals > > > > (when DM is not masked). And yes, IIRC x87 instability is also > > > > related to spills (IIRC we spill in the actual mode of the reg, not in > > > > XFmode), but -fexcess-precision=standard should hopefully avoid that. > > > > It's also not clear whether all implementations conformed to the > > > > specs wrt extended-precision format loads. > > > > > > FYI, FLDT does not mangle long-double values and does not generate > > > exceptions. Please see [1], but ignore shadowed text and instead read > > > the "Floating-Point Exceptions" section. So, as far as hardware is > > > concerned, it *can* be used to transfer 10-byte values, but I don't > > > want to judge from the compiler PoV if this is the way to go. We can > > > enable it, perhaps temporarily to experiment a bit - it is easy to > > > disable if it causes problems. 
> > > > > > Let's CC Intel folks for their opinion, if it is worth using an aging > > > x87 to transfer 80-bit data. > > I prefer not; in another hook, ix86_can_change_mode_class, we have > > > > 20372 /* x87 registers can't do subreg at all, as all values are > > reformatted > > 20373 to extended precision. */ > > 20374 if (MAYBE_FLOAT_CLASS_P (regclass)) > > 20375    return false; > > No, the above applies to SFmode subreg of XFmode value, which is a > > no-go. My question refers to the plain XFmode (80-bit) moves, where > > x87 is used simply to: > > fldt mem1 > > ... > > fstp mem2 > > where x87 is used to perform a move from one 80-bit location to the other. > > > I guess it eventually needs reload for XFmode. > > There are no reloads, as we would like to perform bit-exact 80-bit > move, e.g. array of 10 chars. Oh, it's a memory copy. I suspect that the hardware doesn't enable memory renaming for x87 instructions. So I prefer not. > > Uros. -- BR, Hongtao
Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS
On Wed, Jul 31, 2024 at 1:06 AM Uros Bizjak wrote: > > On Tue, Jul 30, 2024 at 3:00 PM Richard Biener wrote: > > > > On Tue, 30 Jul 2024, Alexander Monakov wrote: > > > > > > > > On Tue, 30 Jul 2024, Richard Biener wrote: > > > > > > > > Oh, and please add a small comment why we don't use XFmode here. > > > > > > > > Will do. > > > > > > > > /* Do not enable XFmode, there is padding in it and it suffers > > > >from normalization upon load like SFmode and DFmode when > > > >not using SSE. */ > > > > > > Is it really true? I have no evidence of FLDT performing normalization > > > (as mentioned in PR 114659, if it did, there would be no way to > > > spill/reload > > > x87 registers). > > > > What mangling fld performs depends on the contents of the FP control > > word which is awkward. IIRC there's at least a bugreport that it > > turns sNaN into a qNaN, it seems I was wrong about denormals > > (when DM is not masked). And yes, IIRC x87 instability is also > > related to spills (IIRC we spill in the actual mode of the reg, not in > > XFmode), but -fexcess-precision=standard should hopefully avoid that. > > It's also not clear whether all implementations conformed to the > > specs wrt extended-precision format loads. > > FYI, FLDT does not mangle long-double values and does not generate > exceptions. Please see [1], but ignore shadowed text and instead read > the "Floating-Point Exceptions" section. So, as far as hardware is > concerned, it *can* be used to transfer 10-byte values, but I don't > want to judge from the compiler PoV if this is the way to go. We can > enable it, perhaps temporarily to experiment a bit - it is easy to > disable if it causes problems. > > Let's CC Intel folks for their opinion, if it is worth using an aging > x87 to transfer 80-bit data. I prefer not, in another hook ix86_can_change_mode_class, we have 20372 /* x87 registers can't do subreg at all, as all values are reformatted 20373 to extended precision. 
*/ 20374 if (MAYBE_FLOAT_CLASS_P (regclass)) 20375    return false; I guess it eventually needs reload for XFmode. > > [1] https://www.felixcloutier.com/x86/fld > > Uros. -- BR, Hongtao
Re: [PATCH] i386: Remove ndd support for *add_4 [PR113744]
On Wed, Jul 31, 2024 at 2:08 PM Kong, Lingling wrote: > > *add_4 and *adddi_4 are for shorter opcode from cmp to inc/dec or add > $128. > > But NDD code is longer than the cmp code, so there is no need to support NDD. > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > Ok for trunk? Ok. > > > > gcc/ChangeLog: > > > >PR target/113744 > >* config/i386/i386.md (*add_4): Remove NDD support. > >(*adddi_4): Ditto. > > > > Co-Authored-By: Hu, Lin1 lin1...@intel.com > > --- > > gcc/config/i386/i386.md | 40 +++- > > 1 file changed, 15 insertions(+), 25 deletions(-) > > > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md > > index fb10fdc9f96..3c293c14656 100644 > > --- a/gcc/config/i386/i386.md > > +++ b/gcc/config/i386/i386.md > > @@ -7146,35 +7146,31 @@ > > (define_insn "*adddi_4" > >[(set (reg FLAGS_REG) > > (compare > > -(match_operand:DI 1 "nonimmediate_operand" "0,rm") > > -(match_operand:DI 2 "x86_64_immediate_operand" "e,e"))) > > - (clobber (match_scratch:DI 0 "=r,r"))] > > + (match_operand:DI 1 "nonimmediate_operand" "0") > > + (match_operand:DI 2 "x86_64_immediate_operand" "e"))) > > + (clobber (match_scratch:DI 0 "=r"))] > >"TARGET_64BIT > > && ix86_match_ccmode (insn, CCGCmode)" > > { > > - bool use_ndd = get_attr_isa (insn) == ISA_APX_NDD; > >switch (get_attr_type (insn)) > > { > > case TYPE_INCDEC: > >if (operands[2] == constm1_rtx) > > -return use_ndd ? "inc{q}\t{%1, %0|%0, %1}" : "inc{q}\t%0"; > > + return "inc{q}\t%0"; > >else > > { > > gcc_assert (operands[2] == const1_rtx); > > -return use_ndd ? "dec{q}\t{%1, %0|%0, %1}" : "dec{q}\t%0"; > > + return "dec{q}\t%0"; > > } > > default: > >if (x86_maybe_negate_const_int (&operands[2], DImode)) > > - return use_ndd ? "add{q}\t{%2, %1, %0|%0, %1, %2}" > > -: "add{q}\t{%2, %0|%0, %2}"; > > + return "add{q}\t{%2, %0|%0, %2}"; > > - return use_ndd ? 
"sub{q}\t{%2, %1, %0|%0, %1, %2}" > > - : "sub{q}\t{%2, %0|%0, %2}"; > > + return "sub{q}\t{%2, %0|%0, %2}"; > > } > > } > > - [(set_attr "isa" "*,apx_ndd") > > - (set (attr "type") > > + [(set (attr "type") > > (if_then_else (match_operand:DI 2 "incdec_operand") > > (const_string "incdec") > > (const_string "alu"))) > > @@ -7195,36 +7191,30 @@ > > (define_insn "*add_4" > >[(set (reg FLAGS_REG) > > (compare > > -(match_operand:SWI124 1 "nonimmediate_operand" "0,rm") > > + (match_operand:SWI124 1 "nonimmediate_operand" "0") > > (match_operand:SWI124 2 "const_int_operand"))) > > - (clobber (match_scratch:SWI124 0 "=,r"))] > > + (clobber (match_scratch:SWI124 0 "="))] > >"ix86_match_ccmode (insn, CCGCmode)" > > { > > - bool use_ndd = get_attr_isa (insn) == ISA_APX_NDD; > >switch (get_attr_type (insn)) > > { > > case TYPE_INCDEC: > >if (operands[2] == constm1_rtx) > > -return use_ndd ? "inc{}\t{%1, %0|%0, %1}" > > -: "inc{}\t%0"; > > +return "inc{}\t%0"; > >else > > { > > gcc_assert (operands[2] == const1_rtx); > > -return use_ndd ? "dec{}\t{%1, %0|%0, %1}" > > -: "dec{}\t%0"; > > + return "dec{}\t%0"; > > } > > default: > >if (x86_maybe_negate_const_int (&operands[2], mode)) > > - return use_ndd ? "add{}\t{%2, %1, %0|%0, %1, %2}" > > -: "add{}\t{%2, %0|%0, %2}"; > > + return "add{}\t{%2, %0|%0, %2}"; > > - return use_ndd ? "sub{}\t{%2, %1, %0|%0, %1, %2}" > > - : "sub{}\t{%2, %0|%0, %2}"; > > + return "sub{}\t{%2, %0|%0, %2}"; > > } > > } > > - [(set_attr "isa" "*,apx_ndd") > > - (set (attr "type") > > + [(set (attr "type") > > (if_then_else (match_operand: 2 "incdec_operand") > > (const_string "incdec") > > (const_string "alu"))) > > -- > > 2.31.1 -- BR, Hongtao
Re: [PATCH v2] i386: Add non-optimize prefetchi intrins
On Tue, Jul 30, 2024 at 9:27 AM Hongtao Liu wrote: > > On Fri, Jul 26, 2024 at 4:55 PM Haochen Jiang wrote: > > > > Hi all, > > > > I added related O0 testcase in this patch. > > > > Ok for trunk and backport to GCC 14 and GCC 13? > Ok. I mean for trunk, and it needs jakub's approval to backport to GCC14.2. > > > > Thx, > > Haochen > > > > --- > > > > Changes in v2: Add testcases. > > > > --- > > > > Under -O0, with the "newly" introduced intrins, the variable will be > > transformed as mem instead of the origin symbol_ref. The compiler will > > then treat the operand as invalid and turn the operation into nop, which > > is not expected. Use macro for non-optimize to keep the variable as > > symbol_ref just as how prefetch intrin does. > > > > gcc/ChangeLog: > > > > * config/i386/prfchiintrin.h > > (_m_prefetchit0): Add macro for non-optimized option. > > (_m_prefetchit1): Ditto. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/prefetchi-1b.c: New test. > > --- > > gcc/config/i386/prfchiintrin.h | 9 +++ > > gcc/testsuite/gcc.target/i386/prefetchi-1b.c | 26 > > 2 files changed, 35 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/i386/prefetchi-1b.c > > > > diff --git a/gcc/config/i386/prfchiintrin.h b/gcc/config/i386/prfchiintrin.h > > index dfca89c7d16..d6580e504c0 100644 > > --- a/gcc/config/i386/prfchiintrin.h > > +++ b/gcc/config/i386/prfchiintrin.h > > @@ -37,6 +37,7 @@ > > #define __DISABLE_PREFETCHI__ > > #endif /* __PREFETCHI__ */ > > > > +#ifdef __OPTIMIZE__ > > extern __inline void > > __attribute__((__gnu_inline__, __always_inline__, __artificial__)) > > _m_prefetchit0 (void* __P) > > @@ -50,6 +51,14 @@ _m_prefetchit1 (void* __P) > > { > >__builtin_ia32_prefetchi (__P, 2); > > } > > +#else > > +#define _m_prefetchit0(P) \ > > + __builtin_ia32_prefetchi(P, 3); > > + > > +#define _m_prefetchit1(P) \ > > + __builtin_ia32_prefetchi(P, 2); > > + > > +#endif > > > > #ifdef __DISABLE_PREFETCHI__ > > #undef __DISABLE_PREFETCHI__ > > diff 
--git a/gcc/testsuite/gcc.target/i386/prefetchi-1b.c > > b/gcc/testsuite/gcc.target/i386/prefetchi-1b.c > > new file mode 100644 > > index 000..93139554d3c > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/prefetchi-1b.c > > @@ -0,0 +1,26 @@ > > +/* { dg-do compile { target { ! ia32 } } } */ > > +/* { dg-options "-mprefetchi -O0" } */ > > +/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit0\[ > > \\t\]+bar\\(%rip\\)" 1 } } */ > > +/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit1\[ > > \\t\]+bar\\(%rip\\)" 1 } } */ > > + > > +#include > > + > > +int > > +bar (int a) > > +{ > > + return a + 1; > > +} > > + > > +int > > +foo1 (int b) > > +{ > > + _m_prefetchit0 (bar); > > + return bar (b) + 1; > > +} > > + > > +int > > +foo2 (int b) > > +{ > > + _m_prefetchit1 (bar); > > + return bar (b) + 1; > > +} > > -- > > 2.31.1 > > > > > -- > BR, > Hongtao -- BR, Hongtao
Re: [PATCH v2] i386: Add non-optimize prefetchi intrins
On Fri, Jul 26, 2024 at 4:55 PM Haochen Jiang wrote: > > Hi all, > > I added related O0 testcase in this patch. > > Ok for trunk and backport to GCC 14 and GCC 13? Ok. > > Thx, > Haochen > > --- > > Changes in v2: Add testcases. > > --- > > Under -O0, with the "newly" introduced intrins, the variable will be > transformed as mem instead of the origin symbol_ref. The compiler will > then treat the operand as invalid and turn the operation into nop, which > is not expected. Use macro for non-optimize to keep the variable as > symbol_ref just as how prefetch intrin does. > > gcc/ChangeLog: > > * config/i386/prfchiintrin.h > (_m_prefetchit0): Add macro for non-optimized option. > (_m_prefetchit1): Ditto. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/prefetchi-1b.c: New test. > --- > gcc/config/i386/prfchiintrin.h | 9 +++ > gcc/testsuite/gcc.target/i386/prefetchi-1b.c | 26 > 2 files changed, 35 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/i386/prefetchi-1b.c > > diff --git a/gcc/config/i386/prfchiintrin.h b/gcc/config/i386/prfchiintrin.h > index dfca89c7d16..d6580e504c0 100644 > --- a/gcc/config/i386/prfchiintrin.h > +++ b/gcc/config/i386/prfchiintrin.h > @@ -37,6 +37,7 @@ > #define __DISABLE_PREFETCHI__ > #endif /* __PREFETCHI__ */ > > +#ifdef __OPTIMIZE__ > extern __inline void > __attribute__((__gnu_inline__, __always_inline__, __artificial__)) > _m_prefetchit0 (void* __P) > @@ -50,6 +51,14 @@ _m_prefetchit1 (void* __P) > { >__builtin_ia32_prefetchi (__P, 2); > } > +#else > +#define _m_prefetchit0(P) \ > + __builtin_ia32_prefetchi(P, 3); > + > +#define _m_prefetchit1(P) \ > + __builtin_ia32_prefetchi(P, 2); > + > +#endif > > #ifdef __DISABLE_PREFETCHI__ > #undef __DISABLE_PREFETCHI__ > diff --git a/gcc/testsuite/gcc.target/i386/prefetchi-1b.c > b/gcc/testsuite/gcc.target/i386/prefetchi-1b.c > new file mode 100644 > index 000..93139554d3c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/prefetchi-1b.c > @@ -0,0 +1,26 @@ > +/* { dg-do compile { 
target { ! ia32 } } } */ > +/* { dg-options "-mprefetchi -O0" } */ > +/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit0\[ > \\t\]+bar\\(%rip\\)" 1 } } */ > +/* { dg-final { scan-assembler-times "\[ \\t\]+prefetchit1\[ > \\t\]+bar\\(%rip\\)" 1 } } */ > + > +#include > + > +int > +bar (int a) > +{ > + return a + 1; > +} > + > +int > +foo1 (int b) > +{ > + _m_prefetchit0 (bar); > + return bar (b) + 1; > +} > + > +int > +foo2 (int b) > +{ > + _m_prefetchit1 (bar); > + return bar (b) + 1; > +} > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH] [x86]Refine constraint "Bk" to define_special_memory_constraint.
On Thu, Jul 25, 2024 at 3:23 PM Hongtao Liu wrote: > > On Wed, Jul 24, 2024 at 3:57 PM liuhongt wrote: > > > > For the pattern below, RA may still allocate r162 as a v/k register and try to > > reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi, > > which results in a linker error. > > > > (set (reg:DI 162) > > (mem/u/c:DI > >(const:DI (unspec:DI > > [(symbol_ref:DI ("a") [flags 0x60] > 0x7f621f6e1c60 a>)] > > UNSPEC_GOTNTPOFF)) > > > > Quote from H.J. on why the linker issues an error. > > >What do these do: > > > > > >leaq__libc_tsd_CTYPE_B@gottpoff(%rip), %rax > > >vmovq (%rax), %xmm0 > > > > > >From the x86-64 TLS psABI: > > > > > >The assembler generates for the x@gottpoff(%rip) expressions a > > >R_X86_64_GOTTPOFF relocation for the symbol x which requests the linker to > > >generate a GOT entry with a R_X86_64_TPOFF64 relocation. The offset of > > >the GOT entry relative to the end of the instruction is then used in > > >the instruction. The R_X86_64_TPOFF64 relocation is processed at > > >program startup time by the dynamic linker by looking up the symbol x > > >in the modules loaded at that point. The offset is written in the GOT > > >entry and later loaded by the addq instruction. > > > > > >The above code sequence looks wrong to me. > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > Ok for trunk and backport? Committed and will backport after gcc14.2 is released. > > > > gcc/ChangeLog: > > > > PR target/116043 > > * config/i386/constraints.md (Bk): Refine to > > define_special_memory_constraint. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/pr116043.c: New test.
> > --- > > gcc/config/i386/constraints.md | 2 +- > > gcc/testsuite/gcc.target/i386/pr116043.c | 33 > > 2 files changed, 34 insertions(+), 1 deletion(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr116043.c > > > > diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md > > index 7508d7a58bd..b760e7c221a 100644 > > --- a/gcc/config/i386/constraints.md > > +++ b/gcc/config/i386/constraints.md > > @@ -187,7 +187,7 @@ (define_special_memory_constraint "Bm" > >"@internal Vector memory operand." > >(match_operand 0 "vector_memory_operand")) > > > > -(define_memory_constraint "Bk" > > +(define_special_memory_constraint "Bk" > >"@internal TLS address that allows insn using non-integer registers." > >(and (match_operand 0 "memory_operand") > > (not (match_test "ix86_gpr_tls_address_pattern_p (op)" > > diff --git a/gcc/testsuite/gcc.target/i386/pr116043.c > > b/gcc/testsuite/gcc.target/i386/pr116043.c > > new file mode 100644 > > index 000..76553496c10 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr116043.c > > @@ -0,0 +1,33 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-mavx512bf16 -O3" } */ > > +/* { dg-final { scan-assembler-not {(?n)lea.*@gottpoff} } } */ > > + > > +extern __thread int a, c, i, j, k, l; > > +int *b; > > +struct d { > > + int e; > > +} f, g; > > +char *h; > > + > > +void m(struct d *n) { > > + b = &k; > > + for (; n->e; b++, n--) { > > +i = b && a; > > +if (i) > > + j = c; > > + } > > +} > > + > > +char *o(struct d *n) { > > + for (; n->e;) > > +return h; > > +} > > + > > +int q() { > > + if (l) > > +return 1; > > + int p = *o(&g); > > + m(&f); > > + m(&g); > > + l = p; > > +} > > -- > > 2.31.1 > > > > > -- > BR, > Hongtao -- BR, Hongtao
Re: [PATCH] Fix mismatch between constraint and predicate for ashl3_doubleword.
On Fri, Jul 26, 2024 at 2:59 PM liuhongt wrote: > > (insn 98 94 387 2 (parallel [ > (set (reg:TI 337 [ _32 ]) > (ashift:TI (reg:TI 329) > (reg:QI 521))) > (clobber (reg:CC 17 flags)) > ]) "test.c":11:13 953 {ashlti3_doubleword} > > is reloaded into > > (insn 98 452 387 2 (parallel [ > (set (reg:TI 0 ax [orig:337 _32 ] [337]) > (ashift:TI (const_int 1671291085 [0x639de0cd]) > (reg:QI 2 cx [521]))) > (clobber (reg:CC 17 flags)) > > since constraint n in the pattern accepts that. > (Not sure why reload doesn't check predicate) > > (define_insn "ashl3_doubleword" > [(set (match_operand:DWI 0 "register_operand" "=&r,&r") > (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r") > (match_operand:QI 2 "nonmemory_operand" "c,c"))) > > The patch fixes the mismatch between constraint and predicate. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? Please ignore this, I need to support 1 in the constraint. > > gcc/ChangeLog: > > PR target/116096 > * config/i386/constraints.md (BC): Move TARGET_SSE to > vector_all_ones_operand. > * config/i386/i386.md (ashl3_doubleword): Refine > constraint with BC. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr116096.c: New test. > --- > gcc/config/i386/constraints.md | 4 ++-- > gcc/config/i386/i386.md | 2 +- > gcc/testsuite/gcc.target/i386/pr116096.c | 26 > 3 files changed, 29 insertions(+), 3 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr116096.c > > diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md > index 7508d7a58bd..fd032c2b9f0 100644 > --- a/gcc/config/i386/constraints.md > +++ b/gcc/config/i386/constraints.md > @@ -225,8 +225,8 @@ (define_constraint "Bz" > > (define_constraint "BC" >"@internal integer SSE constant with all bits set operand." 
> - (and (match_test "TARGET_SSE") > - (ior (match_test "op == constm1_rtx") > + (ior (match_test "op == constm1_rtx") > + (and (match_test "TARGET_SSE") > (match_operand 0 "vector_all_ones_operand" > > (define_constraint "BF" > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md > index 6207036a2a0..9c4e847fba1 100644 > --- a/gcc/config/i386/i386.md > +++ b/gcc/config/i386/i386.md > @@ -14774,7 +14774,7 @@ (define_insn_and_split "*ashl3_doubleword_mask_1" > > (define_insn "ashl3_doubleword" >[(set (match_operand:DWI 0 "register_operand" "=&r,&r") > - (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r") > + (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0BC,r") > (match_operand:QI 2 "nonmemory_operand" "c,c"))) > (clobber (reg:CC FLAGS_REG))] >"" > diff --git a/gcc/testsuite/gcc.target/i386/pr116096.c > b/gcc/testsuite/gcc.target/i386/pr116096.c > new file mode 100644 > index 000..5ef39805f58 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr116096.c > @@ -0,0 +1,26 @@ > +/* { dg-do compile { target int128 } } */ > +/* { dg-options "-O2 -flive-range-shrinkage -fno-peephole2 -mstackrealign > -Wno-psabi" } */ > + > +typedef char U __attribute__((vector_size (32))); > +typedef unsigned V __attribute__((vector_size (32))); > +typedef __int128 W __attribute__((vector_size (32))); > +U g; > + > +W baz (); > + > +static inline U > +bar (V x, W y) > +{ > + y = y | y << (W) x; > + return (U)y; > +} > + > +void > +foo (W w) > +{ > + g = g << > +bar ((V){baz ()[1], 3, 3, 5, 7}, > +(W){w[0], ~(int) 2623676210}) >> > +bar ((V){baz ()[1]}, > +(W){-w[0], ~(int) 2623676210}); > +} > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH Ping] i386: Use BLKmode for {ld,st}tilecfg
On Fri, Jul 26, 2024 at 2:28 PM Jiang, Haochen wrote: > > Ping for this patch > > Thx, > Haochen > > > -Original Message- > > From: Haochen Jiang > > Sent: Thursday, July 18, 2024 9:45 AM > > To: gcc-patches@gcc.gnu.org > > Cc: Liu, Hongtao ; hjl.to...@gmail.com; > > ubiz...@gmail.com > > Subject: [PATCH] i386: Use BLKmode for {ld,st}tilecfg > > > > Hi all, > > > > For AMX instructions related with memory, we will treat the memory > > size as not specified since there won't be different size causing > > confusion for memory. > > > > This will change the output under Intel mode, which is broken for now when > > using with assembler and aligns to current binutils behavior. > > > > Bootstrapped and regtested on x86-64-pc-linux-gnu. Ok for trunk? Ok. > > > > Thx, > > Haochen > > > > gcc/ChangeLog: > > > > * config/i386/i386-expand.cc (ix86_expand_builtin): Change > > from XImode to BLKmode. > > * config/i386/i386.md (ldtilecfg): Change XI to BLK. > > (sttilecfg): Ditto. > > --- > > gcc/config/i386/i386-expand.cc | 2 +- > > gcc/config/i386/i386.md| 12 +--- > > 2 files changed, 6 insertions(+), 8 deletions(-) > > > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > > index 9a31e6df2aa..d9ad06264aa 100644 > > --- a/gcc/config/i386/i386-expand.cc > > +++ b/gcc/config/i386/i386-expand.cc > > @@ -14198,7 +14198,7 @@ ix86_expand_builtin (tree exp, rtx target, rtx > > subtarget, > > op0 = convert_memory_address (Pmode, op0); > > op0 = copy_addr_to_reg (op0); > > } > > - op0 = gen_rtx_MEM (XImode, op0); > > + op0 = gen_rtx_MEM (BLKmode, op0); > >if (fcode == IX86_BUILTIN_LDTILECFG) > > icode = CODE_FOR_ldtilecfg; > >else > > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md > > index de9f4ba0496..86989d4875a 100644 > > --- a/gcc/config/i386/i386.md > > +++ b/gcc/config/i386/i386.md > > @@ -28975,24 +28975,22 @@ > > (set_attr "type" "other")]) > > > > (define_insn "ldtilecfg" > > - [(unspec_volatile [(match_operand:XI 0 "memory_operand" 
"m")] > > + [(unspec_volatile [(match_operand:BLK 0 "memory_operand" "m")] > > UNSPECV_LDTILECFG)] > >"TARGET_AMX_TILE" > >"ldtilecfg\t%0" > >[(set_attr "type" "other") > > (set_attr "prefix" "maybe_evex") > > - (set_attr "memory" "load") > > - (set_attr "mode" "XI")]) > > + (set_attr "memory" "load")]) > > > > (define_insn "sttilecfg" > > - [(set (match_operand:XI 0 "memory_operand" "=m") > > -(unspec_volatile:XI [(const_int 0)] UNSPECV_STTILECFG))] > > + [(set (match_operand:BLK 0 "memory_operand" "=m") > > +(unspec_volatile:BLK [(const_int 0)] UNSPECV_STTILECFG))] > >"TARGET_AMX_TILE" > >"sttilecfg\t%0" > >[(set_attr "type" "other") > > (set_attr "prefix" "maybe_evex") > > - (set_attr "memory" "store") > > - (set_attr "mode" "XI")]) > > + (set_attr "memory" "store")]) > > > > (include "mmx.md") > > (include "sse.md") > > -- > > 2.31.1 > -- BR, Hongtao
Re: [PATCH] [x86]Refine constraint "Bk" to define_special_memory_constraint.
On Wed, Jul 24, 2024 at 3:57 PM liuhongt wrote: > > For below pattern, RA may still allocate r162 as v/k register, try to > reload for address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi > which result a linker error. > > (set (reg:DI 162) > (mem/u/c:DI >(const:DI (unspec:DI > [(symbol_ref:DI ("a") [flags 0x60] a>)] > UNSPEC_GOTNTPOFF)) > > Quote from H.J for why linker issue an error. > >What do these do: > > > >leaq__libc_tsd_CTYPE_B@gottpoff(%rip), %rax > >vmovq (%rax), %xmm0 > > > >From x86-64 TLS psABI: > > > >The assembler generates for the x@gottpoff(%rip) expressions a R X86 > >64 GOTTPOFF relocation for the symbol x which requests the linker to > >generate a GOT entry with a R X86 64 TPOFF64 relocation. The offset of > >the GOT entry relative to the end of the instruction is then used in > >the instruction. The R X86 64 TPOFF64 relocation is pro- cessed at > >program startup time by the dynamic linker by looking up the symbol x > >in the modules loaded at that point. The offset is written in the GOT > >entry and later loaded by the addq instruction. > > > >The above code sequence looks wrong to me. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk and backport? > > gcc/ChangeLog: > > PR target/116043 > * config/i386/constraints.md (Bk): Refine to > define_special_memory_constraint. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr116043.c: New test. > --- > gcc/config/i386/constraints.md | 2 +- > gcc/testsuite/gcc.target/i386/pr116043.c | 33 > 2 files changed, 34 insertions(+), 1 deletion(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr116043.c > > diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md > index 7508d7a58bd..b760e7c221a 100644 > --- a/gcc/config/i386/constraints.md > +++ b/gcc/config/i386/constraints.md > @@ -187,7 +187,7 @@ (define_special_memory_constraint "Bm" >"@internal Vector memory operand." 
>(match_operand 0 "vector_memory_operand")) > > -(define_memory_constraint "Bk" > +(define_special_memory_constraint "Bk" >"@internal TLS address that allows insn using non-integer registers." >(and (match_operand 0 "memory_operand") > (not (match_test "ix86_gpr_tls_address_pattern_p (op)" > diff --git a/gcc/testsuite/gcc.target/i386/pr116043.c > b/gcc/testsuite/gcc.target/i386/pr116043.c > new file mode 100644 > index 000..76553496c10 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr116043.c > @@ -0,0 +1,33 @@ > +/* { dg-do compile } */ > +/* { dg-options "-mavx512bf16 -O3" } */ > +/* { dg-final { scan-assembler-not {(?n)lea.*@gottpoff} } } */ > + > +extern __thread int a, c, i, j, k, l; > +int *b; > +struct d { > + int e; > +} f, g; > +char *h; > + > +void m(struct d *n) { > + b = &k; > + for (; n->e; b++, n--) { > +i = b && a; > +if (i) > + j = c; > + } > +} > + > +char *o(struct d *n) { > + for (; n->e;) > +return h; > +} > + > +int q() { > + if (l) > +return 1; > + int p = *o(&g); > + m(&f); > + m(&g); > + l = p; > +} > -- > 2.31.1 > -- BR, Hongtao
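Background for why the reloaded sequence above is invalid: the value in the x@gottpoff GOT slot is a thread-pointer-relative *offset*, not an address, and it only becomes an address when combined with the %fs segment base by an integer instruction — something an SSE register cannot participate in. A minimal sketch of an initial-exec-style access; the asm comment shows the typical sequence for illustration only, not exact GCC output:

```c
#include <assert.h>

/* A thread-local variable accessed in the initial-exec model compiles to
   roughly (illustrative sketch):
     movq  x@gottpoff(%rip), %rax    # load TP-relative offset from the GOT
     movl  %fs:(%rax), %eax          # add the thread pointer, load the value
   The GOT slot must therefore land in a general-purpose register;
   materializing it via lea for a non-GPR (v/k) register cannot work,
   which is why "Bk" must reject such reloads.  */
__thread int x;

int
get_x (void)
{
  return x;
}
```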
Re: [PATCH] i386: Adjust rtx cost for imulq and imulw [PR115749]
On Wed, Jul 24, 2024 at 3:11 PM Kong, Lingling wrote: > > Tested spec2017 performance in Sierra Forest, Icelake, CascadeLake, at least > there is no obvious regression. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > OK for trunk? Ok. > > gcc/ChangeLog: > > * config/i386/x86-tune-costs.h (struct processor_costs): > Adjust rtx_cost of imulq and imulw for COST_N_INSNS (4) > to COST_N_INSNS (3). > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr115749.c: New test. > --- > gcc/config/i386/x86-tune-costs.h | 16 > gcc/testsuite/gcc.target/i386/pr115749.c | 16 > 2 files changed, 24 insertions(+), 8 deletions(-) create mode 100644 > gcc/testsuite/gcc.target/i386/pr115749.c > > diff --git a/gcc/config/i386/x86-tune-costs.h > b/gcc/config/i386/x86-tune-costs.h > index 769f334e531..2bfaee554d5 100644 > --- a/gcc/config/i386/x86-tune-costs.h > +++ b/gcc/config/i386/x86-tune-costs.h > @@ -2182,7 +2182,7 @@ struct processor_costs skylake_cost = { >COSTS_N_INSNS (1), /* variable shift costs */ >COSTS_N_INSNS (1), /* constant shift costs */ >{COSTS_N_INSNS (3), /* cost of starting multiply for QI */ > - COSTS_N_INSNS (4), /* HI */ > + COSTS_N_INSNS (3), /* HI */ > COSTS_N_INSNS (3), /* SI */ > COSTS_N_INSNS (3), /* DI */ > COSTS_N_INSNS (3)}, /*other */ > @@ -2310,7 +2310,7 @@ struct processor_costs icelake_cost = { >COSTS_N_INSNS (1), /* variable shift costs */ >COSTS_N_INSNS (1), /* constant shift costs */ >{COSTS_N_INSNS (3), /* cost of starting multiply for QI */ > - COSTS_N_INSNS (4), /* HI */ > + COSTS_N_INSNS (3), /* HI */ > COSTS_N_INSNS (3), /* SI */ > COSTS_N_INSNS (3), /* DI */ > COSTS_N_INSNS (3)}, /*other */ > @@ -2434,9 +2434,9 @@ struct processor_costs alderlake_cost = { >COSTS_N_INSNS (1), /* variable shift costs */ >COSTS_N_INSNS (1), /* constant shift costs */ >{COSTS_N_INSNS (3), /* cost of starting multiply for QI */ > - COSTS_N_INSNS (4), /* HI */ > + COSTS_N_INSNS (3), /* HI */ > COSTS_N_INSNS (3), /* SI */ > - COSTS_N_INSNS (4), /* DI */ 
> + COSTS_N_INSNS (3), /* DI */ > COSTS_N_INSNS (4)}, /*other */ >0, /* cost of multiply per each bit set > */ >{COSTS_N_INSNS (16), /* cost of a divide/mod for QI */ > @@ -3234,9 +3234,9 @@ struct processor_costs tremont_cost = { >COSTS_N_INSNS (1), /* variable shift costs */ >COSTS_N_INSNS (1), /* constant shift costs */ >{COSTS_N_INSNS (3), /* cost of starting multiply for QI */ > - COSTS_N_INSNS (4), /* HI */ > + COSTS_N_INSNS (3), /* HI */ > COSTS_N_INSNS (3), /* SI */ > - COSTS_N_INSNS (4), /* DI */ > + COSTS_N_INSNS (3), /* DI */ > COSTS_N_INSNS (4)}, /*other */ >0, /* cost of multiply per each bit set > */ >{COSTS_N_INSNS (16), /* cost of a divide/mod for QI */ > @@ -3816,9 +3816,9 @@ struct processor_costs generic_cost = { >COSTS_N_INSNS (1), /* variable shift costs */ >COSTS_N_INSNS (1), /* constant shift costs */ >{COSTS_N_INSNS (3), /* cost of starting multiply for QI */ > - COSTS_N_INSNS (4), /* HI */ > + COSTS_N_INSNS (3), /* HI */ > COSTS_N_INSNS (3), /* SI */ > - COSTS_N_INSNS (4), /* DI */ > + COSTS_N_INSNS (3), /* DI */ > COSTS_N_INSNS (4)}, /*other */ >0
Re: [PATCH] x86: Don't enable APX_F in 32-bit mode.
On Thu, Jul 18, 2024 at 5:29 PM Kong, Lingling wrote: > > I adjusted my patch based on the comments by H.J. > And I will add the testcase like gcc.target/i386/pr101395-1.c when the march > for APX is determined. > > Ok for trunk? Synced with LLVM folks, they agreed to this solution. Ok. > > Thanks, > Lingling > > gcc/ChangeLog: > > PR target/115978 > * config/i386/driver-i386.cc (host_detect_local_cpu): Enable > APX_F only for 64-bit codegen. > * config/i386/i386-options.cc (DEF_PTA): Skip PTA_APX_F if > not in 64-bit mode. > > gcc/testsuite/ChangeLog: > > PR target/115978 > * gcc.target/i386/pr115978-1.c: New test. > * gcc.target/i386/pr115978-2.c: Ditto. > --- > gcc/config/i386/driver-i386.cc | 3 ++- > gcc/config/i386/i386-options.cc| 3 ++- > gcc/testsuite/gcc.target/i386/pr115978-1.c | 22 ++ > gcc/testsuite/gcc.target/i386/pr115978-2.c | 6 ++ > 4 files changed, 32 insertions(+), 2 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr115978-1.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr115978-2.c > > diff --git a/gcc/config/i386/driver-i386.cc b/gcc/config/i386/driver-i386.cc > index 11470eaea12..445f5640155 100644 > --- a/gcc/config/i386/driver-i386.cc > +++ b/gcc/config/i386/driver-i386.cc > @@ -900,7 +900,8 @@ const char *host_detect_local_cpu (int argc, const char > **argv) > if (has_feature (isa_names_table[i].feature)) > { > if (codegen_x86_64 > - || isa_names_table[i].feature != FEATURE_UINTR) > + || (isa_names_table[i].feature != FEATURE_UINTR > + && isa_names_table[i].feature != FEATURE_APX_F)) > options = concat (options, " ", > isa_names_table[i].option, NULL); > } > diff --git a/gcc/config/i386/i386-options.cc > b/gcc/config/i386/i386-options.cc index 059ef3ae6ad..1c8f7835af2 100644 > --- a/gcc/config/i386/i386-options.cc > +++ b/gcc/config/i386/i386-options.cc > @@ -2351,7 +2351,8 @@ ix86_option_override_internal (bool main_args_p, > #define DEF_PTA(NAME) \ > if (((processor_alias_table[i].flags & PTA_ ## NAME) != 0) \ > && 
PTA_ ## NAME != PTA_64BIT \ > - && (TARGET_64BIT || PTA_ ## NAME != PTA_UINTR) \ > + && (TARGET_64BIT || (PTA_ ## NAME != PTA_UINTR \ > +&& PTA_ ## NAME != PTA_APX_F))\ > && !TARGET_EXPLICIT_ ## NAME ## _P (opts)) \ > SET_TARGET_ ## NAME (opts); > #include "i386-isa.def" > diff --git a/gcc/testsuite/gcc.target/i386/pr115978-1.c > b/gcc/testsuite/gcc.target/i386/pr115978-1.c > new file mode 100644 > index 000..18a1c5f153a > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr115978-1.c > @@ -0,0 +1,22 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -march=native" } */ > + > +int > +main () > +{ > + if (__builtin_cpu_supports ("apxf")) > +{ > +#ifdef __x86_64__ > +# ifndef __APX_F__ > + __builtin_abort (); > +# endif > +#else > +# ifdef __APX_F__ > + __builtin_abort (); > +# endif > +#endif > + return 0; > +} > + > + return 0; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr115978-2.c > b/gcc/testsuite/gcc.target/i386/pr115978-2.c > new file mode 100644 > index 000..900d6eb096a > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr115978-2.c > @@ -0,0 +1,6 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=native -mno-apxf" } */ > + > +#ifdef __APX_F__ > +# error APX_F should be disabled > +#endif > -- > 2.31.1 > -- BR, Hongtao
Re: [PATCH] i386, testsuite: Fix non-Unicode character
On Mon, Jul 15, 2024 at 7:24 PM Paul-Antoine Arras wrote: > > This trivially fixes an incorrectly encoded character in the DejaGnu > scan pattern. > > OK for trunk? Ok. > -- > PA -- BR, Hongtao
Re: [PATCH] i386: extend trunc{128}2{16,32,64}'s scope.
On Mon, Jul 15, 2024 at 1:39 PM Hu, Lin1 wrote: > > Hi, all > > Based on actual usage, trunc{128}2{16,32,64} use some instructions from > sse/sse3, so extend their scope to extend the scope of optimization. > > Bootstraped and regtest on x86-64-linux-gnu, OK for trunk? Ok. > > BRs, > Lin > > gcc/ChangeLog: > > PR target/107432 > * config/i386/sse.md > (PMOV_SRC_MODE_3_AVX2): Add TARGET_AVX2 for V4DI and V8SI. > (PMOV_SRC_MODE_4): Add TARGET_AVX2 for V4DI. > (trunc2): Change constraint from TARGET_AVX2 > to > TARGET_SSSE3. > (trunc2): Ditto. > (truncv2div2si2): Change constraint from TARGET_AVX2 to TARGET_SSE. > > gcc/testsuite/ChangeLog: > > PR target/107432 > * gcc.target/i386/pr107432-10.c: New test. > --- > gcc/config/i386/sse.md | 11 +++--- > gcc/testsuite/gcc.target/i386/pr107432-10.c | 41 + > 2 files changed, 47 insertions(+), 5 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-10.c > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index b3b4697924b..72f3c7df297 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -15000,7 +15000,8 @@ (define_expand > "_2_mask_store" >"TARGET_AVX512VL") > > (define_mode_iterator PMOV_SRC_MODE_3 [V4DI V2DI V8SI V4SI (V8HI > "TARGET_AVX512BW")]) > -(define_mode_iterator PMOV_SRC_MODE_3_AVX2 [V4DI V2DI V8SI V4SI V8HI]) > +(define_mode_iterator PMOV_SRC_MODE_3_AVX2 > + [(V4DI "TARGET_AVX2") V2DI (V8SI "TARGET_AVX2") V4SI V8HI]) > (define_mode_attr pmov_dst_3_lower >[(V4DI "v4qi") (V2DI "v2qi") (V8SI "v8qi") (V4SI "v4qi") (V8HI "v8qi")]) > (define_mode_attr pmov_dst_3 > @@ -15014,7 +15015,7 @@ (define_expand "trunc2" >[(set (match_operand: 0 "register_operand") > (truncate: > (match_operand:PMOV_SRC_MODE_3_AVX2 1 "register_operand")))] > - "TARGET_AVX2" > + "TARGET_SSSE3" > { >if (TARGET_AVX512VL >&& (mode != V8HImode || TARGET_AVX512BW)) > @@ -15390,7 +15391,7 @@ (define_insn_and_split > "avx512vl_v8qi2_mask_store_2" > (match_dup 2)))] >"operands[0] = adjust_address_nv 
(operands[0], V8QImode, 0);") > > -(define_mode_iterator PMOV_SRC_MODE_4 [V4DI V2DI V4SI]) > +(define_mode_iterator PMOV_SRC_MODE_4 [(V4DI "TARGET_AVX2") V2DI V4SI]) > (define_mode_attr pmov_dst_4 >[(V4DI "V4HI") (V2DI "V2HI") (V4SI "V4HI")]) > (define_mode_attr pmov_dst_4_lower > @@ -15404,7 +15405,7 @@ (define_expand "trunc2" >[(set (match_operand: 0 "register_operand") > (truncate: > (match_operand:PMOV_SRC_MODE_4 1 "register_operand")))] > - "TARGET_AVX2" > + "TARGET_SSSE3" > { >if (TARGET_AVX512VL) > { > @@ -15659,7 +15660,7 @@ (define_expand "truncv2div2si2" >[(set (match_operand:V2SI 0 "register_operand") > (truncate:V2SI > (match_operand:V2DI 1 "register_operand")))] > - "TARGET_AVX2" > + "TARGET_SSE" > { >if (TARGET_AVX512VL) > { > diff --git a/gcc/testsuite/gcc.target/i386/pr107432-10.c > b/gcc/testsuite/gcc.target/i386/pr107432-10.c > new file mode 100644 > index 000..57edf7cfc78 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr107432-10.c > @@ -0,0 +1,41 @@ > +/* { dg-do compile } */ > +/* { dg-options "-march=x86-64-v2 -O2" } */ > +/* { dg-final { scan-assembler-times "shufps" 1 } } */ > +/* { dg-final { scan-assembler-times "pshufb" 5 } } */ > + > +#include > + > +typedef short __v2hi __attribute__ ((__vector_size__ (4))); > +typedef char __v2qi __attribute__ ((__vector_size__ (2))); > +typedef char __v4qi __attribute__ ((__vector_size__ (4))); > +typedef char __v8qi __attribute__ ((__vector_size__ (8))); > + > +__v2si mm_cvtepi64_epi32_builtin_convertvector(__v2di a) > +{ > + return __builtin_convertvector((__v2di)a, __v2si); > +} > + > +__v2hi mm_cvtepi64_epi16_builtin_convertvector(__m128i a) > +{ > + return __builtin_convertvector((__v2di)a, __v2hi); > +} > + > +__v4hi mm_cvtepi32_epi16_builtin_convertvector(__m128i a) > +{ > + return __builtin_convertvector((__v4si)a, __v4hi); > +} > + > +__v2qi mm_cvtepi64_epi8_builtin_convertvector(__m128i a) > +{ > + return __builtin_convertvector((__v2di)a, __v2qi); > +} > + > +__v4qi 
mm_cvtepi32_epi8_builtin_convertvector(__m128i a) > +{ > + return __builtin_convertvector((__v4si)a, __v4qi); > +} > + > +__v8qi mm_cvtepi16_epi8_builtin_convertvector(__m128i a) > +{ > + return __builtin_convertvector((__v8hi)a, __v8qi); > +} > -- > 2.31.1 > -- BR, Hongtao
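As context for the patterns relaxed above: `__builtin_convertvector` between integer vector types applies the scalar conversion lane-wise, so a narrowing conversion just keeps each lane's low bits — which is why byte-picking shuffles (shufps/pshufb, available since SSE/SSSE3) suffice and the TARGET_AVX2 guard was stricter than necessary. A small sketch using the GCC vector extension:

```c
#include <assert.h>

typedef long long __v2di __attribute__ ((vector_size (16)));
typedef int __v2si __attribute__ ((vector_size (8)));

/* Lane-wise (int) cast: each result lane is the low 32 bits of the
   corresponding source lane, so no arithmetic is needed -- a shuffle
   that picks the low halves implements it.  */
__v2si
trunc_v2di_v2si (__v2di a)
{
  return __builtin_convertvector (a, __v2si);
}
```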
Re: [i386] adjust flag_omit_frame_pointer in a single function [PR113719] (was: Re: [PATCH] [i386] restore recompute to override opts after change [PR113719])
On Thu, Jul 11, 2024 at 9:07 PM Alexandre Oliva wrote: > > On Jul 4, 2024, Alexandre Oliva wrote: > > > On Jul 3, 2024, Rainer Orth wrote: > > > Hmm, I wonder if leaf frame pointer has to do with that. > > It did, in a way. > > > > The first two patches for PR113719 have each regressed > gcc.dg/ipa/iinline-attr.c on a different target. The reason for this > instability is that there are competing flag_omit_frame_pointer > overriders on x86: > > - ix86_recompute_optlev_based_flags computes and sets a > -f[no-]omit-frame-pointer default depending on > USE_IX86_FRAME_POINTER and, in 32-bit mode, optimize_size > > - ix86_option_override_internal enables flag_omit_frame_pointer for > -momit-leaf-frame-pointer to take effect > > ix86_option_override[_internal] calls > ix86_recompute_optlev_based_flags before setting > flag_omit_frame_pointer. It is called during global process_options. > > But ix86_recompute_optlev_based_flags is also called by > parse_optimize_options, during attribute processing, and at that > point, ix86_option_override is not called, so the final overrider for > global options is not applied to the optimize attributes. If they > differ, the testcase fails. > > In order to fix this, we need to process all overriders of this option > whenever we process any of them. Since this setting is affected by > optimization options, it makes sense to compute it in > parse_optimize_options, rather than in process_options. > > Regstrapped on x86_64-linux-gnu. Also verified that the regression is > cured with a i686-solaris cross compiler. Ok to install? Ok. thanks. > > > for gcc/ChangeLog > > PR target/113719 > * config/i386/i386-options.cc (ix86_option_override_internal): > Move flag_omit_frame_pointer final overrider... > (ix86_recompute_optlev_based_flags): ... here. 
> --- > gcc/config/i386/i386-options.cc | 12 ++-- > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc > index 5824c0cb072eb..059ef3ae6ad44 100644 > --- a/gcc/config/i386/i386-options.cc > +++ b/gcc/config/i386/i386-options.cc > @@ -1911,6 +1911,12 @@ ix86_recompute_optlev_based_flags (struct gcc_options > *opts, > opts->x_flag_pcc_struct_return = DEFAULT_PCC_STRUCT_RETURN; > } > } > + > + /* Keep nonleaf frame pointers. */ > + if (opts->x_flag_omit_frame_pointer) > +opts->x_target_flags &= ~MASK_OMIT_LEAF_FRAME_POINTER; > + else if (TARGET_OMIT_LEAF_FRAME_POINTER_P (opts->x_target_flags)) > +opts->x_flag_omit_frame_pointer = 1; > } > > /* Implement part of TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE hook. */ > @@ -2590,12 +2596,6 @@ ix86_option_override_internal (bool main_args_p, > opts->x_target_flags |= MASK_NO_RED_ZONE; > } > > - /* Keep nonleaf frame pointers. */ > - if (opts->x_flag_omit_frame_pointer) > -opts->x_target_flags &= ~MASK_OMIT_LEAF_FRAME_POINTER; > - else if (TARGET_OMIT_LEAF_FRAME_POINTER_P (opts->x_target_flags)) > -opts->x_flag_omit_frame_pointer = 1; > - >/* If we're doing fast math, we don't care about comparison order > wrt NaNs. This lets us use a shorter comparison sequence. */ >if (opts->x_flag_finite_math_only) > > > -- > Alexandre Oliva, happy hackerhttps://FSFLA.org/blogs/lxo/ >Free Software Activist GNU Toolchain Engineer > More tolerance and less prejudice are key for inclusion and diversity > Excluding neuro-others for not behaving ""normal"" is *not* inclusive -- BR, Hongtao
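The instability being fixed can be pictured with an optimize attribute: attribute processing runs parse_optimize_options, and if that path derives a different flag_omit_frame_pointer default than the global ix86_option_override path, functions with and without the attribute end up with mismatched option sets, and IPA inlining tests like iinline-attr.c spuriously fail. A hypothetical sketch — the names and structure are illustrative, not the actual testcase:

```c
#include <assert.h>

static int callee (int x) { return 2 * x; }

/* parse_optimize_options processes this attribute; the
   flag_omit_frame_pointer default it computes must agree with the one
   the global override computed, or this function's option set differs
   from the plain one below and cross-inlining is refused.  */
__attribute__((optimize ("O2")))
int with_attr (int x) { return callee (x); }

int without_attr (int x) { return callee (x); }
```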
Re: [PATCH] [APX NF] Add a pass to convert legacy insn to NF insns
On Wed, Jul 10, 2024 at 2:46 PM Hongyu Wang wrote: > > Hi, > > For APX ccmp, current infrastructure will always generate cstore for > the ccmp flag user, like > > cmpe%rcx, %r8 > ccmpnel %rax, %rbx > seta%dil > add %rcx, %r9 > add %r9, %rdx > testb %dil, %dil > je .L2 > > For such case, the legacy add clobbers FLAGS_REG so there should have > extra cstore to avoid the flag be reset before using it. If the > instructions between flag producer and user are NF insns, the setcc/ > test sequence is not required. > > Add a pass to convert legacy flag clobber insns to their NF counterpart. > The convertion only happens when > 1. APX_NF enabled. > 2. For a BB, cstore was find, and there are insns between such cstore > and next explicit set insn to FLAGS_REG (test or cmp). > 3. All the insns between should have NF counterpart. > > The pass was added after rtl-ifcvt which eliminates some branch when > profitable, which could cause some flag-clobbering insn put between > cstore and jcc. > > Bootstrapped & regtested on x86_64-pc-linux-gnu and SDE. Also passed > spec2017 simulation run on SDE. > > Ok for trunk? Ok. > > gcc/ChangeLog: > > * config/i386/i386.md (has_nf): New define_attr, add to all > nf related patterns. > * config/i386/i386-features.cc (apx_nf_convert): New function > to convert Non-NF insns to their NF counterparts. > (class pass_apx_nf_convert): New pass class. > (make_pass_apx_nf_convert): New. > * config/i386/i386-passes.def: Add pass_apx_nf_convert after > rtl_ifcvt. > * config/i386/i386-protos.h (make_pass_apx_nf_convert): Declare. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/apx-nf-2.c: New test. 
> --- > gcc/config/i386/i386-features.cc | 163 +++ > gcc/config/i386/i386-passes.def | 1 + > gcc/config/i386/i386-protos.h| 1 + > gcc/config/i386/i386.md | 67 +- > gcc/testsuite/gcc.target/i386/apx-nf-2.c | 32 + > 5 files changed, 259 insertions(+), 5 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/apx-nf-2.c > > diff --git a/gcc/config/i386/i386-features.cc > b/gcc/config/i386/i386-features.cc > index fc224ed06b0..3da56ddbdcc 100644 > --- a/gcc/config/i386/i386-features.cc > +++ b/gcc/config/i386/i386-features.cc > @@ -3259,6 +3259,169 @@ make_pass_remove_partial_avx_dependency (gcc::context > *ctxt) >return new pass_remove_partial_avx_dependency (ctxt); > } > > +/* Convert legacy instructions that clobbers EFLAGS to APX_NF > + instructions when there are no flag set between a flag > + producer and user. */ > + > +static unsigned int > +ix86_apx_nf_convert (void) > +{ > + timevar_push (TV_MACH_DEP); > + > + basic_block bb; > + rtx_insn *insn; > + hash_map converting_map; > + auto_vec current_convert_list; > + > + bool converting_seq = false; > + rtx cc = gen_rtx_REG (CCmode, FLAGS_REG); > + > + FOR_EACH_BB_FN (bb, cfun) > +{ > + /* Reset conversion for each bb. */ > + converting_seq = false; > + FOR_BB_INSNS (bb, insn) > + { > + if (!NONDEBUG_INSN_P (insn)) > + continue; > + > + if (recog_memoized (insn) < 0) > + continue; > + > + /* Convert candidate insns after cstore, which should > +satisify the two conditions: > +1. Is not flag user or producer, only clobbers > +FLAGS_REG. > +2. Have corresponding nf pattern. */ > + > + rtx pat = PATTERN (insn); > + > + /* Starting convertion at first cstorecc. 
*/ > + rtx set = NULL_RTX; > + if (!converting_seq > + && (set = single_set (insn)) > + && ix86_comparison_operator (SET_SRC (set), VOIDmode) > + && reg_overlap_mentioned_p (cc, SET_SRC (set)) > + && !reg_overlap_mentioned_p (cc, SET_DEST (set))) > + { > + converting_seq = true; > + current_convert_list.truncate (0); > + } > + /* Terminate at the next explicit flag set. */ > + else if (reg_set_p (cc, pat) > + && GET_CODE (set_of (cc, pat)) != CLOBBER) > + converting_seq = false; > + > + if (!converting_seq) > + continue; > + > + if (get_attr_has_nf (insn) > + && GET_CODE (pat) == PARALLEL) > + { > + /* Record the insn to candidate map. */ > + current_convert_list.safe_push (insn); > + converting_map.put (insn, pat); > + } > + /* If the insn clobbers flags but has no nf_attr, > +revoke all previous candidates. */ > + else if (!get_attr_has_nf (insn) > + && reg_set_p (cc, pat) > + && GET_CODE (set_of (cc, pat)) == CLOBBER) > + { > + for (auto item : current_conv
Re: [PATCH] AVX512BF16: Do not allow permutation with vcvtne2ps2bf16 [PR115889]
On Mon, Jul 15, 2024 at 10:21 AM Hongyu Wang wrote: > > > Could you just git revert 6d0b7b69d143025f271d0041cfa29cf26e6c343b? > > We can still deal with BFmode permutation the same way as HFmode, so > the change in ix86_vectorize_vec_perm_const can be preserved. > > Hongtao Liu 于2024年7月15日周一 09:40写道: > > > > On Sat, Jul 13, 2024 at 3:44 PM Hongyu Wang wrote: > > > > > > Hi, > > > > > > According to the instruction spec of AVX512BF16, the convert from float > > > to BF16 is not a simple truncation. It has special handling for > > > denormal/nan, even for normal float it will add an extra bias according > > > to the least significant bit for bf number. This means we cannot use the > > > vcvtne2ps2bf16 for any bf16 vector shuffle. > > > The optimization introduced in r15-1368 adds a specific split to convert > > > HImode permutation with this instruction, so remove it and treat the > > > BFmode permutation same as HFmode. I see, patch LGTM. > > > > > > Bootstrapped & regtested on x86_64-pc-linux-gnu. OK for trunk? > > Could you just git revert 6d0b7b69d143025f271d0041cfa29cf26e6c343b? > > > > > > gcc/ChangeLog: > > > > > > PR target/115889 > > > * config/i386/predicates.md (vcvtne2ps2bf_parallel): Remove. > > > * config/i386/sse.md (hi_cvt_bf): Remove. > > > (HI_CVT_BF): Likewise. > > > (vpermt2_sepcial_bf16_shuffle_):Likewise. > > > > > > gcc/testsuite/ChangeLog: > > > > > > PR target/115889 > > > * gcc.target/i386/vpermt2-special-bf16-shufflue.c: Adjust option > > > and output scan. 
> > > --- > > > gcc/config/i386/predicates.md | 11 -- > > > gcc/config/i386/sse.md| 35 --- > > > .../i386/vpermt2-special-bf16-shufflue.c | 5 ++- > > > 3 files changed, 2 insertions(+), 49 deletions(-) > > > > > > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md > > > index a894847adaf..5d0bb1e0f54 100644 > > > --- a/gcc/config/i386/predicates.md > > > +++ b/gcc/config/i386/predicates.md > > > @@ -2327,14 +2327,3 @@ (define_predicate "apx_ndd_add_memory_operand" > > > > > >return true; > > > }) > > > - > > > -;; Check that each element is odd and incrementally increasing from 1 > > > -(define_predicate "vcvtne2ps2bf_parallel" > > > - (and (match_code "const_vector") > > > - (match_code "const_int" "a")) > > > -{ > > > - for (int i = 0; i < XVECLEN (op, 0); ++i) > > > -if (INTVAL (XVECEXP (op, 0, i)) != (2 * i + 1)) > > > - return false; > > > - return true; > > > -}) > > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > > > index b3b4697924b..c134494cd20 100644 > > > --- a/gcc/config/i386/sse.md > > > +++ b/gcc/config/i386/sse.md > > > @@ -31460,38 +31460,3 @@ (define_insn "vpdp_" > > >"TARGET_AVXVNNIINT16" > > >"vpdp\t{%3, %2, %0|%0, %2, %3}" > > > [(set_attr "prefix" "vex")]) > > > - > > > -(define_mode_attr hi_cvt_bf > > > - [(V8HI "v8bf") (V16HI "v16bf") (V32HI "v32bf")]) > > > - > > > -(define_mode_attr HI_CVT_BF > > > - [(V8HI "V8BF") (V16HI "V16BF") (V32HI "V32BF")]) > > > - > > > -(define_insn_and_split "vpermt2_sepcial_bf16_shuffle_" > > > - [(set (match_operand:VI2_AVX512F 0 "register_operand") > > > - (unspec:VI2_AVX512F > > > - [(match_operand:VI2_AVX512F 1 "vcvtne2ps2bf_parallel") > > > - (match_operand:VI2_AVX512F 2 "register_operand") > > > - (match_operand:VI2_AVX512F 3 "nonimmediate_operand")] > > > - UNSPEC_VPERMT2))] > > > - "TARGET_AVX512VL && TARGET_AVX512BF16 && ix86_pre_reload_split ()" > > > - "#" > > > - "&& 1" > > > - [(const_int 0)] > > > -{ > > > - rtx op0 = gen_reg_rtx (mode); > > > - operands[2] 
= lowpart_subreg (mode, > > > - force_reg (mode, operands[2]), > > > - mode); > > > - operands[3] = lowpart_subreg (mode, > > &g
Re: [PATCH] AVX512BF16: Do not allow permutation with vcvtne2ps2bf16 [PR115889]
On Sat, Jul 13, 2024 at 3:44 PM Hongyu Wang wrote: > > Hi, > > According to the instruction spec of AVX512BF16, the convert from float > to BF16 is not a simple truncation. It has special handling for > denormal/nan, even for normal float it will add an extra bias according > to the least significant bit for bf number. This means we cannot use the > vcvtne2ps2bf16 for any bf16 vector shuffle. > The optimization introduced in r15-1368 adds a specific split to convert > HImode permutation with this instruction, so remove it and treat the > BFmode permutation same as HFmode. > > Bootstrapped & regtested on x86_64-pc-linux-gnu. OK for trunk? Could you just git revert 6d0b7b69d143025f271d0041cfa29cf26e6c343b? > > gcc/ChangeLog: > > PR target/115889 > * config/i386/predicates.md (vcvtne2ps2bf_parallel): Remove. > * config/i386/sse.md (hi_cvt_bf): Remove. > (HI_CVT_BF): Likewise. > (vpermt2_sepcial_bf16_shuffle_):Likewise. > > gcc/testsuite/ChangeLog: > > PR target/115889 > * gcc.target/i386/vpermt2-special-bf16-shufflue.c: Adjust option > and output scan. 
> --- > gcc/config/i386/predicates.md | 11 -- > gcc/config/i386/sse.md| 35 --- > .../i386/vpermt2-special-bf16-shufflue.c | 5 ++- > 3 files changed, 2 insertions(+), 49 deletions(-) > > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md > index a894847adaf..5d0bb1e0f54 100644 > --- a/gcc/config/i386/predicates.md > +++ b/gcc/config/i386/predicates.md > @@ -2327,14 +2327,3 @@ (define_predicate "apx_ndd_add_memory_operand" > >return true; > }) > - > -;; Check that each element is odd and incrementally increasing from 1 > -(define_predicate "vcvtne2ps2bf_parallel" > - (and (match_code "const_vector") > - (match_code "const_int" "a")) > -{ > - for (int i = 0; i < XVECLEN (op, 0); ++i) > -if (INTVAL (XVECEXP (op, 0, i)) != (2 * i + 1)) > - return false; > - return true; > -}) > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md > index b3b4697924b..c134494cd20 100644 > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -31460,38 +31460,3 @@ (define_insn "vpdp_" >"TARGET_AVXVNNIINT16" >"vpdp\t{%3, %2, %0|%0, %2, %3}" > [(set_attr "prefix" "vex")]) > - > -(define_mode_attr hi_cvt_bf > - [(V8HI "v8bf") (V16HI "v16bf") (V32HI "v32bf")]) > - > -(define_mode_attr HI_CVT_BF > - [(V8HI "V8BF") (V16HI "V16BF") (V32HI "V32BF")]) > - > -(define_insn_and_split "vpermt2_sepcial_bf16_shuffle_" > - [(set (match_operand:VI2_AVX512F 0 "register_operand") > - (unspec:VI2_AVX512F > - [(match_operand:VI2_AVX512F 1 "vcvtne2ps2bf_parallel") > - (match_operand:VI2_AVX512F 2 "register_operand") > - (match_operand:VI2_AVX512F 3 "nonimmediate_operand")] > - UNSPEC_VPERMT2))] > - "TARGET_AVX512VL && TARGET_AVX512BF16 && ix86_pre_reload_split ()" > - "#" > - "&& 1" > - [(const_int 0)] > -{ > - rtx op0 = gen_reg_rtx (mode); > - operands[2] = lowpart_subreg (mode, > - force_reg (mode, operands[2]), > - mode); > - operands[3] = lowpart_subreg (mode, > - force_reg (mode, operands[3]), > - mode); > - > - emit_insn (gen_avx512f_cvtne2ps2bf16_(op0, > - 
> -                                        operands[3],
> -                                        operands[2]));
> -  emit_move_insn (operands[0], lowpart_subreg (mode, op0,
> -                                               mode));
> -  DONE;
> -}
> -  [(set_attr "mode" "")])
> diff --git a/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c b/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
> index 5c65f2a9884..4cbc85735de 100755
> --- a/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
> +++ b/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
> @@ -1,7 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -mavx512bf16 -mavx512vl" } */
> -/* { dg-final { scan-assembler-not "vpermi2b" } } */
> -/* { dg-final { scan-assembler-times "vcvtne2ps2bf16" 3 } } */
> +/* { dg-options "-O2 -mavx512vbmi -mavx512vl" } */
> +/* { dg-final { scan-assembler-times "vpermi2w" 3 } } */
>
>  typedef __bf16 v8bf __attribute__((vector_size(16)));
>  typedef __bf16 v16bf __attribute__((vector_size(32)));
> --
> 2.34.1
>
--
BR,
Hongtao
Re: [x86 SSE PATCH] Some AVX512 ternlog expansion refinements (take #2)
On Fri, Jul 12, 2024 at 5:33 AM Roger Sayle wrote:
>
>
> Hi Hongtao,
> Thanks for the review and pointing out the remaining uses of force_reg
> that I'd overlooked. Here's a revised version of the patch that incorporates
> your feedback. One minor change was that rather than using
> memory_operand, which as you point out also needs to include
> bcst_mem_operand, it's simpler to invert the logic to check for
> register_operand [i.e. the first operand must be a register].
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures. Ok for mainline?

Ok.

>
>
> 2024-07-11  Roger Sayle
>             Hongtao Liu
>
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_broadcast_from_constant):
> Use CONST_VECTOR_P instead of comparison against GET_CODE.
> (ix86_gen_bcst_mem): Likewise.
> (ix86_ternlog_leaf_p): Likewise.
> (ix86_ternlog_operand_p): ix86_ternlog_leaf_p is always true for
> vector_all_ones_operand.
> (ix86_expand_ternlog_bin_op): Use CONST_VECTOR_P instead of
> equality comparison against GET_CODE. Replace call to force_reg
> with gen_reg_rtx and emit_move_insn (for VEC_DUPLICATE broadcast).
> Check for !register_operand instead of memory_operand.
> Support CONST_VECTORs by calling force_const_mem.
> (ix86_expand_ternlog): Fix indentation whitespace.
> Allow ix86_ternlog_leaf_p as ix86_expand_ternlog_andnot's second
> operand. Use CONST_VECTOR_P instead of equality against GET_CODE.
> Use gen_reg_rtx and emit_move_insn for ~a, ~b and ~c cases.
>
>
> Thanks again,
> Roger
>
>
> > -----Original Message-----
> > From: Hongtao Liu
> > Sent: 08 July 2024 02:55
> > To: Roger Sayle
> > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak
> > Subject: Re: [x86 SSE PATCH] Some AVX512 ternlog expansion refinements.
> >
> > On Sun, Jul 7, 2024 at 5:00 PM Roger Sayle
> > wrote:
> > > Hi Hongtao,
> > > This should address concerns about the remaining use of force_reg.
> >
> > @@ -25793,15 +25792,20 @@ ix86_expand_ternlog_binop (enum rtx_code
> > code, machine_mode mode,
> >    if (GET_MODE (op1) != mode)
> >      op1 = gen_lowpart (mode, op1);
> >
> > -  if (GET_CODE (op0) == CONST_VECTOR)
> > +  if (CONST_VECTOR_P (op0))
> >      op0 = validize_mem (force_const_mem (mode, op0));
> > -  if (GET_CODE (op1) == CONST_VECTOR)
> > +  if (CONST_VECTOR_P (op1))
> >      op1 = validize_mem (force_const_mem (mode, op1));
> >
> >    if (memory_operand (op0, mode))
> >      {
> >        if (memory_operand (op1, mode))
> > -        op0 = force_reg (mode, op0);
> > +        {
> > +          /* We can't use force_reg (op0, mode).  */
> > +          rtx reg = gen_reg_rtx (mode);
> > +          emit_move_insn (reg, op0);
> > +          op0 = reg;
> > +        }
> > Shouldn't we handle bcst_mem_operand instead of
> > memory_operand (bcst_mem_operand is not a memory_operand)?
> > so maybe
> > if (memory_operand (op0, mode) || bcst_mem_operand (op0, mode))
> > if (memory_operand (op1, mode) || bcst_mem_operand (op1, mode))?
> >        else
> >          std::swap (op0, op1);
> >      }
> >
> > Also there's force_reg in below 3 cases, are there any restrictions to avoid
> > bcst_mem_operand into them?
> > case 0x0f: /* ~a */
> > case 0x33: /* ~b */
> > ...
> > if (!TARGET_64BIT && !register_operand (op2, mode))
> >    op2 = force_reg (mode, op2);
> >
> > --
> > BR,
> > Hongtao
--
BR,
Hongtao