[Bug target/112532] [14 Regression] ICE: in extract_insn, at recog.cc:2804 (unrecognizable insn: vec_duplicate:V4HI) with -O -msse4 since r14-5388-g2794d510b979be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112532 --- Comment #3 from Hongtao.liu --- mine.
gcc-bugs@gcc.gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112104 --- Comment #5 from Hongtao.liu --- (In reply to Andrew Pinski from comment #4) > Fixed via r14-5428-gfd1596f9962569afff6c9298a7c79686c6950bef . Note, my patch only handles a constant tripcount for XOR; it does not do the transformation when the tripcount is variable.
[Bug target/112374] [14 Regression] `--with-arch=skylake-avx512 --with-cpu=skylake-avx512` causes a comparison failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374 --- Comment #12 from Hongtao.liu --- > So the testsuite without bootstrap is really unchanged? We still have a Yes, no extra regression observed from the gcc testsuite (both w/ and w/o --with-arch=skylake-avx512 --with-cpu=skylake-avx512 in configure) except for the one reported in PR112361, which you have already fixed.
[Bug target/112374] [14 Regression] `--with-arch=skylake-avx512 --with-cpu=skylake-avx512` causes a comparison failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374 --- Comment #10 from Hongtao.liu --- The patch below can pass bootstrap with --with-arch=skylake-avx512 --with-cpu=skylake-avx512, but I didn't observe an obvious typo/bug in the removed pattern.

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 9eefe9ed45b..b6423037ad1 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17760,24 +17760,6 @@ (define_expand "3"
   DONE;
 })

-(define_expand "cond_"
-  [(set (match_operand:VI48_AVX512VL 0 "register_operand")
-	(vec_merge:VI48_AVX512VL
-	  (any_logic:VI48_AVX512VL
-	    (match_operand:VI48_AVX512VL 2 "vector_operand")
-	    (match_operand:VI48_AVX512VL 3 "vector_operand"))
-	  (match_operand:VI48_AVX512VL 4 "nonimm_or_0_operand")
-	  (match_operand: 1 "register_operand")))]
-  "TARGET_AVX512F"
-{
-  emit_insn (gen_3_mask (operands[0],
-			 operands[2],
-			 operands[3],
-			 operands[4],
-			 operands[1]));
-  DONE;
-})
-
 (define_expand "3_mask"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand")
	(vec_merge:VI48_AVX512VL
[Bug fortran/106402] half precision is not supported by gfortran (real*2).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106402 --- Comment #3 from Hongtao.liu --- (In reply to Thomas Koenig from comment #2) > It would make sense to have it, I guess. If somebody has access > to the relevant hardware, it could also be tested :-) x86 supports _Float16 operations with float instructions for TARGET_SSE2 and above, so for preliminary validation any processor with SSE2 should be enough.
[Bug libfortran/110966] should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966 --- Comment #6 from Hongtao.liu --- (In reply to Thomas Koenig from comment #5) > (In reply to Hongtao.liu from comment #4) > > (In reply to anlauf from comment #3) > > > (In reply to Hongtao.liu from comment #2) > > > > (In reply to Richard Biener from comment #1) > > > > > I think matmul is fine with avx512f or avx, so requiring/using only > > > > > the base > > > > > ISA level sounds fine to me. > > > > > > > > Could be potential miss-optimization. > > > > > > Do you mean a missed optimzation? > > > > > > Or really wrong code? > > > > a missed optimzation. > > Are there benchmarks which show that the code would indeed run > faster? Not yet, it's just better in theory. But considering that there might be some tweaks regarding x86-64-v4, I think it's best to leave it unchanged for the time being.
[Bug tree-optimization/112496] [13/14 Regression] ICE: in vectorizable_nonlinear_induction, at tree-vect-loop.cc with bit fields
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112496 --- Comment #3 from Hongtao.liu --- (In reply to Richard Biener from comment #2) > if (TREE_CODE (init_expr) == INTEGER_CST) > init_expr = fold_convert (TREE_TYPE (vectype), init_expr); > else > gcc_assert (tree_nop_conversion_p (TREE_TYPE (vectype), >TREE_TYPE (init_expr))); > > and init_expr is a 24 bit integer type while vectype has 32bit components. > > The "fix" is to bail out instead of asserting. Agree.
[Bug target/112374] [14 Regression] `--with-arch=skylake-avx512 --with-cpu=skylake-avx512` causes a comparison failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374 --- Comment #9 from Hongtao.liu --- When I remove all cond_ patterns, bootstrap passes. I'll continue to root-cause the exact pattern which causes the bootstrap failure.
[Bug target/112443] [12/13/14 Regression] Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443 --- Comment #7 from Hongtao.liu --- Should be fixed in GCC 14/GCC 13/GCC 12.
[Bug target/112443] [12/13/14 Regression] Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443 --- Comment #1 from Hongtao.liu --- The patch below can fix that; there's a typo in 2 splitters.

@@ -17082,7 +17082,7 @@ (define_insn_and_split "*avx2_pcmp3_4"
	     (match_dup 4))]
	    UNSPEC_BLENDV))]
 {
-  if (INTVAL (operands[5]) == 1)
+  if (INTVAL (operands[5]) == 5)
     std::swap (operands[1], operands[2]);
   operands[3] = gen_lowpart (mode, operands[3]);
 })
@@ -17112,7 +17112,7 @@ (define_insn_and_split "*avx2_pcmp3_5"
	     (match_dup 4))]
	    UNSPEC_BLENDV))]
 {
-  if (INTVAL (operands[5]) == 1)
+  if (INTVAL (operands[5]) == 5)
     std::swap (operands[1], operands[2]);
 })
[Bug bootstrap/112441] Comparing stages 2 and 3 Bootstrap comparison failure!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112441 Hongtao.liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from Hongtao.liu --- dup *** This bug has been marked as a duplicate of bug 112374 ***
[Bug target/112374] [14 Regression] `--with-arch=skylake-avx512 --with-cpu=skylake-avx512` causes a comparison failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #7 from Hongtao.liu --- *** Bug 112441 has been marked as a duplicate of this bug. ***
[Bug bootstrap/112441] New: Comparing stages 2 and 3 Bootstrap comparison failure!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112441 Bug ID: 112441 Summary: Comparing stages 2 and 3 Bootstrap comparison failure! Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

I hit a bootstrap comparison failure with r14-5243-g80f466aa1cce27. My GCC configure is --with-cpu=native --with-arch=native --disable-libsanitizer --enable-checking=yes,rtl,extra --enable-clocale and the machine is cascadelake.

make[9]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/32/libstdc++-v3'
make[8]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/32/libstdc++-v3'
make[7]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/32/libstdc++-v3'
make[6]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/libstdc++-v3'
make[5]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/libstdc++-v3'
make[4]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/libstdc++-v3'
make[3]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/libstdc++-v3'
make[2]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap'
[Bug target/112393] [14 Regression] ICE: in gen_reg_rtx, at emit-rtl.cc:1208 with -mavx5124fmaps -Wuninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112393 --- Comment #5 from Hongtao.liu --- Fixed.
[Bug target/112393] [14 Regression] ICE: in gen_reg_rtx, at emit-rtl.cc:1208 with -mavx5124fmaps -Wuninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112393 --- Comment #3 from Hongtao.liu --- Yes, it should return true if d->testing_p instead of generating RTL code.
[Bug rtl-optimization/108707] suboptimal allocation with same memory op for many different instructions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108707 Hongtao.liu changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #8 from Hongtao.liu --- Fixed in GCC14.
[Bug tree-optimization/102383] Missing optimization for PRE after enable O2 vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102383 --- Comment #5 from Hongtao.liu --- It's fixed in GCC12.1
[Bug target/105034] [11/12/13/14 regression] Suboptimal codegen for min/max with -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105034 Hongtao.liu changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #8 from Hongtao.liu --- Looks like it's fixed in latest trunk.
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 101956, which changed state. Bug 101956 Summary: Miss vectorization from v4hi to v4df https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101956 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/101956] Miss vectorization from v4hi to v4df
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101956 Hongtao.liu changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from Hongtao.liu --- Fixed by r14-2007-g6f19cf7526168f
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015 --- Comment #4 from Hongtao.liu --- > So here we have a reduction for MAX_EXPR, but there's 2 MAX_EXPR which can > be merge together with MAX_EXPR

Created PR 112324.
[Bug middle-end/112324] New: phiopt fail to recog if (b < 0) max = MAX(-b, max); else max = MAX (b, max) into max = MAX (ABS(b), max)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112324 Bug ID: 112324 Summary: phiopt fail to recog if (b < 0) max = MAX(-b, max); else max = MAX (b, max) into max = MAX (ABS(b), max) Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int foo (int n, int* a)
{
  int max = 0;
  for (int i = 0; i != n; i++)
    {
      int tmp = a[i];
      if (tmp < 0)
	max = MAX (-tmp, max);
      else
	max = MAX (tmp, max);
    }
  return max;
}

int foo1 (int n, int* a)
{
  int max = 0;
  for (int i = 0; i != n; i++)
    {
      int tmp = a[i];
      max = MAX ((tmp < 0 ? -tmp : tmp), max);
    }
  return max;
}

foo should be the same as foo1, but gcc fails to recognize ABS_EXPR in foo. It's from pr110015 (originally from source code in openjpeg).
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015 --- Comment #3 from Hongtao.liu ---

test.c:85:23: note: vect_is_simple_use: operand max_38 = PHI , type of def: unknown
test.c:85:23: missed: Unsupported pattern.
test.c:62:24: missed: not vectorized: unsupported use in stmt.
test.c:85:23: missed: unexpected pattern.
test.c:85:23: note: * Analysis failed with vector mode V8SI
test.c:85:23: note: * The result for vector mode V32QI would be the same
test.c:85:23: missed: couldn't vectorize loop
test.c:65:13: note: vectorized 0 loops in function.
Removing basic block 5
;; basic block 5, loop depth 2
;;  pred: 16
;; 43
# max_38 = PHI
# i_42 = PHI
# datap_44 = PHI
tmp_24 = *datap_44;
_35 = tmp_24 < 0;
_56 = (unsigned int) tmp_24;
_51 = -_56;
_1 = (int) _51;
_25 = MAX_EXPR <_1, max_38>;
_31 = _1 | -2147483648;
iftmp.0_27 = (unsigned int) _31;
.MASK_STORE (datap_44, 8B, _35, iftmp.0_27);
_26 = MAX_EXPR ;
max_5 = _35 ? _25 : _26;
i_29 = i_42 + 1;
datap_30 = datap_44 + 4;
if (w_22 > i_29)
  goto ; [89.00%]
else
  goto ; [11.00%]
;;  succ: 16

So here we have a reduction for MAX_EXPR, but there are 2 MAX_EXPRs which can be merged together with another MAX_EXPR.

> manually change the loop to below, then it can be vectorized.

for (j = 0; j < t1->h; ++j) {
    const OPJ_UINT32 w = t1->w;
    for (i = 0; i < w; ++i, ++datap) {
        OPJ_INT32 tmp = *datap;
        if (tmp < 0) {
            OPJ_UINT32 tmp_unsigned;
            tmp_unsigned = opj_to_smr(tmp);
            memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
            tmp = -tmp;
        }
        max = opj_int_max(max, tmp);
    }
}

Maybe it's related to phiopt?
[Bug target/112276] [14 Regression] wrong code with -O2 -msse4.2 since r14-4964-g7eed861e8ca3f5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112276 --- Comment #8 from Hongtao.liu --- Fixed.
gcc-bugs@gcc.gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112104 --- Comment #3 from Hongtao.liu --- We already have analyze_and_compute_bitop_with_inv_effect, but it only works when inv is an SSA_NAME, it should be extended to constant.
[Bug target/112276] [14 Regression] wrong code with -O2 -msse4.2 since r14-4964-g7eed861e8ca3f5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112276 --- Comment #4 from Hongtao.liu ---

-(define_split
-  [(set (match_operand:V2HI 0 "register_operand")
-	(eq:V2HI
-	  (eq:V2HI
-	    (us_minus:V2HI
-	      (match_operand:V2HI 1 "register_operand")
-	      (match_operand:V2HI 2 "register_operand"))
-	    (match_operand:V2HI 3 "const0_operand"))
-	  (match_operand:V2HI 4 "const0_operand")))]
-  "TARGET_SSE4_1"
-  [(set (match_dup 0)
-	(umin:V2HI (match_dup 1) (match_dup 2)))
-   (set (match_dup 0)
-	(eq:V2HI (match_dup 0) (match_dup 2)))])

The splitter is wrong when op1 == op2 (the original pattern returns 0; after the split, it returns 1), so remove the splitter.
[Bug target/112276] [14 Regression] wrong code with -O2 -msse4.2 since r14-4964-g7eed861e8ca3f5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112276 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #3 from Hongtao.liu --- Mine, I'll take a look.
[Bug tree-optimization/111972] [14 regression] missed vectorization for bool a = j != 1; j = (long int)a;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111972 --- Comment #7 from Hongtao.liu --- (In reply to Andrew Pinski from comment #3) > First off does this even make sense to vectorize but rather do some kind of > scalar reduction with respect to j = j^1 here . Filed PR 112104 for that. > > Basically vectorizing this loop is a waste compared to that. Yes, it's always zero; it would be nice if the middle end could optimize the whole loop away. So for this PR, it's more about the misoptimization of the redundant loop (it would be better to finalize the induction variable with a simple assignment) than about vectorization.
[Bug tree-optimization/111972] [14 regression] missed vectorization for bool a = j != 1; j = (long int)a;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111972 --- Comment #6 from Hongtao.liu --- (In reply to Andrew Pinski from comment #5) > Oh this is the original code: > https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/whets.c > Yes, it's from unixbench.
[Bug tree-optimization/111833] [13/14 Regression] GCC: 14: hangs on a simple for loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111833 --- Comment #5 from Hongtao.liu --- It's the same issue as PR111820, thus should be fixed.
[Bug tree-optimization/111820] [13 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #15 from Hongtao.liu --- (In reply to Richard Biener from comment #13) > (In reply to Hongtao.liu from comment #12) > > Fixed in GCC14, not sure if we want to backport the patch. > > If so, the patch needs to be adjusted since GCC13 doesn't support auto_mpz. > > Yes, we want to backport. Also fixed in GCC13.
[Bug tree-optimization/111972] [14 regression] missed vectorization for bool a = j != 1; j = (long int)a;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111972 Hongtao.liu changed: What|Removed |Added CC||pinskia at gcc dot gnu.org Component|middle-end |tree-optimization --- Comment #1 from Hongtao.liu --- The phiopt change is caused by r14-338-g1dd154f6407658d46faa4d21bfec04fc2551506a
[Bug middle-end/111972] New: [14 regression] missed vectorization for bool a = j != 1; j = (long int)a;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111972 Bug ID: 111972 Summary: [14 regression] missed vectorization for bool a = j != 1; j = (long int)a; Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

cat test.c

double foo()
{
  long n3 = 345, xtra = 7270;
  long i,ix;
  long j;
  double Check;
  /* Section 3, Conditional jumps */
  j = 0;
  {
    for (ix=0; ix<xtra; ix++)
      {
	for(i=0; i<n3; i++)
	  {
	    if(j==1) j = 2;
	    else j = 3;
	    if(j>2) j = 0;
	    else j = 1;
	    if(j<1) j = 1;
	    else j = 0;
	  }
      }
  }
  Check = Check + (double)j;
  return Check;
}

The difference between the GCC 13 dump and the GCC 14 dump: in GCC 13 we have

   [local count: 1063004411]:
  # i_16 = PHI
  # j_18 = PHI <_7(8), j_21(5)>
  # ivtmp_15 = PHI
  _7 = j_18 ^ 1;
  i_13 = i_16 + 1;
  ivtmp_6 = ivtmp_15 - 1;
  if (ivtmp_6 != 0)
    goto ; [99.00%]
  else
    goto ; [1.00%]

while in GCC 14 we have

   [local count: 1063004410]:
  # i_17 = PHI
  # j_19 = PHI <_14(8), j_22(5)>
  # ivtmp_16 = PHI
  _9 = j_19 != 1;
  _14 = (long int) _9;
  i_13 = i_17 + 1;
  ivtmp_15 = ivtmp_16 - 1;
  if (ivtmp_15 != 0)
    goto ; [98.99%]
  else
    goto ; [1.01%]

The vectorizer can handle

  _7 = j_18 ^ 1;

but not

  _9 = j_19 != 1;
  _14 = (long int) _9;

../test.C:11:18: note: vect_is_simple_use: operand j_19 != 1, type of def: internal
../test.C:11:18: note: mark relevant 2, live 0: _9 = j_19 != 1;
../test.C:11:18: note: worklist: examine stmt: _9 = j_19 != 1;
../test.C:11:18: note: vect_is_simple_use: operand j_19 = PHI <_14(8), j_22(5)>, type of def: unknown
../test.C:11:18: missed: Unsupported pattern.
../test.C:15:6: missed: not vectorized: unsupported use in stmt.
../test.C:11:18: missed: unexpected pattern.

The difference comes from phiopt2.
[Bug target/111874] Missed mask_fold_left_plus with AVX512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874 --- Comment #3 from Hongtao.liu --- > For the case of conditional (or loop masked) fold-left reductions the scalar > fallback isn't implemented. But AVX512 has vpcompress that could be used > to implement a more efficient sequence for a masked fold-left, possibly > using a loop and population count of the mask. There are extra kmov + vpcompress + popcnt instructions; I'm afraid the performance could be worse than the scalar version.
[Bug target/111889] [14 Regression] 128/256 intrins could not be used with only specifying "no-evex512, avx512vl" in function attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111889 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #4 from Hongtao.liu --- Maybe we should disable "no-evex512" for the target attribute and only support it as a command-line option.
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #12 from Hongtao.liu --- Fixed in GCC14, not sure if we want to backport the patch. If so, the patch needs to be adjusted since GCC13 doesn't support auto_mpz.
[Bug target/111874] Missed mask_fold_left_plus with AVX512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874 --- Comment #1 from Hongtao.liu --- For integer, we have _mm512_mask_reduce_add_epi32 defined as

extern __inline int
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
{
  __A = _mm512_maskz_mov_epi32 (__U, __A);
  __MM512_REDUCE_OP (+);
}

#undef __MM512_REDUCE_OP
#define __MM512_REDUCE_OP(op) \
  __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1);	\
  __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0);	\
  __m256i __T3 = (__m256i) (__T1 op __T2);			\
  __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1);	\
  __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0);	\
  __v4si __T6 = __T4 op __T5;					\
  __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 }); \
  __v4si __T8 = __T6 op __T7;					\
  return __T8[0] op __T8[1]

There's a corresponding floating point version, but it doesn't do in-order adds.
[Bug tree-optimization/111859] 521.wrf_r build failure with -O2 -march=cascadelake --param vect-partial-vector-usage=2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111859 --- Comment #1 from Hongtao.liu --- Could be reproduced with:

tar zxvf 521.tar.gz
cd 521
gfortran module_advect_em.fppizedi.f90 -S -O2 -march=cascadelake --param vect-partial-vector-usage=2 -std=legacy -fconvert=big-endian
[Bug tree-optimization/111859] New: 521.wrf_r build failure with -O2 -march=cascadelake --param vect-partial-vector-usage=2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111859 Bug ID: 111859 Summary: 521.wrf_r build failure with -O2 -march=cascadelake --param vect-partial-vector-usage=2 Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: --- Target: x86_64-*-* i?86-*-* Created attachment 56136 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56136&action=edit reproduce source code

internal compiler error: in get_vectype_for_scalar_type, at tree-vect-stmts.cc:13153
0xe07a6d get_vectype_for_scalar_type(vec_info*, tree_node*, unsigned int)
	../gcc/tree-vect-stmts.cc:13153
0x277afe1 get_mask_type_for_scalar_type(vec_info*, tree_node*, unsigned int)
	../gcc/tree-vect-stmts.cc:13223
0x277afe1 vect_check_scalar_mask
	../gcc/tree-vect-stmts.cc:2450
0x277b584 vectorizable_call
	../gcc/tree-vect-stmts.cc:3480
0x278ceaf vect_analyze_stmt(vec_info*, _stmt_vec_info*, bool*, _slp_tree*, _slp_instance*, vec*)
	../gcc/tree-vect-stmts.cc:12785
0x18b0430 vect_slp_analyze_node_operations_1
	../gcc/tree-vect-slp.cc:6066
0x18b0430 vect_slp_analyze_node_operations
	../gcc/tree-vect-slp.cc:6265
0x18b0364 vect_slp_analyze_node_operations
	../gcc/tree-vect-slp.cc:6244
0x18b20fb vect_slp_analyze_operations(vec_info*)
	../gcc/tree-vect-slp.cc:6516
0x18b8792 vect_slp_analyze_bb_1
	../gcc/tree-vect-slp.cc:7520
0x18b8792 vect_slp_region
	../gcc/tree-vect-slp.cc:7567
0x18ba7e9 vect_slp_bbs
	../gcc/tree-vect-slp.cc:7775
0x18bab5b vect_slp_function(function*)
	../gcc/tree-vect-slp.cc:7854
0x18c4ee1 execute
	../gcc/tree-vectorizer.cc:1529
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #9 from Hongtao.liu --- > But we end up here with niters_skip being INTEGER_CST and .. > > > 1421 || (!vect_use_loop_mask_for_alignment_p (loop_vinfo) > > possibly vect_use_loop_mask_for_alignment_p. Note > LOOP_VINFO_PEELING_FOR_ALIGNMENT < 0 simply means the amount of > peeling is unknown. > > But I wonder how we run into this on x86 without enabling > loop masking ... > > > 1422 && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)) > > 1423{ > > 1424 if (dump_enabled_p ()) > > 1425dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > 1426 "Peeling for alignement is not supported" > > 1427 " for nonlinear induction when niters_skip" > > 1428 " is not constant.\n"); > > 1429 return false; > > 1430}

Can you point out where it's assigned as negative? I see LOOP_VINFO_MASK_SKIP_NITERS is only assigned in vect_prepare_for_masked_peels: when LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0 it's assigned as vf - npeel (will npeel > vf?), otherwise it's assigned in get_misalign_in_elems and should be positive.

  HOST_WIDE_INT elem_size
    = int_cst_value (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
  tree elem_size_log = build_int_cst (type, exact_log2 (elem_size));

  /* Create:  misalign_in_bytes = addr & (target_align - 1).  */
  tree int_start_addr = fold_convert (type, start_addr);
  tree misalign_in_bytes = fold_build2 (BIT_AND_EXPR, type, int_start_addr,
					target_align_minus_1);

  /* Create:  misalign_in_elems = misalign_in_bytes / element_size.  */
  tree misalign_in_elems = fold_build2 (RSHIFT_EXPR, type, misalign_in_bytes,
					elem_size_log);

  return misalign_in_elems;

void
vect_prepare_for_masked_peels (loop_vec_info loop_vinfo)
{
  tree misalign_in_elems;
  tree type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));

  gcc_assert (vect_use_loop_mask_for_alignment_p (loop_vinfo));

  /* From the information recorded in LOOP_VINFO get the number of iterations
     that need to be skipped via masking.  */
  if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0)
    {
      poly_int64 misalign = (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
			     - LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo));
      misalign_in_elems = build_int_cst (type, misalign);
    }
  else
    {
      gimple_seq seq1 = NULL, seq2 = NULL;
      misalign_in_elems = get_misalign_in_elems (&seq1, loop_vinfo);
      misalign_in_elems = fold_convert (type, misalign_in_elems);
      misalign_in_elems = force_gimple_operand (misalign_in_elems,
						&seq2, true, NULL_TREE);
      gimple_seq_add_seq (&seq1, seq2);
      if (seq1)
	{
	  edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq1);
	  gcc_assert (!new_bb);
	}
    }

  if (dump_enabled_p ())
    dump_printf_loc (MSG_NOTE, vect_location,
		     "misalignment for fully-masked loop: %T\n",
		     misalign_in_elems);

  LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo) = misalign_in_elems;

  vect_update_inits_of_drs (loop_vinfo, misalign_in_elems, MINUS_EXPR);
}
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #7 from Hongtao.liu --- (In reply to rguent...@suse.de from comment #6) > On Mon, 16 Oct 2023, crazylht at gmail dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 > > > > --- Comment #5 from Hongtao.liu --- > > (In reply to Richard Biener from comment #3) > > > for (unsigned i = 0; i != skipn - 1; i++) > > > begin = wi::mul (begin, wi::to_wide (step_expr)); > > > > > > (gdb) p skipn > > > $5 = 4294967292 > > > > > > niters is 4294967292 in vect_update_ivs_after_vectorizer. Maybe the loop > > > should terminate when begin is zero. But I wonder why we pass in 'niters' > > Here, it want to calculate begin * pow (step_expr, skipn), yes we can just > > skip > > the loop when begin is 0. > > I mean terminate it when the multiplication overflowed to zero.

For pow (3, skipn), it will never overflow to zero. To solve this problem once and for all, I'm leaning towards setting a threshold in vect_can_peel_nonlinear_iv_p for vect_step_op_mul: if step_expr is not exact_log2 () and niter > TYPE_PRECISION (step_expr), we give up on vectorization.

> As for the MASK_ thing the skip is to be interpreted negative (we > should either not use a 'tree' here or make it have the correct type > maybe). Can we even handle this here? It would need to be > a division, no? > > So I think we need to disable non-linear IV or masked peeling for > niter/aligment? But I wonder how we run into this with plain -O3.

I think we already disabled a negative niters_skip in vect_can_peel_nonlinear_iv_p:

1416  /* Also doens't support peel for neg when niter is variable.
1417     ???  generate something like niter_expr & 1 ? init_expr : -init_expr?  */
1418  niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
1419  if ((niters_skip != NULL_TREE
1420       && TREE_CODE (niters_skip) != INTEGER_CST)
1421      || (!vect_use_loop_mask_for_alignment_p (loop_vinfo)
1422	  && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0))
1423    {
1424      if (dump_enabled_p ())
1425	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
1426			 "Peeling for alignement is not supported"
1427			 " for nonlinear induction when niters_skip"
1428			 " is not constant.\n");
1429      return false;
1430    }
[Bug target/111829] Redundant register moves inside the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111829 --- Comment #4 from Hongtao.liu --- (In reply to Richard Biener from comment #2) > You sink the conversion, so it would be PRE on the reverse graph. The > transform doesn't really fit a particular pass I think. The conversions also need to be hoisted if the initial variable is not the constant v2di{0, 0}/v4si{0, 0, 0, 0}.
[Bug target/111829] Redundant register moves inside the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111829 --- Comment #3 from Hongtao.liu --- (In reply to Richard Biener from comment #2) > You sink the conversion, so it would be PRE on the reverse graph. The > transform doesn't really fit a particular pass I think. > > Why does the problem persist in RTL?

Normally, combine would eliminate the redundant move by combining the subreg into the pattern, like:

(insn 19 17 21 3 (set (subreg:V4SI (reg/v:V2DI 103 [ vsum ]) 0)
        (unspec:V4SI [
                (subreg:V4SI (reg/v:V2DI 103 [ vsum ]) 0)
                (reg:V4SI 123 [ MEM[(__m128i * {ref-all})_52] ])
                (reg:V4SI 124)
            ] UNSPEC_VPDPBUSD)) "test.c":9:16 discrim 1 9182 {vpdpbusd_v4si}

But in this case, before combine, cse1/fwprop propagate the subreg (insn 21) from the inner loop to the outside (insn 28); since there's still a use of (reg:V4SI 121), combine fails to eliminate the redundant move of the subreg.

--loop_begin--
...
(insn 19 18 20 3 (set (reg:V4SI 121)
        (unspec:V4SI [
                (reg:V4SI 122 [ vsum ])
                (reg:V4SI 123 [ MEM[(__m128i * {ref-all})_52] ])
                (reg:V4SI 124)
            ] UNSPEC_VPDPBUSD)) "test.c":9:16 discrim 1 9182 {vpdpbusd_v4si}
     (expr_list:REG_DEAD (reg:V4SI 125)
        (expr_list:REG_DEAD (reg:V4SI 123 [ MEM[(__m128i * {ref-all})_52] ])
            (expr_list:REG_DEAD (reg:V4SI 122 [ vsum ])
                (nil)))))
(insn 20 19 21 3 (set (reg:V4SI 102 [ _11 ])
        (reg:V4SI 121)) "test.c":9:16 discrim 1 1906 {movv4si_internal}
     (expr_list:REG_DEAD (reg:V4SI 121)
        (nil)))
(insn 21 20 22 3 (set (reg/v:V2DI 103 [ vsum ])
        (subreg:V2DI (reg:V4SI 121) 0)) "test.c":9:16 discrim 2 1909 {movv2di_internal}
     (nil))
...
--loop_end--

(note 27 26 28 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 28 27 29 4 (set (mem:V2DI (reg/v/f:DI 119 [ pc ]) [0 *pc_22(D)+0 S16 A128])
        (subreg:V2DI (reg:V4SI 121) 0)) "test.c":11:9 1909 {movv2di_internal}
     (expr_list:REG_DEAD (reg/v/f:DI 119 [ pc ])   <--- propagated from insn 21
        (expr_list:REG_DEAD (reg/v:V2DI 103 [ vsum ])
            (nil))))
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #5 from Hongtao.liu --- (In reply to Richard Biener from comment #3)
> for (unsigned i = 0; i != skipn - 1; i++)
>   begin = wi::mul (begin, wi::to_wide (step_expr));
>
> (gdb) p skipn
> $5 = 4294967292
>
> niters is 4294967292 in vect_update_ivs_after_vectorizer. Maybe the loop
> should terminate when begin is zero. But I wonder why we pass in 'niters'
Here it wants to compute begin * pow (step_expr, skipn); yes, we can just skip the loop when begin is 0, and also optimize the loop to a shift when step_expr is a power of 2. For other cases, the loop is still needed.
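The two shortcuts proposed above (early exit for begin == 0, and a shift when step_expr is a power of 2) can be sketched in plain C over wrapping 64-bit arithmetic. This mirrors the quoted loop, which multiplies skipn - 1 times; iv_init_pow is an illustrative name, not GCC's, and the sketch assumes skipn >= 1 as in the report.

```c
#include <stdint.h>

/* Compute begin * step^(skipn - 1) in wrapping unsigned arithmetic,
   with the two shortcuts discussed in the comment.  */
static uint64_t
iv_init_pow (uint64_t begin, uint64_t step, uint64_t skipn)
{
  if (begin == 0)
    return 0;                   /* proposed early exit: 0 * anything == 0 */

  if (step != 0 && (step & (step - 1)) == 0)
    {
      /* step is a power of 2: step^(skipn-1) is 1 << (log2(step) * (skipn-1)),
         and a total shift >= 64 wraps the result to 0 mod 2^64.  */
      uint64_t shift = (uint64_t) __builtin_ctzll (step) * (skipn - 1);
      return shift >= 64 ? 0 : begin << shift;
    }

  /* Fallback: the O(skipn) loop is still needed for other step values.  */
  for (uint64_t i = 0; i != skipn - 1; i++)
    begin *= step;
  return begin;
}
```

With begin == 0 and the huge skipn from the report, the function returns immediately instead of iterating ~4 billion times.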
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #4 from Hongtao.liu ---
> niters is 4294967292 in vect_update_ivs_after_vectorizer. Maybe the loop
> should terminate when begin is zero. But I wonder why we pass in 'niters'
> and then name it 'skip_niters' ...
It's coming from here:

9448   niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
9449   /* If we are using the loop mask to "peel" for alignment then we need
9450      to adjust the start value here.  */
9451   if (niters_skip != NULL_TREE)
9452     init_expr = vect_peel_nonlinear_iv_init (&stmts, init_expr, niters_skip,
9453                                              step_expr, induction_type);
9454
[Bug target/111829] Redudant register moves inside the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111829 --- Comment #1 from Hongtao.liu ---

  ivtmp.23_31 = (unsigned long) b_24(D);
  ivtmp.24_46 = (unsigned long) pa_26(D);
  _50 = ivtmp.23_31 + 40;

  [local count: 1063004408]:
  # vsum_35 = PHI
  # ivtmp.23_14 = PHI
  # ivtmp.24_30 = PHI
  _47 = (void *) ivtmp.23_14;
  _4 = MEM[(int *)_47];
  _25 = {_4, _4, _4, _4};
  _48 = (void *) ivtmp.24_30;
  _7 = MEM[(__m128i * {ref-all})_48];
  _8 = VIEW_CONVERT_EXPR<__v4si>(_7);
  _9 = VIEW_CONVERT_EXPR<__v4si>(vsum_35);
  _27 = __builtin_ia32_vpdpbusd_v4si (_9, _8, _25);
  vsum_28 = VIEW_CONVERT_EXPR<__m128i>(_27);
  ivtmp.23_15 = ivtmp.23_14 + 4;
  ivtmp.24_45 = ivtmp.24_30 + 16;
  if (ivtmp.23_15 != _50)
    goto ; [98.99%]
  else
    goto ; [1.01%]

  [local count: 10737416]:
  *pc_19(D) = vsum_28;
  ivtmp.15_34 = (unsigned long) &vsum.0;
  _13 = ivtmp.15_34 + 16;

  [local count: 42949663]:
  # ssum_38 = PHI
  # ivtmp.15_33 = PHI

I'm curious if we can move the VIEW_CONVERT_EXPR outside of the loop as below:

  [local count: 1063004408]:
- # vsum_35 = PHI
+ # _9 = PHI <_27(3), { 0, 0, 0, 0 }(2)>
  # ivtmp.23_14 = PHI
  # ivtmp.24_30 = PHI
  _47 = (void *) ivtmp.23_14;
  _4 = MEM[(int *)_47];
  _25 = {_4, _4, _4, _4};
  _48 = (void *) ivtmp.24_30;
  _7 = MEM[(__m128i * {ref-all})_48];
  _8 = VIEW_CONVERT_EXPR<__v4si>(_7);
- _9 = VIEW_CONVERT_EXPR<__v4si>(vsum_35);
  _27 = __builtin_ia32_vpdpbusd_v4si (_9, _8, _25);
- vsum_28 = VIEW_CONVERT_EXPR<__m128i>(_27);
  ivtmp.23_15 = ivtmp.23_14 + 4;
  ivtmp.24_45 = ivtmp.24_30 + 16;
  if (ivtmp.23_15 != _50)
    goto ; [98.99%]
  else
    goto ; [1.01%]

  [local count: 10737416]:
+ vsum_28 = VIEW_CONVERT_EXPR<__m128i>(_27);
  *pc_19(D) = vsum_28;
  ivtmp.15_34 = (unsigned long) &vsum.0;
  _13 = ivtmp.15_34 + 16;

  [local count: 42949663]:
  # ssum_38 = PHI
  # ivtmp.15_33 = PHI

It looks like a lazy-code-motion optimization, but it is currently not handled by PRE.
[Bug target/111829] New: Redudant register moves inside the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111829 Bug ID: 111829 Summary: Redudant register moves inside the loop Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: --- Target: x86_64-*-* i?86-*-*

#include <immintrin.h>

int foo (__m128i* __restrict pa, int* b, __m128i* __restrict pc, int n)
{
    __m128i vsum = _mm_setzero_si128();
    for (int i = 0; i != 10; i++)
    {
        vsum = _mm_dpbusd_epi32 (vsum, pa[i], _mm_set1_epi32 (b[i]));
    }
    *pc = vsum;
    int ssum = 0;
    for (int i = 0; i != 4; i++)
        ssum += ((__v4si)vsum)[i];
    return ssum;
}

gcc -O2 -mavxvnni

foo(long long __vector(2)*, int*, long long __vector(2)*, int):
        leaq    40(%rsi), %rax
        vpxor   %xmm0, %xmm0, %xmm0
.L2:
        vmovdqa (%rdi), %xmm2
        vmovdqa %xmm0, %xmm1    --- redundant
        addq    $4, %rsi
        addq    $16, %rdi
        vpbroadcastd    -4(%rsi), %xmm3 {vex}
        vpdpbusd        %xmm3, %xmm2, %xmm1
        vmovdqa %xmm1, %xmm0    --- redundant
        cmpq    %rax, %rsi
        jne     .L2
        vmovdqa %xmm1, (%rdx)
        leaq    -24(%rsp), %rax
        leaq    -8(%rsp), %rcx
        xorl    %edx, %edx
.L3:
        vmovdqa %xmm0, -24(%rsp)
        addq    $4, %rax
        addl    -4(%rax), %edx
        cmpq    %rax, %rcx
        jne     .L3
        movl    %edx, %eax
        ret

it can be better with

foo(long long __vector(2)*, int*, long long __vector(2)*, int):
        leaq    40(%rsi), %rax
        vpxor   %xmm0, %xmm0, %xmm0
.L2:
        vmovdqa (%rdi), %xmm2
        addq    $4, %rsi
        addq    $16, %rdi
        vpbroadcastd    -4(%rsi), %xmm3 {vex}
        vpdpbusd        %xmm3, %xmm2, %xmm0
        cmpq    %rax, %rsi
        jne     .L2
        vmovdqa %xmm0, (%rdx)
        leaq    -24(%rsp), %rax
        leaq    -8(%rsp), %rcx
        xorl    %edx, %edx
.L3:
        vmovdqa %xmm0, -24(%rsp)
        addq    $4, %rax
        addl    -4(%rax), %edx
        cmpq    %rax, %rcx
        jne     .L3
        movl    %edx, %eax
        ret
[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768 --- Comment #10 from Hongtao.liu ---
> indeed (but I believe it did happen with Alder Lake already, by accident,
> with AVX512 on P-cores but not on E-cores).
AVX512 is physically fused off for Alder Lake P-cores; P-cores and E-cores share the same ISA level (AVX2).
[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768 --- Comment #4 from Hongtao.liu --- I checked Alder Lake's L1 cache size and it is indeed 48, while the L1 cache size in alderlake_cost is set to 32. But then again, many different platforms share the same cost table and may have different L1 cache sizes; from a micro-architecture tuning point of view, it doesn't make a difference. A separate cost table just because the L1 cache size differs is quite unnecessary (the size itself is just a parameter for software prefetching; it doesn't have to be the real hardware cache size).
[Bug target/111745] [14 Regression] ICE: in extract_insn, at recog.cc:2791 (unrecognizable insn) with -ffloat-store -mavx512fp16 -mavx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111745 --- Comment #3 from Hongtao.liu --- Fixed.
[Bug target/104610] memcmp () == 0 can be optimized better for avx512f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #22 from Hongtao.liu --- For 64-byte memory comparison

int compare (const char* s1, const char* s2)
{
    return __builtin_memcmp (s1, s2, 64) == 0;
}

We're generating

        vmovdqu (%rsi), %ymm0
        vpxorq  (%rdi), %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        jne     .L2
        vmovdqu 32(%rsi), %ymm0
        vpxorq  32(%rdi), %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        je      .L5
.L2:
        movl    $1, %eax
        xorl    $1, %eax
        vzeroupper
        ret

An alternative way is using vpcmpeq + kortest and checking the carry bit:

        vmovdqu64       (%rsi), %zmm0
        xorl    %eax, %eax
        vpcmpeqd        (%rdi), %zmm0, %k0
        kortestw        %k0, %k0
        setc    %al
        vzeroupper

Not sure if it's better or not.
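Both instruction sequences implement the same predicate, "are these 64 bytes identical". A portable C model of the vpxor+vptest strategy makes the semantics explicit (equal64 is an illustrative name, not from GCC or glibc):

```c
#include <stdint.h>
#include <string.h>

/* Model of the vpxor+vptest sequence: OR together the XOR of all
   64 bytes; the accumulator is zero iff the two blocks are equal.  */
static int
equal64 (const char *s1, const char *s2)
{
  uint64_t acc = 0;
  for (int i = 0; i < 64; i += 8)
    {
      uint64_t a, b;
      memcpy (&a, s1 + i, 8);   /* alias-safe unaligned loads */
      memcpy (&b, s2 + i, 8);
      acc |= a ^ b;
    }
  return acc == 0;
}
```

The vpcmpeq+kortest variant instead builds a per-element equality mask and checks that every mask bit is set; kortest raising the carry flag for an all-ones mask plays the role of the `acc == 0` test here.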
[Bug target/111745] [14 Regression] ICE: in extract_insn, at recog.cc:2791 (unrecognizable insn) with -ffloat-store -mavx512fp16 -mavx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111745 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #1 from Hongtao.liu --- Mine, I'll take a look.
[Bug libgcc/111731] [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731 --- Comment #2 from Hongtao.liu --- The original project is too complex for me to come up with a reduced reproducer; I can help with gdb if additional information is needed.
[Bug libgcc/111731] [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731 --- Comment #1 from Hongtao.liu --- GCC 11.3 is OK; GCC 13.2 and later have the issue. I didn't verify GCC 12.
[Bug libgcc/111731] New: [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731 Bug ID: 111731 Summary: [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291 Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libgcc Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

The issue is not solved by PR110956's fix. I did some debugging with gdb, and here are the logs:

The first time gdb stops at https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-fde.c#L143

│ 138      ob->next = unseen_objects;
│ 139      unseen_objects = ob;
│ 140
│ 141      __gthread_mutex_unlock (&object_mutex);
│ 142      #endif
│ >143   }

(gdb) frame
#0  __register_frame_info_bases (begin=0x7fffd551e000, ob=0x1e386d0, tbase=0x0, dbase=0x0)
    at ../../../libgcc/unwind-dw2-fde.c:143
(gdb) p registered_frames->root->entry_count
$31 = 2
(gdb) p registered_frames->root->content.entries[0]
$32 = {base = 140736772300800, size = 1, ob = 0x1e386d0}
(gdb) p registered_frames->root->content.entries[1]
$33 = {base = 140736772317184, size = 178483158, ob = 0x1e386d0}

The second time gdb stops at https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-fde.c#L143

│ 138      ob->next = unseen_objects;
│ 139      unseen_objects = ob;
│ 140
│ 141      __gthread_mutex_unlock (&object_mutex);
│ 142      #endif
│ >143   }

(gdb) frame
#0  __register_frame_info_bases (begin=0x7fffd409c000, ob=0x26b2e00, tbase=0x0, dbase=0x0)
    at ../../../libgcc/unwind-dw2-fde.c:143
(gdb) p registered_frames->root->entry_count
$34 = 4
(gdb) p registered_frames->root->content.entries[0]
$35 = {base = 140736750796800, size = 1, ob = 0x26b2e00}
(gdb) p registered_frames->root->content.entries[1]
$36 = {base = 140736750817280, size = 199987168, ob = 0x26b2e00}
(gdb) p registered_frames->root->content.entries[2]
$37 = {base = 140736772300800, size = 1, ob = 0x1e386d0}
(gdb) p registered_frames->root->content.entries[3]
$38 = {base = 140736772317184, size = 178483158, ob = 0x1e386d0}

The first time gdb stops
at an unexpected line, https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-btree.h#L829:

│ 825      unsigned slot = btree_node_find_leaf_slot (iter, base);
│ 826      if ((slot >= iter->entry_count) || (iter->content.entries[slot].base != base))
│ 827        {
│ 828          // Not found, this should never happen.
│ >829         btree_node_unlock_exclusive (iter);
│ 830          return NULL;
│ 831        }

(gdb) p slot
$26 = 1
(gdb) p iter->content.entries[slot]
$27 = {base = 140736750817280, size = 199987168, ob = 0x26e7900}
(gdb) p iter->content.entries[2]
$28 = {base = 140736772300800, size = 1, ob = 0x1e386d0}

We can see that when we try to remove the btree node of 0x7fffd551e000 (140736772300800), the return value of btree_node_find_leaf_slot is 1, but I think it should return 2. Both btree_insert and btree_remove call

// Find the position for a slot in a leaf node.
static unsigned
btree_node_find_leaf_slot (const struct btree_node *n, uintptr_type value)
{
  for (unsigned index = 0, ec = n->entry_count; index != ec; ++index)
    if (n->content.entries[index].base + n->content.entries[index].size > value)
      return index;
  return n->entry_count;
}

But

registered_frames->root->content.entries[1].base + registered_frames->root->content.entries[1].size > registered_frames->root->content.entries[2].base
registered_frames->root->content.entries[2].base + registered_frames->root->content.entries[2].size > registered_frames->root->content.entries[1].base

i.e. the two entries overlap, and that makes btree_node_find_leaf_slot return the wrong slot (at btree_insert it returns slot 1 for base1 and moves base2 to slot 2, but at btree_remove it still returns slot 1 because of the above logic). I'm not sure if this is the root cause.
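The mis-lookup described above can be reproduced in isolation with the exact entry values from the gdb log. This sketch mirrors the search logic of libgcc's btree_node_find_leaf_slot (struct and function names simplified, uint64_t instead of uintptr_type):

```c
#include <stdint.h>

struct leaf_entry { uint64_t base, size; };

/* Same search as btree_node_find_leaf_slot: return the first entry whose
   range end lies beyond VALUE.  With overlapping entries this can return
   an earlier slot than the one whose base equals VALUE.  */
static unsigned
find_leaf_slot (const struct leaf_entry *entries, unsigned entry_count,
                uint64_t value)
{
  for (unsigned index = 0; index != entry_count; ++index)
    if (entries[index].base + entries[index].size > value)
      return index;
  return entry_count;
}
```

With the logged entries, looking up the base stored in slot 2 returns slot 1, because slot 1's (base + size) already covers it; the remove path then sees entries[1].base != base and hits the "should never happen" branch.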
[Bug tree-optimization/111402] Loop distribution fail to optimize memmove for multiple consecutive moves within a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111402 --- Comment #2 from Hongtao.liu --- Adjusting the code in foo1 to use < n instead of != n, the issue remains.

void foo1 (v4di* __restrict a, v4di *b, int n)
{
    for (int i = 0; i < n; i+=2)
    {
        a[i] = b[i];
        a[i+1] = b[i+1];
    }
}
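For reference, what loop distribution would ideally emit for foo1 is a single memmove over the whole range, just as it already does for foo. A sketch (foo1_distributed is an illustrative name; it assumes n is non-negative and even, so the i and i+1 stores together copy exactly n elements):

```c
#include <string.h>

typedef long long v4di __attribute__((vector_size(32)));

/* Equivalent of foo1's loop as one library call (for even n >= 0);
   memmove is correct even for overlapping a and b.  */
void
foo1_distributed (v4di *__restrict a, v4di *b, int n)
{
  if (n > 0)
    memmove (a, b, (size_t) n * sizeof (v4di));
}
```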
[Bug middle-end/111402] New: Loop distribution fail to optimize memmove for multiple consecutive moves within a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111402 Bug ID: 111402 Summary: Loop distribution fail to optimize memmove for multiple consecutive moves within a loop Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

cat test.c

typedef long long v4di __attribute__((vector_size(32)));

void foo (v4di* __restrict a, v4di *b, int n)
{
    for (int i = 0; i != n; i++)
        a[i] = b[i];
}

void foo1 (v4di* __restrict a, v4di *b, int n)
{
    for (int i = 0; i != n; i+=2)
    {
        a[i] = b[i];
        a[i+1] = b[i+1];
    }
}

gcc -O2 -S test.c

GCC can optimize the loop in foo to memmove, but not the loop in foo1. This is from PR111354.
[Bug target/111354] [7/10/12 regression] The instructions of the DPDK demo program are different and run time increases.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111354 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #5 from Hongtao.liu ---

void
rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
    __m256i ymm0, ymm1, ymm2, ymm3;

    while (n >= 128) {
        ymm0 = _mm256_loadu_si256((const __m256i *)(const void *)
                                  ((const uint8_t *)src + 0 * 32));
        n -= 128;
        ymm1 = _mm256_loadu_si256((const __m256i *)(const void *)
                                  ((const uint8_t *)src + 1 * 32));
        ymm2 = _mm256_loadu_si256((const __m256i *)(const void *)
                                  ((const uint8_t *)src + 2 * 32));
        ymm3 = _mm256_loadu_si256((const __m256i *)(const void *)
                                  ((const uint8_t *)src + 3 * 32));
        src = (const uint8_t *)src + 128;
        _mm256_storeu_si256((__m256i *)(void *)((uint8_t *)dst + 0 * 32), ymm0);
        _mm256_storeu_si256((__m256i *)(void *)((uint8_t *)dst + 1 * 32), ymm1);
        _mm256_storeu_si256((__m256i *)(void *)((uint8_t *)dst + 2 * 32), ymm2);
        _mm256_storeu_si256((__m256i *)(void *)((uint8_t *)dst + 3 * 32), ymm3);
        dst = (uint8_t *)dst + 128;
    }
}

I'm curious whether we can distribute the above as a memmove (of course, the compiler needs to know the two arrays don't alias each other).
[Bug target/111306] [12,13] macro-fusion makes error on conjugate complex multiplication fp16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111306 --- Comment #8 from Hongtao.liu --- Fixed in GCC14.1 GCC13.3 GCC12.4
[Bug target/111335] fmaddpch seems not commutative for operands[1] and operands[2] due to precision loss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111335 Hongtao.liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #4 from Hongtao.liu --- Fixed in GCC14.1 GCC13.3 GCC12.4
[Bug target/111306] [12,13] macro-fusion makes error on conjugate complex multiplication fp16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111306 --- Comment #4 from Hongtao.liu --- A related PR111335 for fmaddcph; similar but not the same. PR111335 is due to a precision difference in complex _Float16 fma: fmaddcph a, b, c is not equal to fmaddcph b, a, c.
[Bug target/111335] New: fmaddpch seems not commutative for operands[1] and operands[2] due to precision loss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111335 Bug ID: 111335 Summary: fmaddpch seems not commutative for operands[1] and operands[2] due to precision loss Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

fmaddcph is complex _Float16 fma.

cat test.c

#include <immintrin.h>
#include <stdio.h>

void func(_Float16 a[], _Float16 b[], _Float16 c[]) {
    const __m128h r0 = _mm_loadu_ph(a);
    const __m128h r1 = _mm_loadu_ph(b);
    const __m128h r2 = _mm_loadu_ph(c);
    const __m128h mul = _mm_fmadd_pch(r0, r1, r2);
    printf("%f %f\n", (float)mul[0], (float)mul[1]);
}

int main() {
    _Float16 a[8] = {-0.7949218f16, +0.2739257f16};
    _Float16 b[8] = {+0.0010070f16, +0.0015659f16};
    _Float16 c[8] = {-0.0010366f16, -0.0018014f16};
    func(a, b, c);
    return 0;
}

g++ -O0 -march=sapphirerapids test.c, we get fmaddpch a, b, c, and the result is

-0.002266 -0.002769

g++ -O0 -march=sapphirerapids test.c, we get fmaddpch b, a, c, and the result is

-0.002266 -0.002771
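The sensitivity above comes from where intermediate rounding happens inside the complex multiply-add. The same effect can be shown in plain float, without AVX512-FP16 hardware: keeping a product exact before the final add (as a fused operation does) versus rounding the product first gives different answers. sum_fused/sum_unfused are illustrative names; this models rounding-order sensitivity in general, not vfmaddcph's exact dataflow.

```c
/* Model a fused operation: the product a*a is kept exact (double holds
   the float product exactly) and only the final sum is rounded.  */
static float
sum_fused (float a, float c)
{
  return (float) ((double) a * a + (double) c);
}

/* Unfused: the product is rounded to float before the add.  */
static float
sum_unfused (float a, float c)
{
  float p = a * a;
  return p + c;
}
```

With a = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24; the unfused path rounds it to 1 + 2^-11 (ties-to-even) and sums to exactly 0, while the fused path yields 2^-24.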
[Bug target/111306] [12,13] macro-fusion makes error on conjugate complex multiplication fp16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111306 --- Comment #3 from Hongtao.liu --- A patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629650.html
[Bug target/111333] Runtime failure for fcmulcph instrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111333 --- Comment #2 from Hongtao.liu --- The test has failed since GCC 12, when the pattern was added.
[Bug target/111333] Runtime failure for fcmulcph instrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111333 --- Comment #1 from Hongtao.liu --- fmulcph/fmaddcph is commutative for operands[1] and operands[2], but fcmulcph/fcmaddcph is not, since these are complex conjugate operations. The change below fixes the issue.

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 6d3ae8dea0c..833546c5228 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6480,6 +6480,14 @@ (define_int_attr complexpairopname
  [(UNSPEC_COMPLEX_FMA_PAIR "fmaddc")
   (UNSPEC_COMPLEX_FCMA_PAIR "fcmaddc")])

+(define_int_attr int_comm
+ [(UNSPEC_COMPLEX_FMA "%")
+  (UNSPEC_COMPLEX_FMA_PAIR "%")
+  (UNSPEC_COMPLEX_FCMA "")
+  (UNSPEC_COMPLEX_FCMA_PAIR "")
+  (UNSPEC_COMPLEX_FMUL "%")
+  (UNSPEC_COMPLEX_FCMUL "")])
+
 (define_int_attr conj_op
  [(UNSPEC_COMPLEX_FMA "")
   (UNSPEC_COMPLEX_FCMA "_conj")
@@ -6593,7 +6601,7 @@ (define_expand "cmla4"
 (define_insn "fma__"
   [(set (match_operand:VHF_AVX512VL 0 "register_operand" "=&v")
        (unspec:VHF_AVX512VL
-         [(match_operand:VHF_AVX512VL 1 "" "%v")
+         [(match_operand:VHF_AVX512VL 1 "" "v")
           (match_operand:VHF_AVX512VL 2 "" "")
           (match_operand:VHF_AVX512VL 3 "" "0")]
          UNSPEC_COMPLEX_F_C_MA))]
@@ -6658,7 +,7 @@ (define_insn_and_split "fma___fma_zero"
 (define_insn "fma___pair"
  [(set (match_operand:VF1_AVX512VL 0 "register_operand" "=&v")
        (unspec:VF1_AVX512VL
-         [(match_operand:VF1_AVX512VL 1 "vector_operand" "%v")
+         [(match_operand:VF1_AVX512VL 1 "vector_operand" "v")
           (match_operand:VF1_AVX512VL 2 "bcst_vector_operand" "vmBr")
           (match_operand:VF1_AVX512VL 3 "vector_operand" "0")]
          UNSPEC_COMPLEX_F_C_MA_PAIR))]
@@ -6727,7 +6735,7 @@ (define_insn "___mask"
   [(set (match_operand:VHF_AVX512VL 0 "register_operand" "=&v")
        (vec_merge:VHF_AVX512VL
          (unspec:VHF_AVX512VL
-           [(match_operand:VHF_AVX512VL 1 "nonimmediate_operand" "%v")
+           [(match_operand:VHF_AVX512VL 1 "nonimmediate_operand" "v")
            (match_operand:VHF_AVX512VL 2 "nonimmediate_operand" "")
            (match_operand:VHF_AVX512VL 3 "register_operand" "0")]
           UNSPEC_COMPLEX_F_C_MA)
@@ -6752,7 +6760,7 @@ (define_expand "cmul3"
 (define_insn "__"
   [(set (match_operand:VHF_AVX512VL 0 "register_operand" "=&v")
        (unspec:VHF_AVX512VL
-         [(match_operand:VHF_AVX512VL 1 "nonimmediate_operand" "%v")
+         [(match_operand:VHF_AVX512VL 1 "nonimmediate_operand" "v")
           (match_operand:VHF_AVX512VL 2 "nonimmediate_operand" "")]
          UNSPEC_COMPLEX_F_C_MUL))]
  "TARGET_AVX512FP16 && "
[Bug target/111333] New: Runtime failure for fcmulcph instrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111333 Bug ID: 111333 Summary: Runtime failure for fcmulcph instrinsic Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

cat main.cpp

#include <immintrin.h>
#include <stdio.h>

__attribute__((optimize("O0")))
auto func0(_Float16 *a, _Float16 *b, int n, _Float16 *c) {
    __m512h rA = _mm512_loadu_ph(a);
    for (int i = 0; i < n; i += 32) {
        __m512h rB = _mm512_loadu_ph(b + i);
        _mm512_storeu_ph(c + i, _mm512_fcmul_pch(rB, rA));
    }
}

__attribute__((optimize("O2")))
auto func1(_Float16 *a, _Float16 *b, int n, _Float16 *c) {
    __m512h rA = _mm512_loadu_ph(a);
    for (int i = 0; i < n; i += 32) {
        __m512h rB = _mm512_loadu_ph(b + i);
        _mm512_storeu_ph(c + i, _mm512_fcmul_pch(rB, rA));
    }
}

int main() {
    int n = 32;
    _Float16 a[n], b[n], c[n];
    for (int i = 1; i <= n; i++) {
        a[i - 1] = i & 1 ? -i : i;
        b[i - 1] = i;
    }
    printf("a = %f + %fi \n", (float)a[0], (float)a[1]);
    printf("b = %f + %fi \n", (float)b[0], (float)b[1]);
    printf("b * conj(a) = %f + %fi \n\n",
           (float)(a[0]*b[0] + a[1]*b[1]),
           (float)(a[0]*b[1] - a[1]*b[0]));

    func0(a, b, n, c);
    for (int i = 0; i < n / 32 * 2; i++) {
        printf("%f ", (float)c[i]);
    }
    printf("\n");

    func1(a, b, n, c);
    for (int i = 0; i < n / 32 * 2; i++) {
        printf("%f ", (float)c[i]);
    }
    printf("\n");
    return 0;
}

g++ -march=sapphirerapids main.cpp -o test

sde -spr -- ./test
a = -1.00 + 2.00i
b = 1.00 + 2.00i
b * conj(a) = 3.00 + -4.00i

3.00 -4.00
3.00 4.00
[Bug target/111225] ICE in curr_insn_transform, unable to generate reloads for xor, since r14-2447-g13c556d6ae84be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111225 --- Comment #2 from Hongtao.liu --- (In reply to Hongtao.liu from comment #1)
> So reload thought CT_SPECIAL_MEMORY always wins for spilled_pseudo_p, but
> here Br should be a vec_dup:mem, which doesn't match spilled_pseudo_p.
>
> case CT_SPECIAL_MEMORY:
>   if (satisfies_memory_constraint_p (op, cn))
>     win = true;
>   else if (spilled_pseudo_p (op))
>     win = true;
>   break;
The vmBr constraint is ok as long as m is matched before Br, but here m is invalid, which exposed the problem. The backend workaround is disabling Br when m is not available; alternatively, the middle-end fix would be removing the win for spilled_pseudo_p (op) in CT_SPECIAL_MEMORY.
[Bug target/111225] ICE in curr_insn_transform, unable to generate reloads for xor, since r14-2447-g13c556d6ae84be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111225 --- Comment #1 from Hongtao.liu --- So reload thought CT_SPECIAL_MEMORY always wins for spilled_pseudo_p, but here Br should be a vec_dup:mem, which doesn't match spilled_pseudo_p.

case CT_SPECIAL_MEMORY:
  if (satisfies_memory_constraint_p (op, cn))
    win = true;
  else if (spilled_pseudo_p (op))
    win = true;
  break;
[Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064 --- Comment #6 from Hongtao.liu ---
> [liuhongt@intel gather_emulation]$ ./gather.out ;./nogather_xmm.out;./nogather_ymm.out
> elapsed time: 1.75997 seconds for gather with 3000 iterations
> elapsed time: 2.42473 seconds for no_gather_xmm with 3000 iterations
> elapsed time: 1.86436 seconds for no_gather_ymm with 3000 iterations
For 510.parest_r, enabling gather emulation for ymm brings back 3% performance, still not as good as the gather instruction due to being throughput bound.
[Bug target/111119] maskload and maskstore for integer modes are oddly conditional on AVX2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111119 --- Comment #5 from Hongtao.liu --- Fixed in GCC14.
[Bug middle-end/111152] ~7-9% performance regression on 510.parest_r SPEC 2017 benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111152 --- Comment #2 from Hongtao.liu ---
> With Zen3 -O2 generic lto pgo the regression is less noticeable (only 4%)
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=694.457.0
Not sure about this part.
[Bug middle-end/111152] ~7-9% performance regression on 510.parest_r SPEC 2017 benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111152 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #1 from Hongtao.liu --- It's PR111064
[Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064 --- Comment #4 from Hongtao.liu --- The loop is like

double
foo (double* a, unsigned* b, double* c, int n)
{
    double sum = 0;
    for (int i = 0; i != n; i++)
    {
        sum += a[i] * c[b[i]];
    }
    return sum;
}

After disabling gather, it uses scalar gather emulation, and the cost model is only profitable for xmm, not ymm, which causes the regression. When manually adding -fno-vect-cost-model, the regression is almost gone.

microbenchmark data:

[liuhongt@intel gather_emulation]$ ./gather.out ;./nogather_xmm.out;./nogather_ymm.out
elapsed time: 1.75997 seconds for gather with 3000 iterations
elapsed time: 2.42473 seconds for no_gather_xmm with 3000 iterations
elapsed time: 1.86436 seconds for no_gather_ymm with 3000 iterations

And I looked at the cost model:

299 _13 + sum_24 1 times scalar_to_vec costs 4 in prologue
300 _13 + sum_24 1 times vector_stmt costs 16 in epilogue
301 _13 + sum_24 1 times vec_to_scalar costs 4 in epilogue
302 _13 + sum_24 2 times vector_stmt costs 32 in body
303 *_3 1 times unaligned_load (misalign -1) costs 16 in body
304 *_3 1 times unaligned_load (misalign -1) costs 16 in body
305 *_7 1 times unaligned_load (misalign -1) costs 16 in body
306 (long unsigned int) _8 2 times vec_promote_demote costs 8 in body
307 *_11 4 times vec_to_scalar costs 80 in body
308 *_11 4 times scalar_load costs 64 in body
309 *_11 1 times vec_construct costs 120 in body
310 *_11 4 times vec_to_scalar costs 80 in body
311 *_11 4 times scalar_load costs 64 in body
312 *_11 1 times vec_construct costs 120 in body
313 _4 * _12 2 times vector_stmt costs 32 in body
314 test.c:6:21: note: operating on full vectors.
315 test.c:6:21: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
316 *_3 4 times scalar_load costs 64 in epilogue
317 *_7 4 times scalar_load costs 48 in epilogue
318 (long unsigned int) _8 4 times scalar_stmt costs 16 in epilogue
319 *_11 4 times scalar_load costs 64 in epilogue
320 _4 * _12 4 times scalar_stmt costs 64 in epilogue
321 _13 + sum_24 4 times scalar_stmt costs 64 in epilogue
322 1 times cond_branch_taken costs 12 in epilogue
323 test.c:6:21: note: Cost model analysis:
324 Vector inside of loop cost: 648
325 Vector prologue cost: 4
326 Vector epilogue cost: 352
327 Scalar iteration cost: 80
328 Scalar outside cost: 24
329 Vector outside cost: 356
330 prologue iterations: 0
331 epilogue iterations: 4
332 test.c:6:21: missed: cost model: the vector iteration cost = 648 divided by the scalar iteration cost = 80 is greater or equal to the vectorization factor = 8.

For the gather emulation part, it tries to generate the following:

2734   [local count: 83964060]:
2735   bnd.23_154 = niters.22_130 >> 2;
2736   _165 = (sizetype) _65;
2737   _166 = _165 * 8;
2738   vectp_a.28_164 = a_18(D) + _166;
2739   _174 = _165 * 4;
2740   vectp_b.32_172 = b_19(D) + _174;
2741   _180 = (sizetype) c_20(D);
2742   vect__33.29_169 = MEM [(double *)vectp_a.28_164];
2743   vectp_a.27_170 = vectp_a.28_164 + 16;
2744   vect__33.30_171 = MEM [(double *)vectp_a.27_170];
2745   vect__30.33_177 = MEM [(unsigned int *)vectp_b.32_172];
2746   vect__29.34_178 = [vec_unpack_lo_expr] vect__30.33_177;
2747   vect__29.34_179 = [vec_unpack_hi_expr] vect__30.33_177;
2748   _181 = BIT_FIELD_REF ;
2749   _182 = _181 * 8;
2750   _183 = _180 + _182;
2751   _184 = (void *) _183;
2752   _185 = MEM[(double *)_184];
2753   _186 = BIT_FIELD_REF ;
2754   _187 = _186 * 8;
2755   _188 = _180 + _187;
2756   _189 = (void *) _188;
2757   _190 = MEM[(double *)_189];
2758   vect__23.35_191 = {_185, _190};
2759   _192 = BIT_FIELD_REF ;
2760   _193 = _192 * 8;
2761   _194 = _180 + _193;
2762   _195 = (void *) _194;
2763   _196 = MEM[(double *)_195];
2764   _197 = BIT_FIELD_REF ;
2765   _198 = _197 * 8;
2766   _199 = _180 + _198;
2767   _200 = (void *) _199;
2768   _201 = MEM[(double *)_200];
2769   vect__23.36_202 = {_196, _201};
2770   vect__15.37_203 = vect__33.29_169 * vect__23.35_191;
2771   vect__15.37_204 = vect__33.30_171 * vect__23.36_202;
2772   vect_sum_14.38_205 = _162 + vect__15.37_203;
2773   vect_sum_14.38_206 = vect__15.37_204 + vect_sum_14.38_205;
2774   _208 = .REDUC_PLUS (vect_sum_14.38_206);
2775   niters_vector_mult_vf.24_155 = bnd.23_154 << 2;
2776   _157 = (int) niters_vector_mult_vf.24_155;
2777   tmp.25_156 = i_60 + _157;
2778   if (niters.22_130 == niters_vector_mult_vf.24_155)

So there is 1 unaligned_load for the index vector (cost 16), 2 vec_promote_demote (cost 8), and 8 vec_to_scalar (cost 160) to extract each index element. But why do we need that? It's just 8 scalar_loads (cost 128) for the indices; there is no need to load them as a vector and then do vec_promote_demote + vec_to_scalar. If we calculated the cost correctly, the total cost would be 595 < 640 (scalar iteration cost 80 * VF 8), and ymm gather emulation would still be profitable.
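The cheaper sequence argued for above, reading each index as a scalar instead of loading an index vector and extracting lanes, can be sketched in plain C for one emulated-gather step (gather_step is an illustrative name, not GCC's internal representation; VF = 8 doubles as in the ymm case):

```c
/* One vector iteration of sum += a[i] * c[b[i]] with emulated gather:
   each index b[i + lane] is a plain scalar load, so there is no index
   vector load and no vec_promote_demote/vec_to_scalar round trip.  */
static void
gather_step (const double *a, const unsigned *b, const double *c,
             int i, double out[8])
{
  for (int lane = 0; lane < 8; lane++)
    out[lane] = a[i + lane] * c[b[i + lane]];
}
```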
[Bug target/111119] maskload and maskstore for integer modes are oddly conditional on AVX2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111119 --- Comment #3 from Hongtao.liu ---
> I see, we can add an alternative like "noavx2,avx2" to generate
> vmaskmovps/pd when avx2 is not available for integer.
It's better to change the assembly output as:

27423   if (TARGET_AVX2)
27424     return "vmaskmov\t{%1, %2, %0|%0, %2, %1}";
27425   else
27426     return "vmaskmov\t{%1, %2, %0|%0, %2, %1}";

No need to add an alternative.
[Bug target/111119] maskload and maskstore for integer modes are oddly conditional on AVX2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111119 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #2 from Hongtao.liu --- (In reply to Richard Biener from comment #0)
> We have
>
> (define_expand "maskload"
>   [(set (match_operand:V48_AVX2 0 "register_operand")
>         (unspec:V48_AVX2
>           [(match_operand: 2 "register_operand")
>            (match_operand:V48_AVX2 1 "memory_operand")]
>           UNSPEC_MASKMOV))]
>   "TARGET_AVX")
>
> and
>
> (define_mode_iterator V48_AVX2
>   [V4SF V2DF
>    V8SF V4DF
>    (V4SI "TARGET_AVX2") (V2DI "TARGET_AVX2")
>    (V8SI "TARGET_AVX2") (V4DI "TARGET_AVX2")])
>
> so for example maskloadv4siv4si is disabled with just -mavx while the actual
> instruction can operate just fine on SImode sized data by pretending its
> SFmode.
>
> check_effective_target_vect_masked_load is conditional on AVX, not AVX2.
>
> With just AVX we can still use SSE2 vectorization for integer operations
> using masked loads/stores from AVX.
I see, we can add an alternative like "noavx2,avx2" to generate vmaskmovps/pd when avx2 is not available for integer.
[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866 --- Comment #8 from Hongtao.liu --- (In reply to Uroš Bizjak from comment #7)
> (In reply to Hongtao.liu from comment #6)
> > > So, the compiler still expects vec_concat/vec_select patterns to be
> > > present.
> >
> > v2df foo_v2df (v2df x)
> > {
> >   return __builtin_shuffle (x, (v2df) { 0, 0 }, (v2di) { 0, 2 });
> > }
> >
> > The testcase is not a typical vec_merge case; for vec_merge, the shuffle
> > index should be {0, 3}. Here it happened to be a vec_merge because the
> > second vector is all zero. And yes, for this case we still need the
> > vec_concat:vec_select pattern.
>
> I guess the original patch is the way to go then.
Yes.
[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866 --- Comment #6 from Hongtao.liu --- (In reply to Uroš Bizjak from comment #4)
> (In reply to Hongtao.liu from comment #3)
> > In the x86 backend's expand_vec_perm_1, we always try vec_merge first for
> > !one_operand_p; expand_vselect_vconcat is only tried when vec_merge fails,
> > which means we'd better use vec_merge instead of vec_select:vec_concat
> > when available in our backend pattern match.
>
> In fact, I tried to convert existing sse2_movq128 patterns to vec_merge, but
> the patch regressed:
>
> -FAIL: gcc.target/i386/sse2-pr94680-2.c scan-assembler movq
> -FAIL: gcc.target/i386/sse2-pr94680-2.c scan-assembler-not pxor
> -FAIL: gcc.target/i386/sse2-pr94680.c scan-assembler-not pxor
> -FAIL: gcc.target/i386/sse2-pr94680.c scan-assembler-times
> (?n)(?:mov|psrldq).*%xmm[0-9] 12
>
> So, the compiler still expects vec_concat/vec_select patterns to be present.

v2df foo_v2df (v2df x)
{
  return __builtin_shuffle (x, (v2df) { 0, 0 }, (v2di) { 0, 2 });
}

The testcase is not a typical vec_merge case; for vec_merge, the shuffle index should be {0, 3}. Here it happened to be a vec_merge because the second vector is all zero. And yes, for this case we still need the vec_concat:vec_select pattern.
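The quoted testcase can be made self-contained to see the shuffle semantics: index 0 selects lane 0 of x, and index 2 selects lane 0 of the concatenated all-zero second vector, i.e. exactly the "insert 0 into the high lane" pattern that should become a plain movq.

```c
typedef double v2df __attribute__((vector_size(16)));
typedef long long v2di __attribute__((vector_size(16)));

/* Indices run over the concatenation {x[0], x[1], 0, 0}: lane 0 takes
   element 0 (from x), lane 1 takes element 2 (a zero).  */
v2df
foo_v2df (v2df x)
{
  return __builtin_shuffle (x, (v2df) { 0, 0 }, (v2di) { 0, 2 });
}
```

Because lane 1 comes from the zero vector, the result equals a vec_merge of x with zero, even though the index {0, 2} is not the canonical vec_merge index {0, 3}.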
[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #3 from Hongtao.liu --- In the x86 backend expand_vec_perm_1, we always try vec_merge first for !one_operand_p; expand_vselect_vconcat is only tried when vec_merge fails, which means we'd better use vec_merge instead of vec_select:vec_concat when it is available as a backend pattern. Also, from the point of view of AVX512 kmask instructions, using vec_merge will help constant propagation.

20107   /* Try the SSE4.1 blend variable merge instructions.  */
20108   if (expand_vec_perm_blend (d))
20109     return true;
20110
20111   /* Try movss/movsd instructions.  */
20112   if (expand_vec_perm_movs (d))
20113     return true;
[Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064 --- Comment #3 from Hongtao.liu --- I didn't find any regression when testing the patch. I guess that's because my tester does a full-copy run and the options are -march=native -Ofast -flto -funroll-loops. Let me verify it.
[Bug target/111062] ICE: in final_scan_insn_1, at final.cc:2808 could not split insn {*andndi_1} with -O -mavx10.1-256 -mavx512bw -mno-avx512f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111062 --- Comment #1 from Hongtao.liu --- (In reply to Zdenek Sojka from comment #0)
> Created attachment 55755 [details]
> reduced testcase
>
> Compiler output:
> $ x86_64-pc-linux-gnu-gcc -O -mavx10.1-256 -mavx512bw -mno-avx512f testcase.c
> cc1: warning:
> '-mno-avx512{f,vl,bw,dq,cd,bf16,fp16,vbmi,vbmi2,vnni,ifma,bitalg,vpopcntdq}'
> are ignored with '-mavx10.1' and above

The warning message can be a little confusing. A better formulation might be: with AVX10.1 enabled, -mno-avx512f does not fully disable AVX512-related instructions.
[Bug libfortran/110966] should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966 --- Comment #4 from Hongtao.liu --- (In reply to anlauf from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > (In reply to Richard Biener from comment #1)
> > > I think matmul is fine with avx512f or avx, so requiring/using only the
> > > base ISA level sounds fine to me.
> >
> > Could be a potential missed optimization.
>
> Do you mean a missed optimization?
>
> Or really wrong code?

A missed optimization.
[Bug target/110979] New: Miss-optimization for O2 fully masked loop on floating point reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979 Bug ID: 110979
Summary: Miss-optimization for O2 fully masked loop on floating point reduction.
Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3
Component: target Assignee: unassigned at gcc dot gnu.org
Reporter: crazylht at gmail dot com Target Milestone: ---

https://godbolt.org/z/YsaesW8zT

float foo3 (float* __restrict a, int n)
{
  float sum = 0.0f;
  for (int i = 0; i != 100; i++)
    sum += a[i];
  return sum;
}

With -O2 -march=znver4 --param vect-partial-vector-usage=2, we get

  [local count: 66437776]:
  # sum_13 = PHI
  # loop_mask_16 = PHI <_54(3), { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 }(2)>
  # ivtmp.13_12 = PHI
  # ivtmp.16_2 = PHI
  # DEBUG i => NULL
  # DEBUG sum => NULL
  # DEBUG BEGIN_STMT
  _4 = (void *) ivtmp.13_12;
  _11 = &MEM [(float *)_4];
  vect__4.6_17 = .MASK_LOAD (_11, 32B, loop_mask_16);
  cond_18 = .VCOND_MASK (loop_mask_16, vect__4.6_17, { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 });
  stmp_sum_10.7_19 = BIT_FIELD_REF ;
  stmp_sum_10.7_20 = sum_13 + stmp_sum_10.7_19;
  stmp_sum_10.7_21 = BIT_FIELD_REF ;
  stmp_sum_10.7_22 = stmp_sum_10.7_20 + stmp_sum_10.7_21;
  stmp_sum_10.7_23 = BIT_FIELD_REF ;
  stmp_sum_10.7_24 = stmp_sum_10.7_22 + stmp_sum_10.7_23;
  stmp_sum_10.7_25 = BIT_FIELD_REF ;
  stmp_sum_10.7_26 = stmp_sum_10.7_24 + stmp_sum_10.7_25;
  stmp_sum_10.7_27 = BIT_FIELD_REF ;
  stmp_sum_10.7_28 = stmp_sum_10.7_26 + stmp_sum_10.7_27;
  stmp_sum_10.7_29 = BIT_FIELD_REF ;
  stmp_sum_10.7_30 = stmp_sum_10.7_28 + stmp_sum_10.7_29;
  stmp_sum_10.7_31 = BIT_FIELD_REF ;
  stmp_sum_10.7_32 = stmp_sum_10.7_30 + stmp_sum_10.7_31;
  stmp_sum_10.7_33 = BIT_FIELD_REF ;
  stmp_sum_10.7_34 = stmp_sum_10.7_32 + stmp_sum_10.7_33;
  stmp_sum_10.7_35 = BIT_FIELD_REF ;
  stmp_sum_10.7_36 = stmp_sum_10.7_34 + stmp_sum_10.7_35;
  stmp_sum_10.7_37 = BIT_FIELD_REF ;
  stmp_sum_10.7_38 = stmp_sum_10.7_36 + stmp_sum_10.7_37;
  stmp_sum_10.7_39 = BIT_FIELD_REF ;
  stmp_sum_10.7_40 = stmp_sum_10.7_38 + stmp_sum_10.7_39;
  stmp_sum_10.7_41 = BIT_FIELD_REF ;
  stmp_sum_10.7_42 = stmp_sum_10.7_40 + stmp_sum_10.7_41;
  stmp_sum_10.7_43 = BIT_FIELD_REF ;
  stmp_sum_10.7_44 = stmp_sum_10.7_42 + stmp_sum_10.7_43;
  stmp_sum_10.7_45 = BIT_FIELD_REF ;
  stmp_sum_10.7_46 = stmp_sum_10.7_44 + stmp_sum_10.7_45;
  stmp_sum_10.7_47 = BIT_FIELD_REF ;
  stmp_sum_10.7_48 = stmp_sum_10.7_46 + stmp_sum_10.7_47;
  stmp_sum_10.7_49 = BIT_FIELD_REF ;
  sum_10 = stmp_sum_10.7_48 + stmp_sum_10.7_49;
  # DEBUG sum => sum_10
  # DEBUG BEGIN_STMT
  # DEBUG i => NULL
  # DEBUG sum => sum_10
  # DEBUG BEGIN_STMT
  _53 = {ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2};
  _54 = _53 > { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
  ivtmp.13_15 = ivtmp.13_12 + 64;
  ivtmp.16_3 = ivtmp.16_2 + 240;
  if (ivtmp.16_3 != 228)

Looks like a cost model issue? For aarch64, it looks fine since it has FADDA (floating-point add strictly-ordered reduction, accumulating in scalar).
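The long add chain above is simply strict-order reduction semantics made explicit. A minimal sketch (generic C, not from the PR) of why the order cannot be changed without -ffast-math: float addition is not associative, and x86 lacks an in-order vector reduction instruction like aarch64's FADDA.

```c
/* Strictly-ordered float reduction, i.e. the scalar semantics that foo3
   must preserve.  Reassociating the adds can change the result, which is
   why the vectorizer emits an elementwise BIT_FIELD_REF + add chain when
   partial vectors are used without -ffast-math.  */
static float
reduce_in_order (const float *a, int n)
{
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += a[i];   /* evaluated left-to-right, no reassociation */
  return sum;
}
```

For example, summing {1e8f, -1e8f, 1.0f} in order yields 1.0f, but the same three values in the order {1e8f, 1.0f, -1e8f} yield 0.0f, since 1.0f is absorbed by 1e8f in float precision.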
[Bug libfortran/110966] should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966 --- Comment #2 from Hongtao.liu --- (In reply to Richard Biener from comment #1)
> I think matmul is fine with avx512f or avx, so requiring/using only the base
> ISA level sounds fine to me.

Could be a potential missed optimization.
[Bug libfortran/110966] New: should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966 Bug ID: 110966
Summary: should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3
Component: libfortran Assignee: unassigned at gcc dot gnu.org
Reporter: crazylht at gmail dot com Target Milestone: ---

In libgfortran/m4/matmul.m4, we have

#ifdef HAVE_AVX512F
'define(`matmul_name',`matmul_'rtype_code`_avx512f')dnl
`static void
'matmul_name` ('rtype` * const restrict retarray,
    'rtype` * const restrict a, 'rtype` * const restrict b,
    int try_blas, int blas_limit, blas_call gemm)
    __attribute__((__target__("avx512f")));
static' include(matmul_internal.m4)dnl
`#endif  /* HAVE_AVX512F */

But target("avx512f") only enables -mavx512f, which covers quite a limited subset of AVX512 capability. Since we now have architecture levels, should we use target("arch=x86-64-v4") instead?
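A hedged C sketch of the difference being proposed (function names are made up; attribute support for "arch=x86-64-v4" assumes a recent GCC): target("avx512f") turns on only the AVX512F ISA bit, while target("arch=x86-64-v4") enables the whole x86-64-v4 feature level (AVX512F/BW/CD/DQ/VL on top of the v3 baseline), so the vectorizer has more instructions to choose from.

```c
#include <stddef.h>

/* Only the AVX512F bit is enabled here, like the current matmul.m4.  */
__attribute__ ((target ("avx512f")))
void
scale_avx512f (double *a, size_t n, double s)
{
  for (size_t i = 0; i < n; i++)
    a[i] *= s;   /* vectorized using plain AVX512F patterns only */
}

/* Whole x86-64-v4 level, as the bug suggests; AVX512VL/BW/DQ etc. are
   also available to the vectorizer.  */
__attribute__ ((target ("arch=x86-64-v4")))
void
scale_v4 (double *a, size_t n, double s)
{
  for (size_t i = 0; i < n; i++)
    a[i] *= s;
}
```

Both compile into separate functions with their own ISA assumptions; a caller must dispatch on CPU capability (e.g. via __builtin_cpu_supports) before calling them, exactly as libgfortran's matmul dispatcher does.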
[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926 --- Comment #10 from Hongtao.liu --- Fixed in GCC14.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #11 from Hongtao.liu --- (In reply to 罗勇刚(Yonggang Luo) from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > > Without `-mbmi` option, gcc can not compile and all other three compilers
> > > can compile.
> >
> > As long as it keeps the semantics (respects zero input), I think this is
> > acceptable.
>
> Yeap, it's acceptable, but consistency with Clang/MSVC/ICL would be better.
> That would make cross-platform code easier; besides, GCC also works for
> WIN32, which needs GCC to be consistent with MSVC.

Sorry for the confusion, I meant that generating code like

f(int, int):                  # @f(int, int)
        test    edi, edi
        je      .LBB0_2
        rep bsf eax, edi
        ret
.LBB0_2:
        mov     eax, 32
        ret

w/o -mbmi is acceptable as long as it respects zero input.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #9 from Hongtao.liu ---
> There is a redundant xor instruction,

There's a false-dependency issue on some specific processors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

> Without `-mbmi` option, gcc can not compile and all other three compilers
> can compile.

As long as it keeps the semantics (respects zero input), I think this is acceptable.
[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926 --- Comment #8 from Hongtao.liu --- (In reply to Alexander Monakov from comment #7)
> Thanks for identifying the problem. Please don't rename the argument to
> 'op_mask' though: the parameter itself is not a mask, it's an eight-bit
> control word of the vpternlog instruction (holding the logic table of a
> three-operand Boolean function). The function derives a three-bit mask from
> it.

I'll rename it to ternlog_imm8 to avoid confusion.
[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926 --- Comment #6 from Hongtao.liu --- (In reply to Hongtao.liu from comment #5)
> I'm working on a patch.

 int
-vpternlog_redundant_operand_mask (rtx *operands)
+vpternlog_redundant_operand_mask (rtx op_mask)
 {
   int mask = 0;
-  int imm8 = XINT (operands[4], 0);
+  int imm8 = INTVAL (op_mask);

We should use INTVAL instead of XINT.
[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926 --- Comment #5 from Hongtao.liu --- I'm working on a patch.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #7 from Hongtao.liu --- (In reply to 罗勇刚(Yonggang Luo) from comment #6)
> MSVC also added it; clang seems to have an optimization issue, but MSVC
> doesn't have that.

No, I think what clang does is correct:

f(int, int):                  # @f(int, int)
        test    edi, edi      # --- when source operand is zero.
        je      .LBB0_2
        rep bsf eax, edi
        ret
.LBB0_2:
        mov     eax, 32
        ret

The key difference between the TZCNT and BSF instructions is that TZCNT provides the operand size as output when the source operand is zero, while in the case of BSF, if the source operand is zero, the contents of the destination operand are undefined.
https://godbolt.org/z/s74dfdWP4
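A small reference sketch (plain C, hypothetical helper name) of the semantics just described: TZCNT is fully defined for a zero input (it returns the operand size, 32), while BSF leaves the destination undefined, which is why a BSF-based expansion must branch on zero, as the clang code above does with test/je.

```c
/* Reference semantics of _tzcnt_u32, including the zero-input case.  */
static unsigned
tzcnt_u32_ref (unsigned x)
{
  if (x == 0)
    return 32;                          /* TZCNT's defined zero result */
  return (unsigned) __builtin_ctz (x);  /* well-defined: x is nonzero  */
}
```

The `if (x == 0)` branch is exactly the part BSF cannot provide on its own and is what the test/je pair in the generated code implements.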
[Bug target/105504] Fails to break dependency for vcvtss2sd xmm, xmm, mem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #8 from Hongtao.liu --- (In reply to Eric Gallager from comment #7) > (In reply to CVS Commits from comment #6) > > The master branch has been updated by hongtao Liu : > > > > https://gcc.gnu.org/g:5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 > > > > commit r13-1009-g5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 > > Author: liuhongt > > Date: Mon May 30 15:30:51 2022 +0800 > > > > Disparages SSE_REGS alternatives sligntly with ?v instead of *v in > > *mov{si,di}_internal. > > > > So alternative v won't be igored in record_reg_classess. > > > > Similar for *r alternatives in some vector patterns. > > > > It helps testcase in the PR, also RA now makes better decisions for > > gcc.target/i386/extract-insert-combining.c > > > > movd%esi, %xmm0 > > movd%edi, %xmm1 > > - movl%esi, -12(%rsp) > > paddd %xmm0, %xmm1 > > pinsrd $0, %esi, %xmm0 > > paddd %xmm1, %xmm0 > > > > The patch has no big impact on SPEC2017 for both O2 and Ofast > > march=native run. > > > > And I noticed there's some changes in SPEC2017 from code like > > > > mov mem, %eax > > vmovd %eax, %xmm0 > > .. > > mov %eax, 64(%rsp) > > > > to > > > > vmovd mem, %xmm0 > > .. > > vmovd %xmm0, 64(%rsp) > > > > Which should be exactly what we want? > > > > gcc/ChangeLog: > > > > PR target/105513 > > PR target/105504 > > * config/i386/i386.md (*movsi_internal): Change alternative > > from *v to ?v. > > (*movdi_internal): Ditto. > > * config/i386/sse.md (vec_set_0): Change alternative *r > > to ?r. > > (*vec_extractv4sf_mem): Ditto. > > (*vec_extracthf): Ditto. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/pr105513-1.c: New test. > > * gcc.target/i386/extract-insert-combining.c: Add new > > scan-assembler-not for spill. > > Did this fix it? Yes.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #5 from Hongtao.liu --- Maybe the source code can be changed to

int f(int a, int b)
{
#ifdef __BMI__
  return _tzcnt_u32 (a);
#else
  return _bit_scan_forward (a);
#endif
}

But it looks like clang/MSVC don't support _bit_scan_forward; that should be a bug for them since it's in the intrinsics guide.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #4 from Hongtao.liu --- (In reply to Hongtao.liu from comment #3)
> But there's a difference between TZCNT and BSF:
> TZCNT provides the operand size as output when the source operand is zero,
> while BSF leaves the destination undefined for a zero source.
>
> Clang looks correct since it also handles the zero case; ICC seems wrong, it
> just generates
> https://godbolt.org/z/WvrsTrjWr

MSVC seems wrong too.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #3 from Hongtao.liu --- But there's a difference between TZCNT and BSF:

The key difference between the TZCNT and BSF instructions is that TZCNT provides the operand size as output when the source operand is zero, while BSF leaves the destination undefined for a zero source.

Clang looks correct since it also handles the zero case; ICC seems wrong, it just generates
https://godbolt.org/z/WvrsTrjWr
[Bug target/110762] [11/12/13 Regression] inappropriate use of SSE (or AVX) insns for v2sf mode operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762 --- Comment #23 from Hongtao.liu --- (In reply to Uroš Bizjak from comment #22) > It looks to me that partial vector half-float instructions have the same > issue. Yes, I'll take a look.
[Bug target/81904] FMA and addsub instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904 --- Comment #7 from Hongtao.liu --- > > to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly > creates .VEC_ADDSUB when possible). Let's put it under -fno-trapping-math.
[Bug target/81904] FMA and addsub instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904 --- Comment #5 from Hongtao.liu --- (In reply to Richard Biener from comment #1)
> Hmm, I think the issue is we see
>
> f (__m128d x, __m128d y, __m128d z)
> {
>   vector(2) double _4;
>   vector(2) double _6;
>
>   [100.00%]:
>   _4 = x_2(D) * y_3(D);
>   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]

We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB -> VEC_FMADDSUB in match.pd?
[Bug target/81904] FMA and addsub instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904 --- Comment #4 from Hongtao.liu --- (In reply to Richard Biener from comment #2)
> __m128d h(__m128d x, __m128d y, __m128d z){
>   __m128d tem = _mm_mul_pd (x,y);
>   __m128d tem2 = tem + z;
>   __m128d tem3 = tem - z;
>   return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
> }
>
> doesn't quite work (the combiner pattern for fmaddsub is missing). Tried
> {0, 2} as well.
>
> :
> .LFB5021:
>         .cfi_startproc
>         vmovapd %xmm0, %xmm3
>         vfmsub132pd     %xmm1, %xmm2, %xmm0
>         vfmadd132pd     %xmm1, %xmm2, %xmm3
>         vshufpd $2, %xmm0, %xmm3, %xmm0

tem2_6 = .FMA (x_2(D), y_3(D), z_5(D));
# DEBUG tem2 => tem2_6
# DEBUG BEGIN_STMT
tem3_7 = .FMS (x_2(D), y_3(D), z_5(D));
# DEBUG tem3 => NULL
# DEBUG BEGIN_STMT
_8 = VEC_PERM_EXPR ;

Can it be handled in match.pd? Rewriting the fmaddsub pattern into a vec_merge of fma/fms looks too complex. Similarly for VEC_ADDSUB + MUL -> VEC_FMADDSUB.
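A hedged reference sketch (plain C, hypothetical helper name) of the fused operation being discussed. Per the usual addsub lane convention, _mm_addsub_pd / .VEC_ADDSUB subtracts in even lanes and adds in odd lanes, so the fused .VEC_FMADDSUB computes a*b - c in even lanes and a*b + c in odd lanes:

```c
/* Scalar reference for the fmaddsub lane pattern: even lanes subtract,
   odd lanes add, after the multiply.  */
static void
fmaddsub_ref (const double *a, const double *b, const double *c,
              double *dst, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (i & 1) ? a[i] * b[i] + c[i]    /* odd lane: add       */
                     : a[i] * b[i] - c[i];   /* even lane: subtract */
}
```

This is the lane pattern a combiner or match.pd rule would have to recognize from the FMA/FMS pair plus the VEC_PERM_EXPR selecting alternating lanes.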
[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #9 from Hongtao.liu --- (In reply to Uroš Bizjak from comment #8)
> (In reply to Richard Biener from comment #6)
> > Do we know whether we could in theory improve the sanitizing by optimization
> > without -funsafe-math-optimizations (I think -fno-trapping-math,
> > -ffinite-math-only -fno-signalling-nans should be a better guard?)?
>
> Regarding the sanitizing, we can remove all sanitizing MOVQ instructions
> between trapping instructions (IOW, the result of ADDPS is guaranteed to
> have zeros in the high part outside V2SF, so MOVQ is unnecessary in front of
> a follow-up MULPS).
>
> I think that some instruction back-walking pass on the RTL insn stream would
> be able to identify these unnecessary instructions and remove them.

A V2SFmode operand can be produced by direct patterns or by a SUBREG. I'm thinking about only sanitizing those V2SFmode operations when there's a subreg in the source operand, and making sure every other pattern which sets a V2SFmode dest clears the upper bits (including mov_internal, vec_concatv2sf_sse4_1, sse_storehps, *vec_concatv2sf_sse).

For mov_internal, we can just set alternative (v,v) with mode DI, then it will use vmovq; for the other alternatives which set sse_regs, the instructions have already cleared the upper bits.

For vec_concatv2sf_sse4_1/sse_storehps/*vec_concatv2sf_sse, we can change them into define_insn_and_split, splitting into a V4SF instruction (like we did for those V2SFmode patterns), and use a SUBREG for the dest or explicitly sanitize the dest.

BTW it looks like *vec_concatv2df_sse4_1 can be merged into *vec_concatv2sf_sse.