[Bug target/112532] [14 Regression] ICE: in extract_insn, at recog.cc:2804 (unrecognizable insn: vec_duplicate:V4HI) with -O -msse4 since r14-5388-g2794d510b979be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112532 --- Comment #3 from Hongtao.liu --- mine.
gcc-bugs@gcc.gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112104 --- Comment #5 from Hongtao.liu --- (In reply to Andrew Pinski from comment #4) > Fixed via r14-5428-gfd1596f9962569afff6c9298a7c79686c6950bef . Note, my patch only handles a constant tripcount for XOR; it does not do the transformation when the tripcount is variable.
[Bug target/112374] [14 Regression] `--with-arch=skylake-avx512 --with-cpu=skylake-avx512` causes a comparison failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374 --- Comment #12 from Hongtao.liu --- > So the testsuite without bootstrap is really unchanged? We still have a Yes, no extra regression observed from the gcc testsuite (both w/ and w/o --with-arch=skylake-avx512 --with-cpu=skylake-avx512 in configure) except for the one reported in PR112361, which you have already fixed.
[Bug target/112374] [14 Regression] `--with-arch=skylake-avx512 --with-cpu=skylake-avx512` causes a comparison failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374 --- Comment #10 from Hongtao.liu --- The patch below can pass bootstrap with --with-arch=skylake-avx512 --with-cpu=skylake-avx512, but I didn't observe an obvious typo/bug in the removed pattern.

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 9eefe9ed45b..b6423037ad1 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -17760,24 +17760,6 @@ (define_expand "3"
   DONE;
 })

-(define_expand "cond_"
-  [(set (match_operand:VI48_AVX512VL 0 "register_operand")
-	(vec_merge:VI48_AVX512VL
-	  (any_logic:VI48_AVX512VL
-	    (match_operand:VI48_AVX512VL 2 "vector_operand")
-	    (match_operand:VI48_AVX512VL 3 "vector_operand"))
-	  (match_operand:VI48_AVX512VL 4 "nonimm_or_0_operand")
-	  (match_operand: 1 "register_operand")))]
-  "TARGET_AVX512F"
-{
-  emit_insn (gen_3_mask (operands[0],
-			 operands[2],
-			 operands[3],
-			 operands[4],
-			 operands[1]));
-  DONE;
-})
-
 (define_expand "3_mask"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand")
	(vec_merge:VI48_AVX512VL
[Bug fortran/106402] half precision is not supported by gfortran (real*2).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106402 --- Comment #3 from Hongtao.liu --- (In reply to Thomas Koenig from comment #2) > It would make sense to have it, I guess. If somebody has access > to the relevant hardware, it could also be tested :-) x86 supports _Float16 operations with float instructions for TARGET_SSE2 and above, so for preliminary validation any processor with SSE2 should be enough.
[Bug libfortran/110966] should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966 --- Comment #6 from Hongtao.liu --- (In reply to Thomas Koenig from comment #5) > (In reply to Hongtao.liu from comment #4) > > (In reply to anlauf from comment #3) > > > (In reply to Hongtao.liu from comment #2) > > > > (In reply to Richard Biener from comment #1) > > > > > I think matmul is fine with avx512f or avx, so requiring/using only > > > > > the base > > > > > ISA level sounds fine to me. > > > > > > > > Could be potential miss-optimization. > > > > > > Do you mean a missed optimzation? > > > > > > Or really wrong code? > > > > a missed optimzation. > > Are there benchmarks which show that the code would indeed run > faster? Not yet, it's just better in theory. But considering that there might be some tweaks regarding x86-64-v4, I think it's best to leave it unchanged for the time being.
[Bug tree-optimization/112496] [13/14 Regression] ICE: in vectorizable_nonlinear_induction, at tree-vect-loop.cc with bit fields
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112496 --- Comment #3 from Hongtao.liu --- (In reply to Richard Biener from comment #2) > if (TREE_CODE (init_expr) == INTEGER_CST) > init_expr = fold_convert (TREE_TYPE (vectype), init_expr); > else > gcc_assert (tree_nop_conversion_p (TREE_TYPE (vectype), >TREE_TYPE (init_expr))); > > and init_expr is a 24 bit integer type while vectype has 32bit components. > > The "fix" is to bail out instead of asserting. Agree.
[Bug target/112374] [14 Regression] `--with-arch=skylake-avx512 --with-cpu=skylake-avx512` causes a comparison failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374 --- Comment #9 from Hongtao.liu --- When I remove all cond_ patterns, bootstrap passes. I'll continue to root-cause the exact pattern which causes the bootstrap failure.
[Bug target/112443] [12/13/14 Regression] Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443 --- Comment #7 from Hongtao.liu --- Should be fixed in GCC 14/GCC 13/GCC 12.
[Bug target/112443] [12/13/14 Regression] Misoptimization of _mm256_blendv_epi8 intrinsic on avx512bw+avx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112443 --- Comment #1 from Hongtao.liu --- The patch below can fix that; there's a typo in 2 splitters.

@@ -17082,7 +17082,7 @@ (define_insn_and_split "*avx2_pcmp3_4"
	     (match_dup 4))]
	    UNSPEC_BLENDV))]
 {
-  if (INTVAL (operands[5]) == 1)
+  if (INTVAL (operands[5]) == 5)
     std::swap (operands[1], operands[2]);
   operands[3] = gen_lowpart (mode, operands[3]);
 })
@@ -17112,7 +17112,7 @@ (define_insn_and_split "*avx2_pcmp3_5"
	     (match_dup 4))]
	    UNSPEC_BLENDV))]
 {
-  if (INTVAL (operands[5]) == 1)
+  if (INTVAL (operands[5]) == 5)
     std::swap (operands[1], operands[2]);
 })
[Bug bootstrap/112441] Comparing stages 2 and 3 Bootstrap comparison failure!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112441 Hongtao.liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from Hongtao.liu --- dup *** This bug has been marked as a duplicate of bug 112374 ***
[Bug target/112374] [14 Regression] `--with-arch=skylake-avx512 --with-cpu=skylake-avx512` causes a comparison failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112374 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #7 from Hongtao.liu --- *** Bug 112441 has been marked as a duplicate of this bug. ***
[Bug bootstrap/112441] New: Comparing stages 2 and 3 Bootstrap comparison failure!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112441 Bug ID: 112441 Summary: Comparing stages 2 and 3 Bootstrap comparison failure! Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

I hit a bootstrap comparison failure with r14-5243-g80f466aa1cce27. My GCC configure is --with-cpu=native --with-arch=native --disable-libsanitizer --enable-checking=yes,rtl,extra --enable-clocale and the machine is cascadelake.

make[9]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/32/libstdc++-v3'
make[8]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/32/libstdc++-v3'
make[7]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/32/libstdc++-v3'
make[6]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/libstdc++-v3'
make[5]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/libstdc++-v3'
make[4]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/libstdc++-v3'
make[3]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap/x86_64-pc-linux-gnu/libstdc++-v3'
make[2]: Leaving directory '/export/users/liuhongt/tools-build/build_intel-innersource_master_native_bootstrap'
[Bug target/112393] [14 Regression] ICE: in gen_reg_rtx, at emit-rtl.cc:1208 with -mavx5124fmaps -Wuninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112393 --- Comment #5 from Hongtao.liu --- Fixed.
[Bug target/112393] [14 Regression] ICE: in gen_reg_rtx, at emit-rtl.cc:1208 with -mavx5124fmaps -Wuninitialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112393 --- Comment #3 from Hongtao.liu --- Yes, it should return true if d->testing_p instead of generating RTL code.
[Bug rtl-optimization/108707] suboptimal allocation with same memory op for many different instructions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108707 Hongtao.liu changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #8 from Hongtao.liu --- Fixed in GCC14.
[Bug tree-optimization/102383] Missing optimization for PRE after enable O2 vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102383 --- Comment #5 from Hongtao.liu --- It's fixed in GCC12.1
[Bug target/105034] [11/12/13/14 regression] Suboptimal codegen for min/max with -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105034 Hongtao.liu changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #8 from Hongtao.liu --- Looks like it's fixed in latest trunk.
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 101956, which changed state. Bug 101956 Summary: Miss vectorization from v4hi to v4df https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101956 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/101956] Miss vectorization from v4hi to v4df
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101956 Hongtao.liu changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from Hongtao.liu --- Fixed by r14-2007-g6f19cf7526168f
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015 --- Comment #4 from Hongtao.liu --- > So here we have a reduction for MAX_EXPR, but there's 2 MAX_EXPR which can > be merge together with MAX_EXPR

Created PR 112324.
[Bug middle-end/112324] New: phiopt fail to recog if (b < 0) max = MAX(-b, max); else max = MAX (b, max) into max = MAX (ABS(b), max)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112324 Bug ID: 112324 Summary: phiopt fail to recog if (b < 0) max = MAX(-b, max); else max = MAX (b, max) into max = MAX (ABS(b), max) Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int foo (int n, int* a)
{
  int max = 0;
  for (int i = 0; i != n; i++)
    {
      int tmp = a[i];
      if (tmp < 0)
	max = MAX (-tmp, max);
      else
	max = MAX (tmp, max);
    }
  return max;
}

int foo1 (int n, int* a)
{
  int max = 0;
  for (int i = 0; i != n; i++)
    {
      int tmp = a[i];
      max = MAX ((tmp < 0 ? -tmp : tmp), max);
    }
  return max;
}

foo should be the same as foo1, but gcc fails to recognize ABS_EXPR in foo. It's from pr110015 (originally from source code in openjpeg).
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015 --- Comment #3 from Hongtao.liu ---

test.c:85:23: note: vect_is_simple_use: operand max_38 = PHI , type of def: unknown
test.c:85:23: missed: Unsupported pattern.
test.c:62:24: missed: not vectorized: unsupported use in stmt.
test.c:85:23: missed: unexpected pattern.
test.c:85:23: note: * Analysis failed with vector mode V8SI
test.c:85:23: note: * The result for vector mode V32QI would be the same
test.c:85:23: missed: couldn't vectorize loop
test.c:65:13: note: vectorized 0 loops in function.
Removing basic block 5
;; basic block 5, loop depth 2
;;  pred: 16
;; 43
# max_38 = PHI
# i_42 = PHI
# datap_44 = PHI
tmp_24 = *datap_44;
_35 = tmp_24 < 0;
_56 = (unsigned int) tmp_24;
_51 = -_56;
_1 = (int) _51;
_25 = MAX_EXPR <_1, max_38>;
_31 = _1 | -2147483648;
iftmp.0_27 = (unsigned int) _31;
.MASK_STORE (datap_44, 8B, _35, iftmp.0_27);
_26 = MAX_EXPR ;
max_5 = _35 ? _25 : _26;
i_29 = i_42 + 1;
datap_30 = datap_44 + 4;
if (w_22 > i_29)
  goto ; [89.00%]
else
  goto ; [11.00%]
;;  succ: 16

So here we have a reduction for MAX_EXPR, but there are 2 MAX_EXPRs which can be merged together with another MAX_EXPR.

> manually change the loop to below, then it can be vectorized.

for (j = 0; j < t1->h; ++j) {
    const OPJ_UINT32 w = t1->w;
    for (i = 0; i < w; ++i, ++datap) {
        OPJ_INT32 tmp = *datap;
        if (tmp < 0) {
            OPJ_UINT32 tmp_unsigned;
            tmp_unsigned = opj_to_smr(tmp);
            memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
            tmp = -tmp;
        }
        max = opj_int_max(max, tmp);
    }
}

Maybe it's related to phiopt?
[Bug target/112276] [14 Regression] wrong code with -O2 -msse4.2 since r14-4964-g7eed861e8ca3f5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112276 --- Comment #8 from Hongtao.liu --- Fixed.
gcc-bugs@gcc.gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112104 --- Comment #3 from Hongtao.liu --- We already have analyze_and_compute_bitop_with_inv_effect, but it only works when inv is an SSA_NAME, it should be extended to constant.
[Bug target/112276] [14 Regression] wrong code with -O2 -msse4.2 since r14-4964-g7eed861e8ca3f5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112276 --- Comment #4 from Hongtao.liu ---

-(define_split
-  [(set (match_operand:V2HI 0 "register_operand")
-	(eq:V2HI
-	  (eq:V2HI
-	    (us_minus:V2HI
-	      (match_operand:V2HI 1 "register_operand")
-	      (match_operand:V2HI 2 "register_operand"))
-	    (match_operand:V2HI 3 "const0_operand"))
-	  (match_operand:V2HI 4 "const0_operand")))]
-  "TARGET_SSE4_1"
-  [(set (match_dup 0)
-	(umin:V2HI (match_dup 1) (match_dup 2)))
-   (set (match_dup 0)
-	(eq:V2HI (match_dup 0) (match_dup 2)))])

The splitter is wrong when op1 == op2 (the original pattern returns 0; after the split, it returns 1), so remove the splitter.
[Bug target/112276] [14 Regression] wrong code with -O2 -msse4.2 since r14-4964-g7eed861e8ca3f5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112276 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #3 from Hongtao.liu --- Mine, I'll take a look.
[Bug tree-optimization/111972] [14 regression] missed vectorization for bool a = j != 1; j = (long int)a;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111972 --- Comment #7 from Hongtao.liu --- (In reply to Andrew Pinski from comment #3) > First off does this even make sense to vectorize but rather do some kind of > scalar reduction with respect to j = j^1 here . Filed PR 112104 for that. > > Basically vectorizing this loop is a waste compared to that. Yes, it's always zero; it would be nice if the middle end could optimize the whole loop away. So for this PR, it's more about the misoptimization of the redundant loop (it would be better to finalize the induction variable with a simple assignment) than about vectorization.
[Bug tree-optimization/111972] [14 regression] missed vectorization for bool a = j != 1; j = (long int)a;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111972 --- Comment #6 from Hongtao.liu --- (In reply to Andrew Pinski from comment #5) > Oh this is the original code: > https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/whets.c > Yes, it's from unixbench.
[Bug tree-optimization/111833] [13/14 Regression] GCC: 14: hangs on a simple for loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111833 --- Comment #5 from Hongtao.liu --- It's the same issue as PR111820, thus should be fixed.
[Bug tree-optimization/111820] [13 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #15 from Hongtao.liu --- (In reply to Richard Biener from comment #13) > (In reply to Hongtao.liu from comment #12) > > Fixed in GCC14, not sure if we want to backport the patch. > > If so, the patch needs to be adjusted since GCC13 doesn't support auto_mpz. > > Yes, we want to backport. Also fixed in GCC13.
[Bug tree-optimization/111972] [14 regression] missed vectorization for bool a = j != 1; j = (long int)a;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111972 Hongtao.liu changed: What|Removed |Added CC||pinskia at gcc dot gnu.org Component|middle-end |tree-optimization --- Comment #1 from Hongtao.liu --- The phiopt change is caused by r14-338-g1dd154f6407658d46faa4d21bfec04fc2551506a
[Bug middle-end/111972] New: [14 regression] missed vectorization for bool a = j != 1; j = (long int)a;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111972 Bug ID: 111972 Summary: [14 regression] missed vectorization for bool a = j != 1; j = (long int)a; Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

cat test.c

double foo()
{
  long n3 = 345, xtra = 7270;
  long i,ix;
  long j;
  double Check;
  /* Section 3, Conditional jumps */
  j = 0;
  {
    for (ix=0; ix<xtra; ix++)
      {
	for(i=0; i<n3; i++)
	  {
	    if(j==1) j = 2;
	    else j = 3;
	    if(j>2) j = 0;
	    else j = 1;
	    if(j<1) j = 1;
	    else j = 0;
	  }
      }
  }
  Check = Check + (double)j;
  return Check;
}

The difference between the GCC 13 dump and the GCC 14 dump: in GCC 13 we have

   [local count: 1063004411]:
  # i_16 = PHI
  # j_18 = PHI <_7(8), j_21(5)>
  # ivtmp_15 = PHI
  _7 = j_18 ^ 1;
  i_13 = i_16 + 1;
  ivtmp_6 = ivtmp_15 - 1;
  if (ivtmp_6 != 0)
    goto ; [99.00%]
  else
    goto ; [1.00%]

while in GCC 14 we have

   [local count: 1063004410]:
  # i_17 = PHI
  # j_19 = PHI <_14(8), j_22(5)>
  # ivtmp_16 = PHI
  _9 = j_19 != 1;
  _14 = (long int) _9;
  i_13 = i_17 + 1;
  ivtmp_15 = ivtmp_16 - 1;
  if (ivtmp_15 != 0)
    goto ; [98.99%]
  else
    goto ; [1.01%]

The vectorizer can handle

  _7 = j_18 ^ 1;

but not

  _9 = j_19 != 1;
  _14 = (long int) _9;

../test.C:11:18: note: vect_is_simple_use: operand j_19 != 1, type of def: internal
../test.C:11:18: note: mark relevant 2, live 0: _9 = j_19 != 1;
../test.C:11:18: note: worklist: examine stmt: _9 = j_19 != 1;
../test.C:11:18: note: vect_is_simple_use: operand j_19 = PHI <_14(8), j_22(5)>, type of def: unknown
../test.C:11:18: missed: Unsupported pattern.
../test.C:15:6: missed: not vectorized: unsupported use in stmt.
../test.C:11:18: missed: unexpected pattern.

The difference comes from phiopt2.
[Bug target/111874] Missed mask_fold_left_plus with AVX512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874 --- Comment #3 from Hongtao.liu --- > For the case of conditional (or loop masked) fold-left reductions the scalar > fallback isn't implemented. But AVX512 has vpcompress that could be used > to implement a more efficient sequence for a masked fold-left, possibly > using a loop and population count of the mask. There are extra kmov + vpcompress + popcnt instructions; I'm afraid the performance could be worse than the scalar version.
[Bug target/111889] [14 Regression] 128/256 intrins could not be used with only specifying "no-evex512, avx512vl" in function attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111889 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #4 from Hongtao.liu --- Maybe we should disable "no-evex512" for the target attribute and only support it as a command-line option.
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #12 from Hongtao.liu --- Fixed in GCC14, not sure if we want to backport the patch. If so, the patch needs to be adjusted since GCC13 doesn't support auto_mpz.
[Bug target/111874] Missed mask_fold_left_plus with AVX512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874 --- Comment #1 from Hongtao.liu --- For integer, we have _mm512_mask_reduce_add_epi32 defined as

extern __inline int
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
{
  __A = _mm512_maskz_mov_epi32 (__U, __A);
  __MM512_REDUCE_OP (+);
}

#undef __MM512_REDUCE_OP
#define __MM512_REDUCE_OP(op) \
  __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1);	\
  __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0);	\
  __m256i __T3 = (__m256i) (__T1 op __T2);			\
  __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1);	\
  __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0);	\
  __v4si __T6 = __T4 op __T5;					\
  __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 }); \
  __v4si __T8 = __T6 op __T7;					\
  return __T8[0] op __T8[1]

There's a corresponding floating point version, but it doesn't do in-order adds.
[Bug tree-optimization/111859] 521.wrf_r build failure with -O2 -march=cascadelake --param vect-partial-vector-usage=2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111859 --- Comment #1 from Hongtao.liu --- Could be reproduced with:

tar zxvf 521.tar.gz
cd 521
gfortran module_advect_em.fppizedi.f90 -S -O2 -march=cascadelake --param vect-partial-vector-usage=2 -std=legacy -fconvert=big-endian
[Bug tree-optimization/111859] New: 521.wrf_r build failure with -O2 -march=cascadelake --param vect-partial-vector-usage=2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111859 Bug ID: 111859 Summary: 521.wrf_r build failure with -O2 -march=cascadelake --param vect-partial-vector-usage=2 Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: ice-on-valid-code Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: --- Target: x86_64-*-* i?86-*-* Created attachment 56136 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56136&action=edit reproduce source code

internal compiler error: in get_vectype_for_scalar_type, at tree-vect-stmts.cc:13153
0xe07a6d get_vectype_for_scalar_type(vec_info*, tree_node*, unsigned int)
	../gcc/tree-vect-stmts.cc:13153
0x277afe1 get_mask_type_for_scalar_type(vec_info*, tree_node*, unsigned int)
	../gcc/tree-vect-stmts.cc:13223
0x277afe1 vect_check_scalar_mask
	../gcc/tree-vect-stmts.cc:2450
0x277b584 vectorizable_call
	../gcc/tree-vect-stmts.cc:3480
0x278ceaf vect_analyze_stmt(vec_info*, _stmt_vec_info*, bool*, _slp_tree*, _slp_instance*, vec*)
	../gcc/tree-vect-stmts.cc:12785
0x18b0430 vect_slp_analyze_node_operations_1
	../gcc/tree-vect-slp.cc:6066
0x18b0430 vect_slp_analyze_node_operations
	../gcc/tree-vect-slp.cc:6265
0x18b0364 vect_slp_analyze_node_operations
	../gcc/tree-vect-slp.cc:6244
0x18b20fb vect_slp_analyze_operations(vec_info*)
	../gcc/tree-vect-slp.cc:6516
0x18b8792 vect_slp_analyze_bb_1
	../gcc/tree-vect-slp.cc:7520
0x18b8792 vect_slp_region
	../gcc/tree-vect-slp.cc:7567
0x18ba7e9 vect_slp_bbs
	../gcc/tree-vect-slp.cc:7775
0x18bab5b vect_slp_function(function*)
	../gcc/tree-vect-slp.cc:7854
0x18c4ee1 execute
	../gcc/tree-vectorizer.cc:1529
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #9 from Hongtao.liu --- > But we end up here with niters_skip being INTEGER_CST and .. > > > 1421 || (!vect_use_loop_mask_for_alignment_p (loop_vinfo) > > possibly vect_use_loop_mask_for_alignment_p. Note > LOOP_VINFO_PEELING_FOR_ALIGNMENT < 0 simply means the amount of > peeling is unknown. > > But I wonder how we run into this on x86 without enabling > loop masking ... > > > 1422 && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0)) > > 1423{ > > 1424 if (dump_enabled_p ()) > > 1425dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > 1426 "Peeling for alignement is not supported" > > 1427 " for nonlinear induction when niters_skip" > > 1428 " is not constant.\n"); > > 1429 return false; > > 1430}

Can you point out where it's assigned as negative? I see LOOP_VINFO_MASK_SKIP_NITERS is only assigned in vect_prepare_for_masked_peels: when LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0 it's assigned as vf - npeel (will npeel > vf?), otherwise it's assigned in get_misalign_in_elems and should be positive.

  HOST_WIDE_INT elem_size
    = int_cst_value (TYPE_SIZE_UNIT (TREE_TYPE (vectype)));
  tree elem_size_log = build_int_cst (type, exact_log2 (elem_size));

  /* Create:  misalign_in_bytes = addr & (target_align - 1).  */
  tree int_start_addr = fold_convert (type, start_addr);
  tree misalign_in_bytes = fold_build2 (BIT_AND_EXPR, type, int_start_addr,
					target_align_minus_1);

  /* Create:  misalign_in_elems = misalign_in_bytes / element_size.  */
  tree misalign_in_elems = fold_build2 (RSHIFT_EXPR, type, misalign_in_bytes,
					elem_size_log);

  return misalign_in_elems;

void
vect_prepare_for_masked_peels (loop_vec_info loop_vinfo)
{
  tree misalign_in_elems;
  tree type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));

  gcc_assert (vect_use_loop_mask_for_alignment_p (loop_vinfo));

  /* From the information recorded in LOOP_VINFO get the number of iterations
     that need to be skipped via masking.  */
  if (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) > 0)
    {
      poly_int64 misalign = (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
			     - LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo));
      misalign_in_elems = build_int_cst (type, misalign);
    }
  else
    {
      gimple_seq seq1 = NULL, seq2 = NULL;
      misalign_in_elems = get_misalign_in_elems (&seq1, loop_vinfo);
      misalign_in_elems = fold_convert (type, misalign_in_elems);
      misalign_in_elems = force_gimple_operand (misalign_in_elems,
						&seq2, true, NULL_TREE);
      gimple_seq_add_seq (&seq1, seq2);
      if (seq1)
	{
	  edge pe = loop_preheader_edge (LOOP_VINFO_LOOP (loop_vinfo));
	  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq1);
	  gcc_assert (!new_bb);
	}
    }

  if (dump_enabled_p ())
    dump_printf_loc (MSG_NOTE, vect_location,
		     "misalignment for fully-masked loop: %T\n",
		     misalign_in_elems);

  LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo) = misalign_in_elems;

  vect_update_inits_of_drs (loop_vinfo, misalign_in_elems, MINUS_EXPR);
}
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #7 from Hongtao.liu --- (In reply to rguent...@suse.de from comment #6) > On Mon, 16 Oct 2023, crazylht at gmail dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 > > > > --- Comment #5 from Hongtao.liu --- > > (In reply to Richard Biener from comment #3) > > > for (unsigned i = 0; i != skipn - 1; i++) > > > begin = wi::mul (begin, wi::to_wide (step_expr)); > > > > > > (gdb) p skipn > > > $5 = 4294967292 > > > > > > niters is 4294967292 in vect_update_ivs_after_vectorizer. Maybe the loop > > > should terminate when begin is zero. But I wonder why we pass in 'niters' > > Here, it want to calculate begin * pow (step_expr, skipn), yes we can just > > skip > > the loop when begin is 0. > > I mean terminate it when the multiplication overflowed to zero.

For pow (3, skipn), it will never overflow to zero. To solve this problem once and for all, I'm leaning towards setting a threshold in vect_can_peel_nonlinear_iv_p for vect_step_op_mul: if step_expr is not exact_log2 () and niter > TYPE_PRECISION (step_expr), we give up on vectorization.

> As for the MASK_ thing the skip is to be interpreted negative (we > should either not use a 'tree' here or make it have the correct type > maybe). Can we even handle this here? It would need to be > a division, no? > > So I think we need to disable non-linear IV or masked peeling for > niter/aligment? But I wonder how we run into this with plain -O3.

I think we already disabled a negative niters_skip in vect_can_peel_nonlinear_iv_p:

1416  /* Also doens't support peel for neg when niter is variable.
1417     ???  generate something like niter_expr & 1 ? init_expr : -init_expr?  */
1418  niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
1419  if ((niters_skip != NULL_TREE
1420       && TREE_CODE (niters_skip) != INTEGER_CST)
1421      || (!vect_use_loop_mask_for_alignment_p (loop_vinfo)
1422	  && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) < 0))
1423    {
1424      if (dump_enabled_p ())
1425	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
1426			 "Peeling for alignement is not supported"
1427			 " for nonlinear induction when niters_skip"
1428			 " is not constant.\n");
1429      return false;
1430    }
[Bug target/111829] Redundant register moves inside the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111829 --- Comment #4 from Hongtao.liu --- (In reply to Richard Biener from comment #2) > You sink the conversion, so it would be PRE on the reverse graph. The > transform doesn't really fit a particular pass I think. The conversions also need to be hoisted if the initial variable is not the constant v2di{0, 0}/v4si{0, 0, 0, 0}.
[Bug target/111829] Redundant register moves inside the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111829 --- Comment #3 from Hongtao.liu --- (In reply to Richard Biener from comment #2) > You sink the conversion, so it would be PRE on the reverse graph. The > transform doesn't really fit a particular pass I think. > > Why does the problem persist in RTL?

Normally, combine would eliminate the redundant move by combining the subreg into the pattern, like:

(insn 19 17 21 3 (set (subreg:V4SI (reg/v:V2DI 103 [ vsum ]) 0)
        (unspec:V4SI [
                (subreg:V4SI (reg/v:V2DI 103 [ vsum ]) 0)
                (reg:V4SI 123 [ MEM[(__m128i * {ref-all})_52] ])
                (reg:V4SI 124)
            ] UNSPEC_VPDPBUSD)) "test.c":9:16 discrim 1 9182 {vpdpbusd_v4si}

But in this case, before combine, cse1/fwprop propagate the subreg (insn 21) from the inner loop to the outside (insn 28); since there's still a use of (reg:V4SI 121), combine fails to eliminate the redundant move of the subreg.

--loop_begin--
...
(insn 19 18 20 3 (set (reg:V4SI 121)
        (unspec:V4SI [
                (reg:V4SI 122 [ vsum ])
                (reg:V4SI 123 [ MEM[(__m128i * {ref-all})_52] ])
                (reg:V4SI 124)
            ] UNSPEC_VPDPBUSD)) "test.c":9:16 discrim 1 9182 {vpdpbusd_v4si}
     (expr_list:REG_DEAD (reg:V4SI 125)
        (expr_list:REG_DEAD (reg:V4SI 123 [ MEM[(__m128i * {ref-all})_52] ])
            (expr_list:REG_DEAD (reg:V4SI 122 [ vsum ])
                (nil)))))
(insn 20 19 21 3 (set (reg:V4SI 102 [ _11 ])
        (reg:V4SI 121)) "test.c":9:16 discrim 1 1906 {movv4si_internal}
     (expr_list:REG_DEAD (reg:V4SI 121)
        (nil)))
(insn 21 20 22 3 (set (reg/v:V2DI 103 [ vsum ])
        (subreg:V2DI (reg:V4SI 121) 0)) "test.c":9:16 discrim 2 1909 {movv2di_internal}
     (nil))
...
--loop_end--

(note 27 26 28 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 28 27 29 4 (set (mem:V2DI (reg/v/f:DI 119 [ pc ]) [0 *pc_22(D)+0 S16 A128])
        (subreg:V2DI (reg:V4SI 121) 0)) "test.c":11:9 1909 {movv2di_internal}
     (expr_list:REG_DEAD (reg/v/f:DI 119 [ pc ])   <--- propagated from insn 21
        (expr_list:REG_DEAD (reg/v:V2DI 103 [ vsum ])
            (nil))))
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #5 from Hongtao.liu --- (In reply to Richard Biener from comment #3)
> for (unsigned i = 0; i != skipn - 1; i++)
>   begin = wi::mul (begin, wi::to_wide (step_expr));
>
> (gdb) p skipn
> $5 = 4294967292
>
> niters is 4294967292 in vect_update_ivs_after_vectorizer. Maybe the loop
> should terminate when begin is zero. But I wonder why we pass in 'niters'
Here it wants to compute begin * pow (step_expr, skipn); yes, we can just skip the loop when begin is 0, and also optimize the loop to a shift when step_expr is a power of 2. For other cases, the loop is still needed.
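The two shortcuts proposed above (early exit for begin == 0, and a shift when step_expr is a power of 2) can be sketched in plain C over wrapping 64-bit arithmetic. This mirrors the quoted loop, which multiplies skipn - 1 times; iv_init_pow is an illustrative name, not GCC's, and the sketch assumes skipn >= 1 as in the report.

```c
#include <stdint.h>

/* Compute begin * step^(skipn - 1) in wrapping unsigned arithmetic,
   with the two shortcuts discussed in the comment.  */
static uint64_t
iv_init_pow (uint64_t begin, uint64_t step, uint64_t skipn)
{
  if (begin == 0)
    return 0;                   /* proposed early exit: 0 * anything == 0 */

  if (step != 0 && (step & (step - 1)) == 0)
    {
      /* step is a power of 2: step^(skipn-1) is 1 << (log2(step) * (skipn-1)),
         and a total shift >= 64 wraps the result to 0 mod 2^64.  */
      uint64_t shift = (uint64_t) __builtin_ctzll (step) * (skipn - 1);
      return shift >= 64 ? 0 : begin << shift;
    }

  /* Fallback: the O(skipn) loop is still needed for other step values.  */
  for (uint64_t i = 0; i != skipn - 1; i++)
    begin *= step;
  return begin;
}
```

With begin == 0 and the huge skipn from the report, the function returns immediately instead of iterating ~4 billion times.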
[Bug tree-optimization/111820] [13/14 Regression] Compiler time hog in the vectorizer with `-O3 -fno-tree-vrp`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111820 --- Comment #4 from Hongtao.liu ---
> niters is 4294967292 in vect_update_ivs_after_vectorizer. Maybe the loop
> should terminate when begin is zero. But I wonder why we pass in 'niters'
> and then name it 'skip_niters' ...
It's coming from here:

9448   niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
9449   /* If we are using the loop mask to "peel" for alignment then we need
9450      to adjust the start value here.  */
9451   if (niters_skip != NULL_TREE)
9452     init_expr = vect_peel_nonlinear_iv_init (&stmts, init_expr, niters_skip,
9453                                              step_expr, induction_type);
9454
[Bug target/111829] Redudant register moves inside the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111829 --- Comment #1 from Hongtao.liu ---

  ivtmp.23_31 = (unsigned long) b_24(D);
  ivtmp.24_46 = (unsigned long) pa_26(D);
  _50 = ivtmp.23_31 + 40;

  [local count: 1063004408]:
  # vsum_35 = PHI
  # ivtmp.23_14 = PHI
  # ivtmp.24_30 = PHI
  _47 = (void *) ivtmp.23_14;
  _4 = MEM[(int *)_47];
  _25 = {_4, _4, _4, _4};
  _48 = (void *) ivtmp.24_30;
  _7 = MEM[(__m128i * {ref-all})_48];
  _8 = VIEW_CONVERT_EXPR<__v4si>(_7);
  _9 = VIEW_CONVERT_EXPR<__v4si>(vsum_35);
  _27 = __builtin_ia32_vpdpbusd_v4si (_9, _8, _25);
  vsum_28 = VIEW_CONVERT_EXPR<__m128i>(_27);
  ivtmp.23_15 = ivtmp.23_14 + 4;
  ivtmp.24_45 = ivtmp.24_30 + 16;
  if (ivtmp.23_15 != _50)
    goto ; [98.99%]
  else
    goto ; [1.01%]

  [local count: 10737416]:
  *pc_19(D) = vsum_28;
  ivtmp.15_34 = (unsigned long) &vsum.0;
  _13 = ivtmp.15_34 + 16;

  [local count: 42949663]:
  # ssum_38 = PHI
  # ivtmp.15_33 = PHI

I'm curious if we can move the VIEW_CONVERT_EXPR outside of the loop as below:

  [local count: 1063004408]:
- # vsum_35 = PHI
+ # _9 = PHI <_27(3), { 0, 0, 0, 0 }(2)>
  # ivtmp.23_14 = PHI
  # ivtmp.24_30 = PHI
  _47 = (void *) ivtmp.23_14;
  _4 = MEM[(int *)_47];
  _25 = {_4, _4, _4, _4};
  _48 = (void *) ivtmp.24_30;
  _7 = MEM[(__m128i * {ref-all})_48];
  _8 = VIEW_CONVERT_EXPR<__v4si>(_7);
- _9 = VIEW_CONVERT_EXPR<__v4si>(vsum_35);
  _27 = __builtin_ia32_vpdpbusd_v4si (_9, _8, _25);
- vsum_28 = VIEW_CONVERT_EXPR<__m128i>(_27);
  ivtmp.23_15 = ivtmp.23_14 + 4;
  ivtmp.24_45 = ivtmp.24_30 + 16;
  if (ivtmp.23_15 != _50)
    goto ; [98.99%]
  else
    goto ; [1.01%]

  [local count: 10737416]:
+ vsum_28 = VIEW_CONVERT_EXPR<__m128i>(_27);
  *pc_19(D) = vsum_28;
  ivtmp.15_34 = (unsigned long) &vsum.0;
  _13 = ivtmp.15_34 + 16;

  [local count: 42949663]:
  # ssum_38 = PHI
  # ivtmp.15_33 = PHI

It looks like a lazy-code-motion optimization, but it is currently not handled by PRE.
[Bug target/111829] New: Redudant register moves inside the loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111829 Bug ID: 111829 Summary: Redudant register moves inside the loop Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: --- Target: x86_64-*-* i?86-*-*

#include <immintrin.h>

int foo (__m128i* __restrict pa, int* b, __m128i* __restrict pc, int n)
{
    __m128i vsum = _mm_setzero_si128();
    for (int i = 0; i != 10; i++)
    {
        vsum = _mm_dpbusd_epi32 (vsum, pa[i], _mm_set1_epi32 (b[i]));
    }
    *pc = vsum;
    int ssum = 0;
    for (int i = 0; i != 4; i++)
        ssum += ((__v4si)vsum)[i];
    return ssum;
}

gcc -O2 -mavxvnni

foo(long long __vector(2)*, int*, long long __vector(2)*, int):
        leaq    40(%rsi), %rax
        vpxor   %xmm0, %xmm0, %xmm0
.L2:
        vmovdqa (%rdi), %xmm2
        vmovdqa %xmm0, %xmm1    --- redundant
        addq    $4, %rsi
        addq    $16, %rdi
        vpbroadcastd    -4(%rsi), %xmm3 {vex}
        vpdpbusd        %xmm3, %xmm2, %xmm1
        vmovdqa %xmm1, %xmm0    --- redundant
        cmpq    %rax, %rsi
        jne     .L2
        vmovdqa %xmm1, (%rdx)
        leaq    -24(%rsp), %rax
        leaq    -8(%rsp), %rcx
        xorl    %edx, %edx
.L3:
        vmovdqa %xmm0, -24(%rsp)
        addq    $4, %rax
        addl    -4(%rax), %edx
        cmpq    %rax, %rcx
        jne     .L3
        movl    %edx, %eax
        ret

it can be better with

foo(long long __vector(2)*, int*, long long __vector(2)*, int):
        leaq    40(%rsi), %rax
        vpxor   %xmm0, %xmm0, %xmm0
.L2:
        vmovdqa (%rdi), %xmm2
        addq    $4, %rsi
        addq    $16, %rdi
        vpbroadcastd    -4(%rsi), %xmm3 {vex}
        vpdpbusd        %xmm3, %xmm2, %xmm0
        cmpq    %rax, %rsi
        jne     .L2
        vmovdqa %xmm0, (%rdx)
        leaq    -24(%rsp), %rax
        leaq    -8(%rsp), %rcx
        xorl    %edx, %edx
.L3:
        vmovdqa %xmm0, -24(%rsp)
        addq    $4, %rax
        addl    -4(%rax), %edx
        cmpq    %rax, %rcx
        jne     .L3
        movl    %edx, %eax
        ret
[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768 --- Comment #10 from Hongtao.liu ---
> indeed (but I believe it did happen with Alder Lake already, by accident,
> with AVX512 on P-cores but not on E-cores).
AVX512 is physically fused off for Alder Lake P-cores; P-cores and E-cores share the same ISA level (AVX2).
[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768 --- Comment #4 from Hongtao.liu --- I checked Alder Lake's L1 cache size and it is indeed 48, while the L1 cache size in alderlake_cost is set to 32. But then again, many different platforms share the same cost table and may have different L1 cache sizes; from a micro-architecture tuning point of view, it doesn't make a difference. A separate cost table just because the L1 cache size differs is quite unnecessary (the size itself is just a parameter for software prefetching; it doesn't have to be the real hardware cache size).
[Bug target/111745] [14 Regression] ICE: in extract_insn, at recog.cc:2791 (unrecognizable insn) with -ffloat-store -mavx512fp16 -mavx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111745 --- Comment #3 from Hongtao.liu --- Fixed.
[Bug target/104610] memcmp () == 0 can be optimized better for avx512f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104610 --- Comment #22 from Hongtao.liu --- For 64-byte memory comparison

int compare (const char* s1, const char* s2)
{
    return __builtin_memcmp (s1, s2, 64) == 0;
}

We're generating

        vmovdqu (%rsi), %ymm0
        vpxorq  (%rdi), %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        jne     .L2
        vmovdqu 32(%rsi), %ymm0
        vpxorq  32(%rdi), %ymm0, %ymm0
        vptest  %ymm0, %ymm0
        je      .L5
.L2:
        movl    $1, %eax
        xorl    $1, %eax
        vzeroupper
        ret

An alternative way is using vpcmpeq + kortest and checking the carry bit:

        vmovdqu64       (%rsi), %zmm0
        xorl    %eax, %eax
        vpcmpeqd        (%rdi), %zmm0, %k0
        kortestw        %k0, %k0
        setc    %al
        vzeroupper

Not sure if it's better or not.
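Both instruction sequences implement the same predicate, "are these 64 bytes identical". A portable C model of the vpxor+vptest strategy makes the semantics explicit (equal64 is an illustrative name, not from GCC or glibc):

```c
#include <stdint.h>
#include <string.h>

/* Model of the vpxor+vptest sequence: OR together the XOR of all
   64 bytes; the accumulator is zero iff the two blocks are equal.  */
static int
equal64 (const char *s1, const char *s2)
{
  uint64_t acc = 0;
  for (int i = 0; i < 64; i += 8)
    {
      uint64_t a, b;
      memcpy (&a, s1 + i, 8);   /* alias-safe unaligned loads */
      memcpy (&b, s2 + i, 8);
      acc |= a ^ b;
    }
  return acc == 0;
}
```

The vpcmpeq+kortest variant instead builds a per-element equality mask and checks that every mask bit is set; kortest raising the carry flag for an all-ones mask plays the role of the `acc == 0` test here.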
[Bug target/111745] [14 Regression] ICE: in extract_insn, at recog.cc:2791 (unrecognizable insn) with -ffloat-store -mavx512fp16 -mavx512vl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111745 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #1 from Hongtao.liu --- Mine, I'll take a look.
[Bug libgcc/111731] [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731 --- Comment #2 from Hongtao.liu --- The original project is too complex for me to come up with a reduced reproducer; I can help with gdb if additional information is needed.
[Bug libgcc/111731] [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731 --- Comment #1 from Hongtao.liu --- GCC 11.3 is OK; GCC 13.2 and later have the issue. I didn't verify GCC 12.
[Bug libgcc/111731] New: [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731 Bug ID: 111731 Summary: [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291 Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libgcc Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

The issue is not solved by PR110956's fix. I did some debugging with gdb, and here are the logs:

The first time gdb stops at https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-fde.c#L143

│ 138      ob->next = unseen_objects;
│ 139      unseen_objects = ob;
│ 140
│ 141      __gthread_mutex_unlock (&object_mutex);
│ 142      #endif
│ >143   }

(gdb) frame
#0  __register_frame_info_bases (begin=0x7fffd551e000, ob=0x1e386d0, tbase=0x0, dbase=0x0)
    at ../../../libgcc/unwind-dw2-fde.c:143
(gdb) p registered_frames->root->entry_count
$31 = 2
(gdb) p registered_frames->root->content.entries[0]
$32 = {base = 140736772300800, size = 1, ob = 0x1e386d0}
(gdb) p registered_frames->root->content.entries[1]
$33 = {base = 140736772317184, size = 178483158, ob = 0x1e386d0}

The second time gdb stops at https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-fde.c#L143

│ 138      ob->next = unseen_objects;
│ 139      unseen_objects = ob;
│ 140
│ 141      __gthread_mutex_unlock (&object_mutex);
│ 142      #endif
│ >143   }

(gdb) frame
#0  __register_frame_info_bases (begin=0x7fffd409c000, ob=0x26b2e00, tbase=0x0, dbase=0x0)
    at ../../../libgcc/unwind-dw2-fde.c:143
(gdb) p registered_frames->root->entry_count
$34 = 4
(gdb) p registered_frames->root->content.entries[0]
$35 = {base = 140736750796800, size = 1, ob = 0x26b2e00}
(gdb) p registered_frames->root->content.entries[1]
$36 = {base = 140736750817280, size = 199987168, ob = 0x26b2e00}
(gdb) p registered_frames->root->content.entries[2]
$37 = {base = 140736772300800, size = 1, ob = 0x1e386d0}
(gdb) p registered_frames->root->content.entries[3]
$38 = {base = 140736772317184, size = 178483158, ob = 0x1e386d0}

The first time gdb stops
at an unexpected line, https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-btree.h#L829:

│ 825      unsigned slot = btree_node_find_leaf_slot (iter, base);
│ 826      if ((slot >= iter->entry_count) || (iter->content.entries[slot].base != base))
│ 827        {
│ 828          // Not found, this should never happen.
│ >829         btree_node_unlock_exclusive (iter);
│ 830          return NULL;
│ 831        }

(gdb) p slot
$26 = 1
(gdb) p iter->content.entries[slot]
$27 = {base = 140736750817280, size = 199987168, ob = 0x26e7900}
(gdb) p iter->content.entries[2]
$28 = {base = 140736772300800, size = 1, ob = 0x1e386d0}

We can see that when we try to remove the btree node of 0x7fffd551e000 (140736772300800), the return value of btree_node_find_leaf_slot is 1, but I think it should return 2. Both btree_insert and btree_remove call

// Find the position for a slot in a leaf node.
static unsigned
btree_node_find_leaf_slot (const struct btree_node *n, uintptr_type value)
{
  for (unsigned index = 0, ec = n->entry_count; index != ec; ++index)
    if (n->content.entries[index].base + n->content.entries[index].size > value)
      return index;
  return n->entry_count;
}

But

registered_frames->root->content.entries[1].base + registered_frames->root->content.entries[1].size > registered_frames->root->content.entries[2].base
registered_frames->root->content.entries[2].base + registered_frames->root->content.entries[2].size > registered_frames->root->content.entries[1].base

i.e. the two entries overlap, and that makes btree_node_find_leaf_slot return the wrong slot (at btree_insert it returns slot 1 for base1 and moves base2 to slot 2, but at btree_remove it still returns slot 1 because of the above logic). I'm not sure if this is the root cause.
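The mis-lookup described above can be reproduced in isolation with the exact entry values from the gdb log. This sketch mirrors the search logic of libgcc's btree_node_find_leaf_slot (struct and function names simplified, uint64_t instead of uintptr_type):

```c
#include <stdint.h>

struct leaf_entry { uint64_t base, size; };

/* Same search as btree_node_find_leaf_slot: return the first entry whose
   range end lies beyond VALUE.  With overlapping entries this can return
   an earlier slot than the one whose base equals VALUE.  */
static unsigned
find_leaf_slot (const struct leaf_entry *entries, unsigned entry_count,
                uint64_t value)
{
  for (unsigned index = 0; index != entry_count; ++index)
    if (entries[index].base + entries[index].size > value)
      return index;
  return entry_count;
}
```

With the logged entries, looking up the base stored in slot 2 returns slot 1, because slot 1's (base + size) already covers it; the remove path then sees entries[1].base != base and hits the "should never happen" branch.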
[Bug tree-optimization/111402] Loop distribution fail to optimize memmove for multiple consecutive moves within a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111402 --- Comment #2 from Hongtao.liu --- Adjusting the code in foo1 to use < n instead of != n, the issue remains.

void foo1 (v4di* __restrict a, v4di *b, int n)
{
    for (int i = 0; i < n; i+=2)
    {
        a[i] = b[i];
        a[i+1] = b[i+1];
    }
}
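For reference, what loop distribution would ideally emit for foo1 is a single memmove over the whole range, just as it already does for foo. A sketch (foo1_distributed is an illustrative name; it assumes n is non-negative and even, so the i and i+1 stores together copy exactly n elements):

```c
#include <string.h>

typedef long long v4di __attribute__((vector_size(32)));

/* Equivalent of foo1's loop as one library call (for even n >= 0);
   memmove is correct even for overlapping a and b.  */
void
foo1_distributed (v4di *__restrict a, v4di *b, int n)
{
  if (n > 0)
    memmove (a, b, (size_t) n * sizeof (v4di));
}
```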
[Bug middle-end/111402] New: Loop distribution fail to optimize memmove for multiple consecutive moves within a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111402 Bug ID: 111402 Summary: Loop distribution fail to optimize memmove for multiple consecutive moves within a loop Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

cat test.c

typedef long long v4di __attribute__((vector_size(32)));

void foo (v4di* __restrict a, v4di *b, int n)
{
    for (int i = 0; i != n; i++)
        a[i] = b[i];
}

void foo1 (v4di* __restrict a, v4di *b, int n)
{
    for (int i = 0; i != n; i+=2)
    {
        a[i] = b[i];
        a[i+1] = b[i+1];
    }
}

gcc -O2 -S test.c

GCC can optimize the loop in foo to memmove, but not the loop in foo1. This is from PR111354.
[Bug target/111354] [7/10/12 regression] The instructions of the DPDK demo program are different and run time increases.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111354 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #5 from Hongtao.liu ---

void
rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
    __m256i ymm0, ymm1, ymm2, ymm3;

    while (n >= 128) {
        ymm0 = _mm256_loadu_si256((const __m256i *)(const void *)
                                  ((const uint8_t *)src + 0 * 32));
        n -= 128;
        ymm1 = _mm256_loadu_si256((const __m256i *)(const void *)
                                  ((const uint8_t *)src + 1 * 32));
        ymm2 = _mm256_loadu_si256((const __m256i *)(const void *)
                                  ((const uint8_t *)src + 2 * 32));
        ymm3 = _mm256_loadu_si256((const __m256i *)(const void *)
                                  ((const uint8_t *)src + 3 * 32));
        src = (const uint8_t *)src + 128;
        _mm256_storeu_si256((__m256i *)(void *)((uint8_t *)dst + 0 * 32), ymm0);
        _mm256_storeu_si256((__m256i *)(void *)((uint8_t *)dst + 1 * 32), ymm1);
        _mm256_storeu_si256((__m256i *)(void *)((uint8_t *)dst + 2 * 32), ymm2);
        _mm256_storeu_si256((__m256i *)(void *)((uint8_t *)dst + 3 * 32), ymm3);
        dst = (uint8_t *)dst + 128;
    }
}

I'm curious whether we can distribute the above as a memmove (of course, the compiler needs to know the two arrays don't alias each other).
[Bug target/111306] [12,13] macro-fusion makes error on conjugate complex multiplication fp16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111306 --- Comment #8 from Hongtao.liu --- Fixed in GCC14.1 GCC13.3 GCC12.4
[Bug target/111335] fmaddpch seems not commutative for operands[1] and operands[2] due to precision loss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111335 Hongtao.liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #4 from Hongtao.liu --- Fixed in GCC14.1 GCC13.3 GCC12.4
[Bug target/111306] [12,13] macro-fusion makes error on conjugate complex multiplication fp16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111306 --- Comment #4 from Hongtao.liu --- A related PR111335 for fmaddcph; similar but not the same. PR111335 is due to a precision difference in complex _Float16 fma: fmaddcph a, b, c is not equal to fmaddcph b, a, c.
[Bug target/111335] New: fmaddpch seems not commutative for operands[1] and operands[2] due to precision loss
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111335 Bug ID: 111335 Summary: fmaddpch seems not commutative for operands[1] and operands[2] due to precision loss Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

fmaddcph is complex _Float16 fma.

cat test.c

#include <immintrin.h>
#include <stdio.h>

void func(_Float16 a[], _Float16 b[], _Float16 c[]) {
    const __m128h r0 = _mm_loadu_ph(a);
    const __m128h r1 = _mm_loadu_ph(b);
    const __m128h r2 = _mm_loadu_ph(c);
    const __m128h mul = _mm_fmadd_pch(r0, r1, r2);
    printf("%f %f\n", (float)mul[0], (float)mul[1]);
}

int main() {
    _Float16 a[8] = {-0.7949218f16, +0.2739257f16};
    _Float16 b[8] = {+0.0010070f16, +0.0015659f16};
    _Float16 c[8] = {-0.0010366f16, -0.0018014f16};
    func(a, b, c);
    return 0;
}

g++ -O0 -march=sapphirerapids test.c, we get fmaddpch a, b, c, and the result is

-0.002266 -0.002769

g++ -O0 -march=sapphirerapids test.c, we get fmaddpch b, a, c, and the result is

-0.002266 -0.002771
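The sensitivity above comes from where intermediate rounding happens inside the complex multiply-add. The same effect can be shown in plain float, without AVX512-FP16 hardware: keeping a product exact before the final add (as a fused operation does) versus rounding the product first gives different answers. sum_fused/sum_unfused are illustrative names; this models rounding-order sensitivity in general, not vfmaddcph's exact dataflow.

```c
/* Model a fused operation: the product a*a is kept exact (double holds
   the float product exactly) and only the final sum is rounded.  */
static float
sum_fused (float a, float c)
{
  return (float) ((double) a * a + (double) c);
}

/* Unfused: the product is rounded to float before the add.  */
static float
sum_unfused (float a, float c)
{
  float p = a * a;
  return p + c;
}
```

With a = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24; the unfused path rounds it to 1 + 2^-11 (ties-to-even) and sums to exactly 0, while the fused path yields 2^-24.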
[Bug target/111306] [12,13] macro-fusion makes error on conjugate complex multiplication fp16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111306 --- Comment #3 from Hongtao.liu --- A patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629650.html
[Bug target/111333] Runtime failure for fcmulcph instrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111333 --- Comment #2 from Hongtao.liu --- The test has failed since GCC 12, when the pattern was added.
[Bug target/111333] Runtime failure for fcmulcph instrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111333 --- Comment #1 from Hongtao.liu --- fmulcph/fmaddcph is commutative for operands[1] and operands[2], but fcmulcph/fcmaddcph is not, since these are complex conjugate operations. The change below fixes the issue.

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 6d3ae8dea0c..833546c5228 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -6480,6 +6480,14 @@ (define_int_attr complexpairopname
  [(UNSPEC_COMPLEX_FMA_PAIR "fmaddc")
   (UNSPEC_COMPLEX_FCMA_PAIR "fcmaddc")])

+(define_int_attr int_comm
+ [(UNSPEC_COMPLEX_FMA "%")
+  (UNSPEC_COMPLEX_FMA_PAIR "%")
+  (UNSPEC_COMPLEX_FCMA "")
+  (UNSPEC_COMPLEX_FCMA_PAIR "")
+  (UNSPEC_COMPLEX_FMUL "%")
+  (UNSPEC_COMPLEX_FCMUL "")])
+
 (define_int_attr conj_op
  [(UNSPEC_COMPLEX_FMA "")
   (UNSPEC_COMPLEX_FCMA "_conj")
@@ -6593,7 +6601,7 @@ (define_expand "cmla4"
 (define_insn "fma__"
   [(set (match_operand:VHF_AVX512VL 0 "register_operand" "=&v")
        (unspec:VHF_AVX512VL
-         [(match_operand:VHF_AVX512VL 1 "" "%v")
+         [(match_operand:VHF_AVX512VL 1 "" "v")
           (match_operand:VHF_AVX512VL 2 "" "")
           (match_operand:VHF_AVX512VL 3 "" "0")]
          UNSPEC_COMPLEX_F_C_MA))]
@@ -6658,7 +,7 @@ (define_insn_and_split "fma___fma_zero"
 (define_insn "fma___pair"
  [(set (match_operand:VF1_AVX512VL 0 "register_operand" "=&v")
        (unspec:VF1_AVX512VL
-         [(match_operand:VF1_AVX512VL 1 "vector_operand" "%v")
+         [(match_operand:VF1_AVX512VL 1 "vector_operand" "v")
           (match_operand:VF1_AVX512VL 2 "bcst_vector_operand" "vmBr")
           (match_operand:VF1_AVX512VL 3 "vector_operand" "0")]
          UNSPEC_COMPLEX_F_C_MA_PAIR))]
@@ -6727,7 +6735,7 @@ (define_insn "___mask"
   [(set (match_operand:VHF_AVX512VL 0 "register_operand" "=&v")
        (vec_merge:VHF_AVX512VL
          (unspec:VHF_AVX512VL
-           [(match_operand:VHF_AVX512VL 1 "nonimmediate_operand" "%v")
+           [(match_operand:VHF_AVX512VL 1 "nonimmediate_operand" "v")
            (match_operand:VHF_AVX512VL 2 "nonimmediate_operand" "")
            (match_operand:VHF_AVX512VL 3 "register_operand" "0")]
           UNSPEC_COMPLEX_F_C_MA)
@@ -6752,7 +6760,7 @@ (define_expand "cmul3"
 (define_insn "__"
   [(set (match_operand:VHF_AVX512VL 0 "register_operand" "=&v")
        (unspec:VHF_AVX512VL
-         [(match_operand:VHF_AVX512VL 1 "nonimmediate_operand" "%v")
+         [(match_operand:VHF_AVX512VL 1 "nonimmediate_operand" "v")
           (match_operand:VHF_AVX512VL 2 "nonimmediate_operand" "")]
          UNSPEC_COMPLEX_F_C_MUL))]
  "TARGET_AVX512FP16 && "
[Bug target/111333] New: Runtime failure for fcmulcph instrinsic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111333 Bug ID: 111333 Summary: Runtime failure for fcmulcph instrinsic Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: crazylht at gmail dot com Target Milestone: ---

cat main.cpp

#include <immintrin.h>
#include <stdio.h>

__attribute__((optimize("O0")))
auto func0(_Float16 *a, _Float16 *b, int n, _Float16 *c) {
    __m512h rA = _mm512_loadu_ph(a);
    for (int i = 0; i < n; i += 32) {
        __m512h rB = _mm512_loadu_ph(b + i);
        _mm512_storeu_ph(c + i, _mm512_fcmul_pch(rB, rA));
    }
}

__attribute__((optimize("O2")))
auto func1(_Float16 *a, _Float16 *b, int n, _Float16 *c) {
    __m512h rA = _mm512_loadu_ph(a);
    for (int i = 0; i < n; i += 32) {
        __m512h rB = _mm512_loadu_ph(b + i);
        _mm512_storeu_ph(c + i, _mm512_fcmul_pch(rB, rA));
    }
}

int main() {
    int n = 32;
    _Float16 a[n], b[n], c[n];
    for (int i = 1; i <= n; i++) {
        a[i - 1] = i & 1 ? -i : i;
        b[i - 1] = i;
    }
    printf("a = %f + %fi \n", (float)a[0], (float)a[1]);
    printf("b = %f + %fi \n", (float)b[0], (float)b[1]);
    printf("b * conj(a) = %f + %fi \n\n",
           (float)(a[0]*b[0] + a[1]*b[1]),
           (float)(a[0]*b[1] - a[1]*b[0]));

    func0(a, b, n, c);
    for (int i = 0; i < n / 32 * 2; i++) {
        printf("%f ", (float)c[i]);
    }
    printf("\n");

    func1(a, b, n, c);
    for (int i = 0; i < n / 32 * 2; i++) {
        printf("%f ", (float)c[i]);
    }
    printf("\n");
    return 0;
}

g++ -march=sapphirerapids main.cpp -o test

sde -spr -- ./test
a = -1.00 + 2.00i
b = 1.00 + 2.00i
b * conj(a) = 3.00 + -4.00i

3.00 -4.00
3.00 4.00
[Bug target/111225] ICE in curr_insn_transform, unable to generate reloads for xor, since r14-2447-g13c556d6ae84be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111225 --- Comment #2 from Hongtao.liu --- (In reply to Hongtao.liu from comment #1)
> So reload thought CT_SPECIAL_MEMORY always wins for spilled_pseudo_p, but
> here Br should be a vec_dup:mem, which doesn't match spilled_pseudo_p.
>
> case CT_SPECIAL_MEMORY:
>   if (satisfies_memory_constraint_p (op, cn))
>     win = true;
>   else if (spilled_pseudo_p (op))
>     win = true;
>   break;
The vmBr constraint is ok as long as m is matched before Br, but here m is invalid, which exposed the problem. The backend workaround is disabling Br when m is not available; alternatively, the middle-end fix would be removing the win for spilled_pseudo_p (op) in CT_SPECIAL_MEMORY.
[Bug target/111225] ICE in curr_insn_transform, unable to generate reloads for xor, since r14-2447-g13c556d6ae84be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111225 --- Comment #1 from Hongtao.liu --- So reload thought CT_SPECIAL_MEMORY always wins for spilled_pseudo_p, but here Br should be a vec_dup:mem, which doesn't match spilled_pseudo_p.

case CT_SPECIAL_MEMORY:
  if (satisfies_memory_constraint_p (op, cn))
    win = true;
  else if (spilled_pseudo_p (op))
    win = true;
  break;
[Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064 --- Comment #6 from Hongtao.liu ---
> [liuhongt@intel gather_emulation]$ ./gather.out ;./nogather_xmm.out;./nogather_ymm.out
> elapsed time: 1.75997 seconds for gather with 3000 iterations
> elapsed time: 2.42473 seconds for no_gather_xmm with 3000 iterations
> elapsed time: 1.86436 seconds for no_gather_ymm with 3000 iterations
For 510.parest_r, enabling gather emulation for ymm brings back 3% performance, still not as good as the gather instruction due to being throughput bound.
[Bug target/111119] maskload and maskstore for integer modes are oddly conditional on AVX2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111119 --- Comment #5 from Hongtao.liu --- Fixed in GCC14.
[Bug middle-end/111152] ~7-9% performance regression on 510.parest_r SPEC 2017 benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111152 --- Comment #2 from Hongtao.liu ---
> With Zen3 -O2 generic lto pgo the regression is less noticeable (only 4%)
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=694.457.0
Not sure about this part.
[Bug middle-end/111152] ~7-9% performance regression on 510.parest_r SPEC 2017 benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111152 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #1 from Hongtao.liu --- It's PR111064
[Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064 --- Comment #4 from Hongtao.liu --- The loop is like

double
foo (double* a, unsigned* b, double* c, int n)
{
    double sum = 0;
    for (int i = 0; i != n; i++)
    {
        sum += a[i] * c[b[i]];
    }
    return sum;
}

After disabling gather, it uses scalar gather emulation, and the cost model is only profitable for xmm, not ymm, which causes the regression. When manually adding -fno-vect-cost-model, the regression is almost gone.

microbenchmark data:

[liuhongt@intel gather_emulation]$ ./gather.out ;./nogather_xmm.out;./nogather_ymm.out
elapsed time: 1.75997 seconds for gather with 3000 iterations
elapsed time: 2.42473 seconds for no_gather_xmm with 3000 iterations
elapsed time: 1.86436 seconds for no_gather_ymm with 3000 iterations

And I looked at the cost model:

299 _13 + sum_24 1 times scalar_to_vec costs 4 in prologue
300 _13 + sum_24 1 times vector_stmt costs 16 in epilogue
301 _13 + sum_24 1 times vec_to_scalar costs 4 in epilogue
302 _13 + sum_24 2 times vector_stmt costs 32 in body
303 *_3 1 times unaligned_load (misalign -1) costs 16 in body
304 *_3 1 times unaligned_load (misalign -1) costs 16 in body
305 *_7 1 times unaligned_load (misalign -1) costs 16 in body
306 (long unsigned int) _8 2 times vec_promote_demote costs 8 in body
307 *_11 4 times vec_to_scalar costs 80 in body
308 *_11 4 times scalar_load costs 64 in body
309 *_11 1 times vec_construct costs 120 in body
310 *_11 4 times vec_to_scalar costs 80 in body
311 *_11 4 times scalar_load costs 64 in body
312 *_11 1 times vec_construct costs 120 in body
313 _4 * _12 2 times vector_stmt costs 32 in body
314 test.c:6:21: note: operating on full vectors.
315 test.c:6:21: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
316 *_3 4 times scalar_load costs 64 in epilogue
317 *_7 4 times scalar_load costs 48 in epilogue
318 (long unsigned int) _8 4 times scalar_stmt costs 16 in epilogue
319 *_11 4 times scalar_load costs 64 in epilogue
320 _4 * _12 4 times scalar_stmt costs 64 in epilogue
321 _13 + sum_24 4 times scalar_stmt costs 64 in epilogue
322 1 times cond_branch_taken costs 12 in epilogue
323 test.c:6:21: note: Cost model analysis:
324 Vector inside of loop cost: 648
325 Vector prologue cost: 4
326 Vector epilogue cost: 352
327 Scalar iteration cost: 80
328 Scalar outside cost: 24
329 Vector outside cost: 356
330 prologue iterations: 0
331 epilogue iterations: 4
332 test.c:6:21: missed: cost model: the vector iteration cost = 648 divided by the scalar iteration cost = 80 is greater or equal to the vectorization factor = 8.

For the gather emulation part, it tries to generate the following:

2734   [local count: 83964060]:
2735   bnd.23_154 = niters.22_130 >> 2;
2736   _165 = (sizetype) _65;
2737   _166 = _165 * 8;
2738   vectp_a.28_164 = a_18(D) + _166;
2739   _174 = _165 * 4;
2740   vectp_b.32_172 = b_19(D) + _174;
2741   _180 = (sizetype) c_20(D);
2742   vect__33.29_169 = MEM [(double *)vectp_a.28_164];
2743   vectp_a.27_170 = vectp_a.28_164 + 16;
2744   vect__33.30_171 = MEM [(double *)vectp_a.27_170];
2745   vect__30.33_177 = MEM [(unsigned int *)vectp_b.32_172];
2746   vect__29.34_178 = [vec_unpack_lo_expr] vect__30.33_177;
2747   vect__29.34_179 = [vec_unpack_hi_expr] vect__30.33_177;
2748   _181 = BIT_FIELD_REF ;
2749   _182 = _181 * 8;
2750   _183 = _180 + _182;
2751   _184 = (void *) _183;
2752   _185 = MEM[(double *)_184];
2753   _186 = BIT_FIELD_REF ;
2754   _187 = _186 * 8;
2755   _188 = _180 + _187;
2756   _189 = (void *) _188;
2757   _190 = MEM[(double *)_189];
2758   vect__23.35_191 = {_185, _190};
2759   _192 = BIT_FIELD_REF ;
2760   _193 = _192 * 8;
2761   _194 = _180 + _193;
2762   _195 = (void *) _194;
2763   _196 = MEM[(double *)_195];
2764   _197 = BIT_FIELD_REF ;
2765   _198 = _197 * 8;
2766   _199 = _180 + _198;
2767   _200 = (void *) _199;
2768   _201 = MEM[(double *)_200];
2769   vect__23.36_202 = {_196, _201};
2770   vect__15.37_203 = vect__33.29_169 * vect__23.35_191;
2771   vect__15.37_204 = vect__33.30_171 * vect__23.36_202;
2772   vect_sum_14.38_205 = _162 + vect__15.37_203;
2773   vect_sum_14.38_206 = vect__15.37_204 + vect_sum_14.38_205;
2774   _208 = .REDUC_PLUS (vect_sum_14.38_206);
2775   niters_vector_mult_vf.24_155 = bnd.23_154 << 2;
2776   _157 = (int) niters_vector_mult_vf.24_155;
2777   tmp.25_156 = i_60 + _157;
2778   if (niters.22_130 == niters_vector_mult_vf.24_155)

So there is 1 unaligned_load for the index vector (cost 16), 2 vec_promote_demote (cost 8), and 8 vec_to_scalar (cost 160) to extract each index element. But why do we need that? It's just 8 scalar_loads (cost 128) for the indices; there is no need to load them as a vector and then do vec_promote_demote + vec_to_scalar. If we calculated the cost correctly, the total cost would be 595 < 640 (scalar iteration cost 80 * VF 8), and ymm gather emulation would still be profitable.
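The cheaper sequence argued for above, reading each index as a scalar instead of loading an index vector and extracting lanes, can be sketched in plain C for one emulated-gather step (gather_step is an illustrative name, not GCC's internal representation; VF = 8 doubles as in the ymm case):

```c
/* One vector iteration of sum += a[i] * c[b[i]] with emulated gather:
   each index b[i + lane] is a plain scalar load, so there is no index
   vector load and no vec_promote_demote/vec_to_scalar round trip.  */
static void
gather_step (const double *a, const unsigned *b, const double *c,
             int i, double out[8])
{
  for (int lane = 0; lane < 8; lane++)
    out[lane] = a[i + lane] * c[b[i + lane]];
}
```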
[Bug target/111119] maskload and maskstore for integer modes are oddly conditional on AVX2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111119 --- Comment #3 from Hongtao.liu ---
> I see, we can add an alternative like "noavx2,avx2" to generate
> vmaskmovps/pd when avx2 is not available for integer.
It's better to change the assembly output as:

27423   if (TARGET_AVX2)
27424     return "vmaskmov\t{%1, %2, %0|%0, %2, %1}";
27425   else
27426     return "vmaskmov\t{%1, %2, %0|%0, %2, %1}";

No need to add an alternative.
[Bug target/111119] maskload and maskstore for integer modes are oddly conditional on AVX2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111119 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #2 from Hongtao.liu --- (In reply to Richard Biener from comment #0)
> We have
>
> (define_expand "maskload"
>   [(set (match_operand:V48_AVX2 0 "register_operand")
>         (unspec:V48_AVX2
>           [(match_operand: 2 "register_operand")
>            (match_operand:V48_AVX2 1 "memory_operand")]
>           UNSPEC_MASKMOV))]
>   "TARGET_AVX")
>
> and
>
> (define_mode_iterator V48_AVX2
>   [V4SF V2DF
>    V8SF V4DF
>    (V4SI "TARGET_AVX2") (V2DI "TARGET_AVX2")
>    (V8SI "TARGET_AVX2") (V4DI "TARGET_AVX2")])
>
> so for example maskloadv4siv4si is disabled with just -mavx while the actual
> instruction can operate just fine on SImode sized data by pretending its
> SFmode.
>
> check_effective_target_vect_masked_load is conditional on AVX, not AVX2.
>
> With just AVX we can still use SSE2 vectorization for integer operations
> using masked loads/stores from AVX.
I see, we can add an alternative like "noavx2,avx2" to generate vmaskmovps/pd when avx2 is not available for integer.
[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866 --- Comment #8 from Hongtao.liu --- (In reply to Uroš Bizjak from comment #7)
> (In reply to Hongtao.liu from comment #6)
> > > So, the compiler still expects vec_concat/vec_select patterns to be
> > > present.
> >
> > v2df foo_v2df (v2df x)
> > {
> >   return __builtin_shuffle (x, (v2df) { 0, 0 }, (v2di) { 0, 2 });
> > }
> >
> > The testcase is not a typical vec_merge case; for vec_merge, the shuffle
> > index should be {0, 3}. Here it happened to be a vec_merge because the
> > second vector is all zero. And yes, for this case we still need the
> > vec_concat:vec_select pattern.
>
> I guess the original patch is the way to go then.
Yes.
[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866 --- Comment #6 from Hongtao.liu --- (In reply to Uroš Bizjak from comment #4)
> (In reply to Hongtao.liu from comment #3)
> > In the x86 backend's expand_vec_perm_1, we always try vec_merge first for
> > !one_operand_p; expand_vselect_vconcat is only tried when vec_merge fails,
> > which means we'd better use vec_merge instead of vec_select:vec_concat
> > when available in our backend pattern match.
>
> In fact, I tried to convert existing sse2_movq128 patterns to vec_merge, but
> the patch regressed:
>
> -FAIL: gcc.target/i386/sse2-pr94680-2.c scan-assembler movq
> -FAIL: gcc.target/i386/sse2-pr94680-2.c scan-assembler-not pxor
> -FAIL: gcc.target/i386/sse2-pr94680.c scan-assembler-not pxor
> -FAIL: gcc.target/i386/sse2-pr94680.c scan-assembler-times
> (?n)(?:mov|psrldq).*%xmm[0-9] 12
>
> So, the compiler still expects vec_concat/vec_select patterns to be present.

v2df foo_v2df (v2df x)
{
  return __builtin_shuffle (x, (v2df) { 0, 0 }, (v2di) { 0, 2 });
}

The testcase is not a typical vec_merge case; for vec_merge, the shuffle index should be {0, 3}. Here it happened to be a vec_merge because the second vector is all zero. And yes, for this case we still need the vec_concat:vec_select pattern.
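The quoted testcase can be made self-contained to see the shuffle semantics: index 0 selects lane 0 of x, and index 2 selects lane 0 of the concatenated all-zero second vector, i.e. exactly the "insert 0 into the high lane" pattern that should become a plain movq.

```c
typedef double v2df __attribute__((vector_size(16)));
typedef long long v2di __attribute__((vector_size(16)));

/* Indices run over the concatenation {x[0], x[1], 0, 0}: lane 0 takes
   element 0 (from x), lane 1 takes element 2 (a zero).  */
v2df
foo_v2df (v2df x)
{
  return __builtin_shuffle (x, (v2df) { 0, 0 }, (v2di) { 0, 2 });
}
```

Because lane 1 comes from the zero vector, the result equals a vec_merge of x with zero, even though the index {0, 2} is not the canonical vec_merge index {0, 3}.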
[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #3 from Hongtao.liu --- In the x86 backend expand_vec_perm_1, we always try vec_merge first for !one_operand_p; expand_vselect_vconcat is only tried when vec_merge fails, which means we'd better use vec_merge instead of vec_select:vec_concat when it is available as a backend pattern. Also, from the point of view of AVX512 kmask instructions, using vec_merge will help constant propagation.

20107   /* Try the SSE4.1 blend variable merge instructions.  */
20108   if (expand_vec_perm_blend (d))
20109     return true;
20110
20111   /* Try movss/movsd instructions.  */
20112   if (expand_vec_perm_movs (d))
20113     return true;
[Bug target/111064] 5-10% regression of parest on icelake between g:d073e2d75d9ed492de9a8dc6970e5b69fae20e5a (Aug 15 2023) and g:9ade70bb86c8744f4416a48bb69cf4705f00905a (Aug 16)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111064 --- Comment #3 from Hongtao.liu --- I didn't find any regression when testing the patch. I guess that's because my tester does a full-copy run and the options are -march=native -Ofast -flto -funroll-loops. Let me verify it.
[Bug target/111062] ICE: in final_scan_insn_1, at final.cc:2808 could not split insn {*andndi_1} with -O -mavx10.1-256 -mavx512bw -mno-avx512f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111062 --- Comment #1 from Hongtao.liu --- (In reply to Zdenek Sojka from comment #0)
> Created attachment 55755 [details]
> reduced testcase
>
> Compiler output:
> $ x86_64-pc-linux-gnu-gcc -O -mavx10.1-256 -mavx512bw -mno-avx512f testcase.c
> cc1: warning:
> '-mno-avx512{f,vl,bw,dq,cd,bf16,fp16,vbmi,vbmi2,vnni,ifma,bitalg,vpopcntdq}'
> are ignored with '-mavx10.1' and above

The warning message can be a little confusing. A better formulation might be: with AVX10.1 enabled, -mno-avx512f does not fully disable AVX512-related instructions.
[Bug libfortran/110966] should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966 --- Comment #4 from Hongtao.liu --- (In reply to anlauf from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > (In reply to Richard Biener from comment #1)
> > > I think matmul is fine with avx512f or avx, so requiring/using only the
> > > base ISA level sounds fine to me.
> >
> > Could be a potential missed optimization.
>
> Do you mean a missed optimization?
>
> Or really wrong code?

A missed optimization.
[Bug target/110979] New: Miss-optimization for O2 fully masked loop on floating point reduction.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110979 Bug ID: 110979
Summary: Miss-optimization for O2 fully masked loop on floating point reduction.
Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3
Component: target Assignee: unassigned at gcc dot gnu.org
Reporter: crazylht at gmail dot com Target Milestone: ---

https://godbolt.org/z/YsaesW8zT

float foo3 (float* __restrict a, int n)
{
  float sum = 0.0f;
  for (int i = 0; i != 100; i++)
    sum += a[i];
  return sum;
}

With -O2 -march=znver4 --param vect-partial-vector-usage=2, we get

  [local count: 66437776]:
  # sum_13 = PHI
  # loop_mask_16 = PHI <_54(3), { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 }(2)>
  # ivtmp.13_12 = PHI
  # ivtmp.16_2 = PHI
  # DEBUG i => NULL
  # DEBUG sum => NULL
  # DEBUG BEGIN_STMT
  _4 = (void *) ivtmp.13_12;
  _11 = &MEM [(float *)_4];
  vect__4.6_17 = .MASK_LOAD (_11, 32B, loop_mask_16);
  cond_18 = .VCOND_MASK (loop_mask_16, vect__4.6_17, { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 });
  stmp_sum_10.7_19 = BIT_FIELD_REF ;
  stmp_sum_10.7_20 = sum_13 + stmp_sum_10.7_19;
  stmp_sum_10.7_21 = BIT_FIELD_REF ;
  stmp_sum_10.7_22 = stmp_sum_10.7_20 + stmp_sum_10.7_21;
  stmp_sum_10.7_23 = BIT_FIELD_REF ;
  stmp_sum_10.7_24 = stmp_sum_10.7_22 + stmp_sum_10.7_23;
  stmp_sum_10.7_25 = BIT_FIELD_REF ;
  stmp_sum_10.7_26 = stmp_sum_10.7_24 + stmp_sum_10.7_25;
  stmp_sum_10.7_27 = BIT_FIELD_REF ;
  stmp_sum_10.7_28 = stmp_sum_10.7_26 + stmp_sum_10.7_27;
  stmp_sum_10.7_29 = BIT_FIELD_REF ;
  stmp_sum_10.7_30 = stmp_sum_10.7_28 + stmp_sum_10.7_29;
  stmp_sum_10.7_31 = BIT_FIELD_REF ;
  stmp_sum_10.7_32 = stmp_sum_10.7_30 + stmp_sum_10.7_31;
  stmp_sum_10.7_33 = BIT_FIELD_REF ;
  stmp_sum_10.7_34 = stmp_sum_10.7_32 + stmp_sum_10.7_33;
  stmp_sum_10.7_35 = BIT_FIELD_REF ;
  stmp_sum_10.7_36 = stmp_sum_10.7_34 + stmp_sum_10.7_35;
  stmp_sum_10.7_37 = BIT_FIELD_REF ;
  stmp_sum_10.7_38 = stmp_sum_10.7_36 + stmp_sum_10.7_37;
  stmp_sum_10.7_39 = BIT_FIELD_REF ;
  stmp_sum_10.7_40 = stmp_sum_10.7_38 + stmp_sum_10.7_39;
  stmp_sum_10.7_41 = BIT_FIELD_REF ;
  stmp_sum_10.7_42 = stmp_sum_10.7_40 + stmp_sum_10.7_41;
  stmp_sum_10.7_43 = BIT_FIELD_REF ;
  stmp_sum_10.7_44 = stmp_sum_10.7_42 + stmp_sum_10.7_43;
  stmp_sum_10.7_45 = BIT_FIELD_REF ;
  stmp_sum_10.7_46 = stmp_sum_10.7_44 + stmp_sum_10.7_45;
  stmp_sum_10.7_47 = BIT_FIELD_REF ;
  stmp_sum_10.7_48 = stmp_sum_10.7_46 + stmp_sum_10.7_47;
  stmp_sum_10.7_49 = BIT_FIELD_REF ;
  sum_10 = stmp_sum_10.7_48 + stmp_sum_10.7_49;
  # DEBUG sum => sum_10
  # DEBUG BEGIN_STMT
  # DEBUG i => NULL
  # DEBUG sum => sum_10
  # DEBUG BEGIN_STMT
  _53 = {ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2, ivtmp.16_2};
  _54 = _53 > { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
  ivtmp.13_15 = ivtmp.13_12 + 64;
  ivtmp.16_3 = ivtmp.16_2 + 240;
  if (ivtmp.16_3 != 228)

Looks like a cost model issue? For aarch64, it looks fine since it has FADDA (floating-point add strictly-ordered reduction, accumulating in scalar).
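The long add chain above is simply strict-order reduction semantics made explicit. A minimal sketch (generic C, not from the PR) of why the order cannot be changed without -ffast-math: float addition is not associative, and x86 lacks an in-order vector reduction instruction like aarch64's FADDA.

```c
/* Strictly-ordered float reduction, i.e. the scalar semantics that foo3
   must preserve.  Reassociating the adds can change the result, which is
   why the vectorizer emits an elementwise BIT_FIELD_REF + add chain when
   partial vectors are used without -ffast-math.  */
static float
reduce_in_order (const float *a, int n)
{
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += a[i];   /* evaluated left-to-right, no reassociation */
  return sum;
}
```

For example, summing {1e8f, -1e8f, 1.0f} in order yields 1.0f, but the same three values in the order {1e8f, 1.0f, -1e8f} yield 0.0f, since 1.0f is absorbed by 1e8f in float precision.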
[Bug libfortran/110966] should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966 --- Comment #2 from Hongtao.liu --- (In reply to Richard Biener from comment #1)
> I think matmul is fine with avx512f or avx, so requiring/using only the base
> ISA level sounds fine to me.

Could be a potential missed optimization.
[Bug libfortran/110966] New: should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966 Bug ID: 110966
Summary: should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.
Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3
Component: libfortran Assignee: unassigned at gcc dot gnu.org
Reporter: crazylht at gmail dot com Target Milestone: ---

In libgfortran/m4/matmul.m4, we have

#ifdef HAVE_AVX512F
'define(`matmul_name',`matmul_'rtype_code`_avx512f')dnl
`static void
'matmul_name` ('rtype` * const restrict retarray,
    'rtype` * const restrict a, 'rtype` * const restrict b,
    int try_blas, int blas_limit, blas_call gemm)
    __attribute__((__target__("avx512f")));
static' include(matmul_internal.m4)dnl
`#endif  /* HAVE_AVX512F */

But target("avx512f") only enables -mavx512f, which covers quite a limited subset of AVX512 capability. Since we now have architecture levels, should we use target("arch=x86-64-v4") instead?
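A hedged C sketch of the difference being proposed (function names are made up; attribute support for "arch=x86-64-v4" assumes a recent GCC): target("avx512f") turns on only the AVX512F ISA bit, while target("arch=x86-64-v4") enables the whole x86-64-v4 feature level (AVX512F/BW/CD/DQ/VL on top of the v3 baseline), so the vectorizer has more instructions to choose from.

```c
#include <stddef.h>

/* Only the AVX512F bit is enabled here, like the current matmul.m4.  */
__attribute__ ((target ("avx512f")))
void
scale_avx512f (double *a, size_t n, double s)
{
  for (size_t i = 0; i < n; i++)
    a[i] *= s;   /* vectorized using plain AVX512F patterns only */
}

/* Whole x86-64-v4 level, as the bug suggests; AVX512VL/BW/DQ etc. are
   also available to the vectorizer.  */
__attribute__ ((target ("arch=x86-64-v4")))
void
scale_v4 (double *a, size_t n, double s)
{
  for (size_t i = 0; i < n; i++)
    a[i] *= s;
}
```

Both compile into separate functions with their own ISA assumptions; a caller must dispatch on CPU capability (e.g. via __builtin_cpu_supports) before calling them, exactly as libgfortran's matmul dispatcher does.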
[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926 --- Comment #10 from Hongtao.liu --- Fixed in GCC14.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #11 from Hongtao.liu --- (In reply to 罗勇刚(Yonggang Luo) from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > > Without `-mbmi` option, gcc can not compile and all other three compilers
> > > can compile.
> >
> > As long as it keeps the semantics (respects zero input), I think this is
> > acceptable.
>
> Yeap, it's acceptable, but consistency with Clang/MSVC/ICL would be better.
> That would make cross-platform code easier; besides, GCC also works for
> WIN32, which needs GCC to be consistent with MSVC.

Sorry for the confusion, I meant that generating code like

f(int, int):                  # @f(int, int)
        test    edi, edi
        je      .LBB0_2
        rep bsf eax, edi
        ret
.LBB0_2:
        mov     eax, 32
        ret

w/o -mbmi is acceptable as long as it respects zero input.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #9 from Hongtao.liu ---
> There is a redundant xor instruction,

There's a false-dependency issue on some specific processors.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

> Without `-mbmi` option, gcc can not compile and all other three compilers
> can compile.

As long as it keeps the semantics (respects zero input), I think this is acceptable.
[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926 --- Comment #8 from Hongtao.liu --- (In reply to Alexander Monakov from comment #7)
> Thanks for identifying the problem. Please don't rename the argument to
> 'op_mask' though: the parameter itself is not a mask, it's an eight-bit
> control word of the vpternlog instruction (holding the logic table of a
> three-operand Boolean function). The function derives a three-bit mask from
> it.

I'll rename it to ternlog_imm8 to avoid confusion.
[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926 --- Comment #6 from Hongtao.liu --- (In reply to Hongtao.liu from comment #5)
> I'm working on a patch.

 int
-vpternlog_redundant_operand_mask (rtx *operands)
+vpternlog_redundant_operand_mask (rtx op_mask)
 {
   int mask = 0;
-  int imm8 = XINT (operands[4], 0);
+  int imm8 = INTVAL (op_mask);

We should use INTVAL instead of XINT.
[Bug target/110926] [14 regression] Bootstrap failure (matmul_i1.c:1781:1: internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' (rtx const_int) in vpternlog_redundant_operand_m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110926 --- Comment #5 from Hongtao.liu --- I'm working on a patch.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #7 from Hongtao.liu --- (In reply to 罗勇刚(Yonggang Luo) from comment #6)
> MSVC also added it; clang seems to have an optimization issue, but MSVC
> doesn't have that.

No, I think what clang does is correct:

f(int, int):                  # @f(int, int)
        test    edi, edi      # --- when source operand is zero.
        je      .LBB0_2
        rep bsf eax, edi
        ret
.LBB0_2:
        mov     eax, 32
        ret

The key difference between the TZCNT and BSF instructions is that TZCNT provides the operand size as output when the source operand is zero, while in the case of BSF, if the source operand is zero, the contents of the destination operand are undefined.
https://godbolt.org/z/s74dfdWP4
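A small reference sketch (plain C, hypothetical helper name) of the semantics just described: TZCNT is fully defined for a zero input (it returns the operand size, 32), while BSF leaves the destination undefined, which is why a BSF-based expansion must branch on zero, as the clang code above does with test/je.

```c
/* Reference semantics of _tzcnt_u32, including the zero-input case.  */
static unsigned
tzcnt_u32_ref (unsigned x)
{
  if (x == 0)
    return 32;                          /* TZCNT's defined zero result */
  return (unsigned) __builtin_ctz (x);  /* well-defined: x is nonzero  */
}
```

The `if (x == 0)` branch is exactly the part BSF cannot provide on its own and is what the test/je pair in the generated code implements.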
[Bug target/105504] Fails to break dependency for vcvtss2sd xmm, xmm, mem
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105504 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #8 from Hongtao.liu --- (In reply to Eric Gallager from comment #7) > (In reply to CVS Commits from comment #6) > > The master branch has been updated by hongtao Liu : > > > > https://gcc.gnu.org/g:5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 > > > > commit r13-1009-g5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 > > Author: liuhongt > > Date: Mon May 30 15:30:51 2022 +0800 > > > > Disparages SSE_REGS alternatives sligntly with ?v instead of *v in > > *mov{si,di}_internal. > > > > So alternative v won't be igored in record_reg_classess. > > > > Similar for *r alternatives in some vector patterns. > > > > It helps testcase in the PR, also RA now makes better decisions for > > gcc.target/i386/extract-insert-combining.c > > > > movd%esi, %xmm0 > > movd%edi, %xmm1 > > - movl%esi, -12(%rsp) > > paddd %xmm0, %xmm1 > > pinsrd $0, %esi, %xmm0 > > paddd %xmm1, %xmm0 > > > > The patch has no big impact on SPEC2017 for both O2 and Ofast > > march=native run. > > > > And I noticed there's some changes in SPEC2017 from code like > > > > mov mem, %eax > > vmovd %eax, %xmm0 > > .. > > mov %eax, 64(%rsp) > > > > to > > > > vmovd mem, %xmm0 > > .. > > vmovd %xmm0, 64(%rsp) > > > > Which should be exactly what we want? > > > > gcc/ChangeLog: > > > > PR target/105513 > > PR target/105504 > > * config/i386/i386.md (*movsi_internal): Change alternative > > from *v to ?v. > > (*movdi_internal): Ditto. > > * config/i386/sse.md (vec_set_0): Change alternative *r > > to ?r. > > (*vec_extractv4sf_mem): Ditto. > > (*vec_extracthf): Ditto. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/pr105513-1.c: New test. > > * gcc.target/i386/extract-insert-combining.c: Add new > > scan-assembler-not for spill. > > Did this fix it? Yes.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #5 from Hongtao.liu --- Maybe the source code can be changed to

int f(int a, int b)
{
#ifdef __BMI__
  return _tzcnt_u32 (a);
#else
  return _bit_scan_forward (a);
#endif
}

But it looks like clang/MSVC don't support _bit_scan_forward; that should be a bug for them since it's in the intrinsics guide.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 --- Comment #4 from Hongtao.liu --- (In reply to Hongtao.liu from comment #3)
> But there's a difference between TZCNT and BSF:
> TZCNT provides the operand size as output when the source operand is zero,
> while BSF leaves the destination undefined for a zero source.
>
> Clang looks correct since it also handles the zero case; ICC seems wrong, it
> just generates
> https://godbolt.org/z/WvrsTrjWr

MSVC seems wrong too.
[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instrunction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #3 from Hongtao.liu --- But there's a difference between TZCNT and BSF:

The key difference between the TZCNT and BSF instructions is that TZCNT provides the operand size as output when the source operand is zero, while BSF leaves the destination undefined for a zero source.

Clang looks correct since it also handles the zero case; ICC seems wrong, it just generates
https://godbolt.org/z/WvrsTrjWr
[Bug target/110762] [11/12/13 Regression] inappropriate use of SSE (or AVX) insns for v2sf mode operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762 --- Comment #23 from Hongtao.liu --- (In reply to Uroš Bizjak from comment #22) > It looks to me that partial vector half-float instructions have the same > issue. Yes, I'll take a look.
[Bug target/81904] FMA and addsub instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904 --- Comment #7 from Hongtao.liu --- > > to .VEC_ADDSUB possibly loses exceptions (the vectorizer now directly > creates .VEC_ADDSUB when possible). Let's put it under -fno-trapping-math.
[Bug target/81904] FMA and addsub instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904 --- Comment #5 from Hongtao.liu --- (In reply to Richard Biener from comment #1)
> Hmm, I think the issue is we see
>
> f (__m128d x, __m128d y, __m128d z)
> {
>   vector(2) double _4;
>   vector(2) double _6;
>
>   [100.00%]:
>   _4 = x_2(D) * y_3(D);
>   _6 = __builtin_ia32_addsubpd (_4, z_5(D)); [tail call]

We can fold the builtin into .VEC_ADDSUB, and optimize MUL + VEC_ADDSUB -> VEC_FMADDSUB in match.pd?
[Bug target/81904] FMA and addsub instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904 --- Comment #4 from Hongtao.liu --- (In reply to Richard Biener from comment #2)
> __m128d h(__m128d x, __m128d y, __m128d z){
>   __m128d tem = _mm_mul_pd (x,y);
>   __m128d tem2 = tem + z;
>   __m128d tem3 = tem - z;
>   return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
> }
>
> doesn't quite work (the combiner pattern for fmaddsub is missing). Tried
> {0, 2} as well.
>
> :
> .LFB5021:
>         .cfi_startproc
>         vmovapd %xmm0, %xmm3
>         vfmsub132pd     %xmm1, %xmm2, %xmm0
>         vfmadd132pd     %xmm1, %xmm2, %xmm3
>         vshufpd $2, %xmm0, %xmm3, %xmm0

tem2_6 = .FMA (x_2(D), y_3(D), z_5(D));
# DEBUG tem2 => tem2_6
# DEBUG BEGIN_STMT
tem3_7 = .FMS (x_2(D), y_3(D), z_5(D));
# DEBUG tem3 => NULL
# DEBUG BEGIN_STMT
_8 = VEC_PERM_EXPR ;

Can it be handled in match.pd? Rewriting the fmaddsub pattern into a vec_merge of fma/fms looks too complex. Similarly for VEC_ADDSUB + MUL -> VEC_FMADDSUB.
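A hedged reference sketch (plain C, hypothetical helper name) of the fused operation being discussed. Per the usual addsub lane convention, _mm_addsub_pd / .VEC_ADDSUB subtracts in even lanes and adds in odd lanes, so the fused .VEC_FMADDSUB computes a*b - c in even lanes and a*b + c in odd lanes:

```c
/* Scalar reference for the fmaddsub lane pattern: even lanes subtract,
   odd lanes add, after the multiply.  */
static void
fmaddsub_ref (const double *a, const double *b, const double *c,
              double *dst, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (i & 1) ? a[i] * b[i] + c[i]    /* odd lane: add       */
                     : a[i] * b[i] - c[i];   /* even lane: subtract */
}
```

This is the lane pattern a combiner or match.pd rule would have to recognize from the FMA/FMS pair plus the VEC_PERM_EXPR selecting alternating lanes.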
[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #9 from Hongtao.liu --- (In reply to Uroš Bizjak from comment #8)
> (In reply to Richard Biener from comment #6)
> > Do we know whether we could in theory improve the sanitizing by optimization
> > without -funsafe-math-optimizations (I think -fno-trapping-math,
> > -ffinite-math-only -fno-signalling-nans should be a better guard?)?
>
> Regarding the sanitizing, we can remove all sanitizing MOVQ instructions
> between trapping instructions (IOW, the result of ADDPS is guaranteed to
> have zeros in the high part outside V2SF, so MOVQ is unnecessary in front of
> a follow-up MULPS).
>
> I think that some instruction back-walking pass on the RTL insn stream would
> be able to identify these unnecessary instructions and remove them.

A V2SFmode operand can be produced by direct patterns or by a SUBREG. I'm thinking about only sanitizing those V2SFmode operations when there's a subreg in the source operand, and making sure every other pattern which sets a V2SFmode dest clears the upper bits (including mov_internal, vec_concatv2sf_sse4_1, sse_storehps, *vec_concatv2sf_sse).

For mov_internal, we can just set alternative (v,v) with mode DI, then it will use vmovq; for the other alternatives which set sse_regs, the instructions have already cleared the upper bits.

For vec_concatv2sf_sse4_1/sse_storehps/*vec_concatv2sf_sse, we can change them into define_insn_and_split, splitting into a V4SF instruction (like we did for those V2SFmode patterns), and use a SUBREG for the dest or explicitly sanitize the dest.

BTW it looks like *vec_concatv2df_sse4_1 can be merged into *vec_concatv2sf_sse.