[Bug rtl-optimization/115021] [14 regression] unnecessary spill for vpternlog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021
Hongtao Liu changed:
What       |Removed |Added
Status     |NEW     |RESOLVED
Resolution |---     |FIXED
--- Comment #8 from Hongtao Liu ---
Fixed in GCC15.
[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096
Hongtao Liu changed:
What       |Removed  |Added
Status     |ASSIGNED |RESOLVED
Resolution |---      |FIXED
--- Comment #5 from Hongtao Liu ---
Fixed in GCC15.
[Bug tree-optimization/89749] Very odd vector constructor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89749
Hongtao Liu changed:
What          |Removed |Added
Resolution    |---     |FIXED
Known to work |        |12.1.0
CC            |        |liuhongt at gcc dot gnu.org
Status        |NEW     |RESOLVED
--- Comment #6 from Hongtao Liu ---
Fixed in GCC12 and above.
[Bug target/113744] Unnecessary "m" constraint in *adddi_4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113744
Hongtao Liu changed:
What       |Removed  |Added
Status     |ASSIGNED |RESOLVED
Resolution |---      |FIXED
--- Comment #6 from Hongtao Liu ---
Fixed in GCC15.
[Bug target/115981] [14/15 Regression] Redundant vmovaps to itself after vmovups since r14-537
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115981
--- Comment #4 from Hongtao Liu ---
(In reply to Jakub Jelinek from comment #3)
> Created attachment 58786 [details]
> gcc15-pr115981.patch
>
> Untested fix. As since that commit it checks swap_commutative_operands_p:
> 1) CONST_VECTOR I think has commutative_operand_precedence -4
> 2) REG has commutative_operand_precedence -1 or -2
> 3) SUBREG of object has commutative_operand_precedence -3
> 4) VEC_DUPLICATE has commutative_operand_precedence 0
> Which means the VEC_DUPLICATE operand will always come first and whatever
> matches reg_or_0_operand will always come second, i.e. exactly not the order
> in the pattern, so we don't need to add another one, can just change order
> of this one.

Patch LGTM.
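The ordering argument in the quoted comment can be modelled in a few lines of C. This is only a sketch: the enum, the function names, and the "swap when the first operand has lower precedence" rule are illustrative assumptions, and only the four precedence values come from the comment itself.

```c
#include <stdbool.h>

/* Hypothetical tags for the four operand kinds listed in the comment. */
enum operand_kind {
    KIND_CONST_VECTOR,
    KIND_SUBREG,
    KIND_REG,
    KIND_VEC_DUPLICATE
};

/* Precedence values as quoted (REG is -1 or -2; -1 is used here). */
static int precedence(enum operand_kind k)
{
    switch (k) {
    case KIND_CONST_VECTOR:  return -4;
    case KIND_SUBREG:        return -3;
    case KIND_REG:           return -1;
    case KIND_VEC_DUPLICATE: return 0;
    }
    return 0;
}

/* Assumed rule: swap when the first operand ranks lower than the second,
   so the higher-precedence operand ends up first. */
static bool should_swap(enum operand_kind first, enum operand_kind second)
{
    return precedence(first) < precedence(second);
}
```

Under this model a VEC_DUPLICATE in second position is always swapped to the front, whichever of the other three kinds it is paired with, which matches the conclusion that the pattern's operand order can simply be changed.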
[Bug target/116122] [14/15 regression] __FLT16_MAX__ is defined even with -mno-sse2 on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116122
Hongtao Liu changed:
What       |Removed  |Added
Resolution |---      |FIXED
Status     |ASSIGNED |RESOLVED
--- Comment #5 from Hongtao Liu ---
Mentioned in GCC14 "Changes" and "Porting to" documentation.
[Bug target/85236] missing _mm256_atan2_ps
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85236
Hongtao Liu changed:
What |Removed |Added
CC   |        |binklings at 163 dot com
--- Comment #8 from Hongtao Liu ---
*** Bug 116157 has been marked as a duplicate of this bug. ***
[Bug target/116157] AVX2 _mm256_exp_ps function is missing in the compiler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116157
Hongtao Liu changed:
What       |Removed     |Added
CC         |            |liuhongt at gcc dot gnu.org
Status     |UNCONFIRMED |RESOLVED
Resolution |---         |DUPLICATE
--- Comment #1 from Hongtao Liu ---
There is no plan to support it in GCC.
*** This bug has been marked as a duplicate of bug 85236 ***
[Bug target/113744] Unnecessary "m" constraint in *adddi_4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113744
Hongtao Liu changed:
What             |Removed                       |Added
Status           |UNCONFIRMED                   |ASSIGNED
Assignee         |unassigned at gcc dot gnu.org |lingling.kong7 at gmail dot com
Ever confirmed   |0                             |1
Last reconfirmed |2024-02-04 00:00:00           |2024-07-31
--- Comment #4 from Hongtao Liu ---
Then please remove the constraint from the pattern.
[Bug target/116122] [14/15 regression] __FLT16_MAX__ is defined even with -mno-sse2 on 32-bit x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116122
Hongtao Liu changed:
What             |Removed                       |Added
Status           |UNCONFIRMED                   |ASSIGNED
Last reconfirmed |                              |2024-07-29
Ever confirmed   |0                             |1
Assignee         |unassigned at gcc dot gnu.org |liuhongt at gcc dot gnu.org
--- Comment #4 from Hongtao Liu ---
> If this gcc change will not be reverted, it should be documented as a change
> in the gcc 14 "Changes" and "Porting to" documentation.

I'll add some documentation for that.
[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096
--- Comment #3 from Hongtao Liu ---
>
>  (define_insn "ashl<mode>3_doubleword"
>    [(set (match_operand:DWI 0 "register_operand" "=,")
> -        (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
> +        (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0BC,r")
>                      (match_operand:QI 2 "nonmemory_operand" "c,c")))
>     (clobber (reg:CC FLAGS_REG))]
>    ""

The patch is incomplete: it should also support the integer 1, since pm1_operand means 1 or -1.
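The gap between predicate and constraint can be illustrated with a small C model for integer constants. All function names here are hypothetical, and the model ignores the vector all-ones case that "BC" also accepts; it only shows why "n" is too permissive and why "BC" alone is too strict for an operand whose predicate allows both 1 and -1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the constant part of reg_or_pm1_operand: only 1 or -1. */
static bool pm1_predicate(int64_t v)
{
    return v == 1 || v == -1;
}

/* Models the old "n" constraint: any compile-time integer is accepted. */
static bool constraint_n(int64_t v)
{
    (void)v;
    return true;
}

/* Models the integer case of "BC" in the quoted patch: all bits set (-1). */
static bool constraint_bc(int64_t v)
{
    return v == -1;
}

/* Hypothetical constraint that agrees with the predicate on integers. */
static bool constraint_pm1(int64_t v)
{
    return v == 1 || v == -1;
}
```

With this model, the constant 1671291085 from the failing insn in this bug passes `constraint_n` but not `pm1_predicate`, while 1 passes the predicate but is rejected by `constraint_bc`.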
[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096
Hongtao Liu changed:
What     |Removed                       |Added
Assignee |unassigned at gcc dot gnu.org |liuhongt at gcc dot gnu.org
Status   |NEW                           |ASSIGNED
[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096
Hongtao Liu changed:
What |Removed |Added
CC   |        |liuhongt at gcc dot gnu.org
--- Comment #2 from Hongtao Liu ---
(In reply to Andrew Pinski from comment #1)
> This is interesting.
>
> After reload we have:
> ```
> (insn 450 93 97 2 (set (reg:QI 2 cx [521])
>         (reg:QI 38 r10 [521])) "/app/example.cpp":11:13 91 {*movqi_internal}
>      (nil))
> (insn 97 450 385 2 (parallel [
>             (set (reg:TI 4 si [orig:337 _32 ] [337])
>                 (ashift:TI (const_int 1671291085 [0x639de0cd])
>                     (reg:QI 2 cx [521])))
>             (clobber (reg:CC 17 flags))
>         ]) "/app/example.cpp":11:13 953 {ashlti3_doubleword}
>      (expr_list:REG_EQUIV (mem:TI (plus:DI (reg/f:DI 19 frame)
>                 (const_int -80 [0xffb0])) [2 S16 A128])
>         (expr_list:REG_EQUAL (ashift:TI (const_int 1671291085 [0x639de0cd])
>                 (reg:QI 38 r10 [521]))
>             (nil))))
> ```

The insn should already be invalid after reload, since 1671291085 is not a reg_or_pm1_operand. I guess reload hasn't checked the predicate, but only the constraint?

(define_insn "ashl<mode>3_doubleword"
  [(set (match_operand:DWI 0 "register_operand" "=,")
        (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
                    (match_operand:QI 2 "nonmemory_operand" "c,c")))

Before reload it's OK:

(insn 98 94 387 2 (parallel [
            (set (reg:TI 337 [ _32 ])
                (ashift:TI (reg:TI 329)
                    (reg:QI 521)))
            (clobber (reg:CC 17 flags))
        ]) "test.c":11:13 953 {ashlti3_doubleword}

I'm testing the patch below, which can fix the issue.

diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md
index 7508d7a58bd..e1d70162b88 100644
--- a/gcc/config/i386/constraints.md
+++ b/gcc/config/i386/constraints.md
@@ -225,9 +225,8 @@ (define_constraint "Bz"
 (define_constraint "BC"
   "@internal integer SSE constant with all bits set operand."
-  (and (match_test "TARGET_SSE")
-       (ior (match_test "op == constm1_rtx")
-            (match_operand 0 "vector_all_ones_operand"))))
+  (ior (match_test "op == constm1_rtx")
+       (match_operand 0 "vector_all_ones_operand")))

 (define_constraint "BF"
   "@internal floating-point SSE constant with all bits set operand."
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 6207036a2a0..9c4e847fba1 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -14774,7 +14774,7 @@ (define_insn_and_split "*ashl<mode>3_doubleword_mask_1"
 (define_insn "ashl<mode>3_doubleword"
   [(set (match_operand:DWI 0 "register_operand" "=,")
-        (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
+        (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0BC,r")
                     (match_operand:QI 2 "nonmemory_operand" "c,c")))
    (clobber (reg:CC FLAGS_REG))]
   ""
[Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846
Hongtao Liu changed:
What |Removed |Added
CC   |        |liuhongt at gcc dot gnu.org
--- Comment #5 from Hongtao Liu ---
Just a note: with -mapxf, GCC now generates

        cmp     edx, 5
        setzune dl
[Bug target/115978] [x86] GCC issues an error when using -m32 -march=native on APX available machine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115978
--- Comment #10 from Hongtao Liu ---
(In reply to H.J. Lu from comment #9)
> (In reply to Hongtao Liu from comment #8)
> > Fixed in GCC15, thanks H.J.
>
> Does GCC 14 have the same issue with -m32 -march=native?

Yes, I will backport the patch.
[Bug target/115978] [x86] GCC issues an error when using -m32 -march=native on APX available machine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115978
Hongtao Liu changed:
What       |Removed |Added
Resolution |---     |FIXED
Status     |WAITING |RESOLVED
--- Comment #8 from Hongtao Liu ---
Fixed in GCC15, thanks H.J.
[Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #48 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #47)
> Created attachment 58746 [details]
> Accoate v2di with GPR
>
> The attached patch can allocate V2DI with GPR to avoid spill.

@Uros, is it a good idea to make GPRs available for all 128-bit vector modes by:
1) extending *movti_internal to all 128-bit vectors, extending the related splitter to handle moves between GPR and SSE_REG, and extending split_double_mode to handle moves between GPR and GPR;
2) adjusting ix86_hard_regno_mode_ok to make GPRs available for all 128-bit vectors;
3) adjusting inline_secondary_memory_needed, since moves between GPR and SSE registers would now be supported for 16-byte vectors?
[Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
Hongtao Liu changed:
What |Removed |Added
CC   |        |liuhongt at gcc dot gnu.org
--- Comment #47 from Hongtao Liu ---
Created attachment 58746
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58746&action=edit
Accoate v2di with GPR

The attached patch can allocate V2DI with GPR to avoid spill.

poly_double_le2:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rdx
        movq    8(%rsi), %rdi
        movq    (%rsi), %rsi
        movq    %rdi, %rax
        movq    %rsi, %rcx
        vmovq   %rsi, %xmm4
        sarq    $63, %rax
        shrq    $63, %rcx
        vpinsrq $1, %rdi, %xmm4, %xmm3
        andl    $135, %eax
        vpsllq  $1, %xmm3, %xmm1
        vmovq   %rax, %xmm2
        vpinsrq $1, %rcx, %xmm2, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdx)
        ret
        .cfi_endproc

But when there's (subreg:V (reg:TI 0)) for other vector modes, the issue could still be there.
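For readers who prefer C to assembly, the following is a scalar reading of what the listing above computes. It is reconstructed from the instructions only, not taken from the botan sources; the function name and the two-limb array layout are assumptions.

```c
#include <stdint.h>

/* Doubles a 128-bit value held as two 64-bit limbs (in[0] low, in[1] high),
   following the assembly: both limbs are shifted left by one, the bit
   shifted out of the low limb is XORed into the high limb, and 135 is
   XORed into the low limb when the top bit of the high limb was set. */
static void poly_double_sketch(uint64_t out[2], const uint64_t in[2])
{
    uint64_t lo = in[0];
    uint64_t hi = in[1];
    uint64_t reduce = (hi >> 63) ? 135u : 0u;  /* sarq $63 ; andl $135 */
    uint64_t carry = lo >> 63;                 /* shrq $63 */

    out[0] = (lo << 1) ^ reduce;               /* low lane of vpsllq/vpxor */
    out[1] = (hi << 1) ^ carry;                /* high lane of vpsllq/vpxor */
}
```

The two scalar values `reduce` and `carry` are what the listing builds in GPRs and then inserts into an XMM register, which is why keeping the V2DI value out of memory matters here.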
[Bug c++/116064] [15 Regression] SPEC 2017 523.xalancbmk_r failed to build
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116064
Hongtao Liu changed:
What |Removed |Added
CC   |        |liuhongt at gcc dot gnu.org
--- Comment #2 from Hongtao Liu ---
But does GCC have a workaround similar to -fdelayed-template-parsing in Clang?
[Bug target/116043] [15 regression] TLS relocation issue when building glibc with -O3 -mavx512bf16 by r15-1619
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116043
Hongtao Liu changed:
What     |Removed                       |Added
Assignee |unassigned at gcc dot gnu.org |liuhongt at gcc dot gnu.org
Status   |NEW                           |ASSIGNED
--- Comment #15 from Hongtao Liu ---
> I think we can exclude case when base and index are both NULL_RTX, let's
> always use *mov{si,di}_internal pattern to move const to register.

No, I misunderstood the issue. It's not a problem with the lea pattern; rather, the address of gottpoff shouldn't be reloaded.

In PR103275, r12-5445-gb5844cb0bc8c7d9be2ff1ecded249cad82b9b71c added a new constraint "Bk" to avoid kmovq foo@gottpoff(%rip), %k0, but RA may still allocate a k/v register and try to reload the address, perhaps because it thinks the cost of reloading the address is cheap?

Changing "Bk" to define_special_memory_constraint, which avoids the address reload, can solve the issue.

modified   gcc/config/i386/constraints.md
@@ -187,7 +187,7 @@ (define_special_memory_constraint "Bm"
   "@internal Vector memory operand."
   (match_operand 0 "vector_memory_operand"))

-(define_memory_constraint "Bk"
+(define_special_memory_constraint "Bk"
   "@internal TLS address that allows insn using non-integer registers."
   (and (match_operand 0 "memory_operand")
        (not (match_test "ix86_gpr_tls_address_pattern_p (op)"))))

I'm testing the patch.
[Bug target/116043] [15 regression] TLS relocation issue when building glibc with -O3 -mavx512bf16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116043
Hongtao Liu changed:
What |Removed |Added
CC   |        |liuhongt at gcc dot gnu.org
--- Comment #11 from Hongtao Liu ---
The buggy insn is like

(insn:TI 348 525 527 5 (set (reg:DI 4 si [156])
        (const:DI (unspec:DI [
                    (symbol_ref:DI ("__libc_tsd_CTYPE_B") [flags 0x60] )
                ] UNSPEC_GOTNTPOFF))) "/app/example.c":12:37 discrim 1 258 {*leadi}
     (nil))

--define_insn is like--
(define_insn "*lea<mode>"
  [(set (match_operand:SWI48 0 "register_operand" "=r")
        (match_operand:SWI48 1 "address_no_seg_operand" "Ts"))]
  "ix86_hardreg_mov_ok (operands[0], operands[1])"

;; Return true if op is a valid address for LEA, and does not contain
;; a segment override.  Defined as a special predicate to allow
;; mode-less const_int operands pass to address_operand.
(define_special_predicate "address_no_seg_operand"
  (match_test "address_operand (op, VOIDmode)")
{
  struct ix86_address parts;
  int ok;

  if (!CONST_INT_P (op)
      && mode != VOIDmode
      && GET_MODE (op) != mode)
    return false;

  ok = ix86_decompose_address (op, &parts);
  gcc_assert (ok);
  return parts.seg == ADDR_SPACE_GENERIC;
})
--define_insn ends--

I think we can exclude the case when base and index are both NULL_RTX, and always use the *mov{si,di}_internal pattern to move the constant to a register.
[Bug target/115982] [15 Regression] ICE: unrecognizable insn in ira_remove_insn_scratches with -mavx512vl since r15-1742
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115982
--- Comment #5 from Hongtao Liu ---
Fixed by r15-2217-ga3f03891065cb9; it could be latent on release branches since GCC12.
[Bug target/115982] [15 Regression] ICE: unrecognizable insn in ira_remove_insn_scratches with -mavx512vl since r15-1742
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115982
Hongtao Liu changed:
What     |Removed                       |Added
Status   |NEW                           |ASSIGNED
Assignee |unassigned at gcc dot gnu.org |liuhongt at gcc dot gnu.org
--- Comment #4 from Hongtao Liu ---
I'll take a look.
[Bug tree-optimization/115994] Vectorizer failed to do vectorizaton for .sat_trunc when nunits_in / nunits_out > 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115994
--- Comment #1 from Hongtao Liu ---
Also in vect_recog_sat_trunc_pattern

  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
  internal_fn fn = IFN_SAT_TRUNC;

  if (v_itype != NULL_TREE && v_otype != NULL_TREE
      && direct_internal_fn_supported_p (fn, tree_pair (v_otype, v_itype),
                                         OPTIMIZE_FOR_BOTH))
    {
      gcall *call = gimple_build_call_internal (fn, 1, ops[0]);
      tree out_ssa = vect_recog_temp_ssa_var (otype, NULL);

It's supposed to check for something like sstruncv8siv8hi2, but it actually checks for sstruncv8siv16hi2, since get_vectype_for_scalar_type returns a same-size vector type, not a same-nunits vector type.
[Bug tree-optimization/115994] New: Vectorizer failed to do vectorizaton for .sat_trunc when nunits_in / nunits_out > 2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115994
Bug ID: 115994
Summary: Vectorizer failed to do vectorizaton for .sat_trunc when nunits_in / nunits_out > 2
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: liuhongt at gcc dot gnu.org
Target Milestone: ---

In vectorizable_call:

  nunits_in = TYPE_VECTOR_SUBPARTS (vectype_in);
  nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
  if (known_eq (nunits_in * 2, nunits_out))
    modifier = NARROW;
  else if (known_eq (nunits_out, nunits_in))
    modifier = NONE;
  else if (known_eq (nunits_out * 2, nunits_in))
    modifier = WIDEN;
  else
    return false;

x86 AVX512 supports vpmovusqb/vpmovusqw/vpmovusdb. Since the current vectorizer keeps the same vector length for input and output, the two element counts differ by a factor greater than 2 for these conversions, and vectorization of .sat_trunc fails.
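The quoted check can be replayed with plain integers to see which conversions fall through to `return false`. This is a sketch only: the enum and function names are illustrative, and plain `unsigned` stands in for the poly-int values GCC actually compares.

```c
/* Outcome of the classification; MOD_UNSUPPORTED stands for "return false". */
enum modifier { MOD_NONE, MOD_NARROW, MOD_WIDEN, MOD_UNSUPPORTED };

/* Same sequence of comparisons as the quoted vectorizable_call excerpt. */
static enum modifier classify(unsigned nunits_in, unsigned nunits_out)
{
    if (nunits_in * 2 == nunits_out)
        return MOD_NARROW;
    if (nunits_out == nunits_in)
        return MOD_NONE;
    if (nunits_out * 2 == nunits_in)
        return MOD_WIDEN;
    return MOD_UNSUPPORTED;
}

/* Number of lanes in a vector of vector_bits bits with elem_bits elements. */
static unsigned lanes(unsigned vector_bits, unsigned elem_bits)
{
    return vector_bits / elem_bits;
}
```

With 512-bit vectors on both sides, a 64-bit to 32-bit truncation gives 8 and 16 lanes and is accepted, whereas 64-bit to 8-bit gives 8 and 64 lanes and 32-bit to 8-bit gives 16 and 64 lanes, so both are rejected under this model.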
[Bug target/115978] [x86] GCC issues an error when using -m32 -march=native on APX available machine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115978
--- Comment #6 from Hongtao Liu ---
(In reply to H.J. Lu from comment #5)
> (In reply to Hongtao Liu from comment #4)
> > To clarify, the question originally came from whether or not to report error
> > for -m32,-march=native, and then LLVM folks said it's difficult for LLVM not
> > issuing error for -march=native -m32, but issuing error for explicit -mapxf
> > -m32. So they want to just not issue error at all, and then compiler
> > silently disables the 64-bit only features(plus adding documents to mention
> > -m32 will disable those features).
>
> This is no different from PR 101395. I don't believe LLVM can't work like
> GCC.

I prefer your fix; I'll bring this back to the LLVM folks for further discussion.
[Bug target/115978] [x86] GCC issues an error when using -m32 -march=native on APX available machine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115978
--- Comment #4 from Hongtao Liu ---
To clarify, the question originally came from whether or not to report an error for -m32 -march=native. The LLVM folks said it's difficult for LLVM to not issue an error for -march=native -m32 while still issuing an error for an explicit -mapxf -m32. So they want to not issue an error at all, and have the compiler silently disable the 64-bit-only features (plus adding documentation to mention that -m32 will disable those features).
[Bug tree-optimization/114966] fails to optimize avx2 in-register permute written with std::experimental::simd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966
--- Comment #5 from Hongtao Liu ---
I saw pass_esra optimize BIT_FIELD_REF of big memory into loads from small memory:

Created a replacement for D.161366 offset: 0, size: 64: SR.20D.170101
Created a replacement for D.161366 offset: 64, size: 64: SR.21D.170102
Created a replacement for D.161366 offset: 128, size: 64: SR.22D.170103
Created a replacement for D.161547 offset: 0, size: 256: SR.23D.170104

  _8 = BIT_FIELD_REF ;
  _9 = BIT_FIELD_REF ;
  _10 = BIT_FIELD_REF ;
  _11 = {0, _8, _9, _10};

to

  SR.20_3 = MEM [(struct simd *)];
  SR.21_13 = MEM [(struct simd *) + 8B];
  SR.22_14 = MEM [(struct simd *) + 16B];
  _7 = SR.20_3;
  _8 = SR.21_13;
  _9 = SR.22_14;
  _10 = {0, _7, _8, _9};

So I guess that, later on, GCC somehow can't be sure the whole 256-bit memory is valid and fails to optimize it with vec_perm_expr?
[Bug middle-end/115863] [15 Regression] zlib-1.3.1 miscompilation since r15-1936-g80e446e829d818
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115863
Hongtao Liu changed:
What |Removed |Added
CC   |        |lin1.hu at intel dot com
--- Comment #16 from Hongtao Liu ---
> Unfortunately, x86 has no vector mode .SAT_TRUNC instruction.

No, AVX512 supports both signed and unsigned saturating truncation:

vpmovsdb:vpmovusdb
vpmovsdw:vpmovusdw
vpmovsqb:vpmovusqb
vpmovsqd:vpmovusqd
vpmovsqw:vpmovusqw
vpmovswb:vpmovuswb

and we're working on a patch to support that.
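As a reminder of what "saturating truncation" means per element, here is a scalar sketch for the doubleword-to-byte pair (vpmovsdb and vpmovusdb). The function names are illustrative; the vector instructions apply the same clamping to every lane.

```c
#include <stdint.h>

/* Unsigned saturating truncation, 32 bits to 8 bits: values above 255
   clamp to 255 instead of wrapping. */
static uint8_t sat_trunc_u32_u8(uint32_t x)
{
    return x > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)x;
}

/* Signed saturating truncation, 32 bits to 8 bits: values outside
   [-128, 127] clamp to the nearest bound. */
static int8_t sat_trunc_s32_s8(int32_t x)
{
    if (x > INT8_MAX)
        return INT8_MAX;
    if (x < INT8_MIN)
        return INT8_MIN;
    return (int8_t)x;
}
```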
[Bug target/113711] APX instruction set and instructions longer than 15 bytes (assembly warning)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113711
Hongtao Liu changed:
What       |Removed |Added
Status     |NEW     |RESOLVED
Resolution |---     |FIXED
CC         |        |liuhongt at gcc dot gnu.org
--- Comment #12 from Hongtao Liu ---
Fixed in GCC14.
[Bug target/113733] Invalid APX TLS code squence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113733
Bug 113733 depends on bug 113711, which changed state.
Bug 113711 Summary: APX instruction set and instructions longer than 15 bytes (assembly warning)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113711
What       |Removed |Added
Status     |NEW     |RESOLVED
Resolution |---     |FIXED
[Bug tree-optimization/115843] [14/15 Regression] 531.deepsjeng_r fails to verify with -O3 -march=znver4 --param vect-partial-vector-usage=2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115843
--- Comment #10 from Hongtao Liu ---
> But using kmovw for QImode mask is not correct as we don't know the value in
> gpr. Perhaps we'd consider restrict the kmovb under avx512dq only.

Why? As long as we only care about the lower 8 bits, kmovw should be fine.
[Bug tree-optimization/115843] [14/15 Regression] 531.deepsjeng_r fails to verify with -O3 -march=znver4 --param vect-partial-vector-usage=2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115843
Hongtao Liu changed:
What |Removed |Added
CC   |        |liuhongt at gcc dot gnu.org
--- Comment #9 from Hongtao Liu ---
Observed one missed optimization:

        kxorw   %k4, %k4, %k4   # 262 [c=4 l=4]  *movqi_internal/14
        vmovdqu64       %zmm0, KingPressureMask1-120(%rip){%k4} # 44 [c=65 l=10]  avx512f_storev8di_mask
        vmovdqu64       %zmm0, KingPressureMask1-56(%rip){%k4}  # 47 [c=65 l=10]  avx512f_storev8di_mask

When the mask is 0, the masked store can be optimized away.
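Two points from this thread can be checked against a scalar model of an 8-lane masked store: a zero mask writes nothing, and only the low 8 mask bits matter for eight 64-bit lanes. The model below is a sketch with a hypothetical function name; it assumes, as the thread does, that the store inspects one mask bit per lane.

```c
#include <stdint.h>

/* Scalar model of an 8 x 64-bit masked store: lane i of src is written to
   dst only when bit i of mask is set.  Bits 8..15 of mask are never read. */
static void masked_store8(uint64_t dst[8], const uint64_t src[8],
                          uint16_t mask)
{
    for (int i = 0; i < 8; i++) {
        if ((mask >> i) & 1u)
            dst[i] = src[i];
    }
}
```

Calling it with masks 0x00A5 and 0xFFA5 leaves the destination in the same state, and calling it with mask 0 leaves the destination untouched, which is the case where the store could be dropped.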
[Bug tree-optimization/115872] [12/13/14/15 regression] ICE in fab pass (error: missing definition with -g & -O3)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115872
Hongtao Liu changed:
What       |Removed  |Added
Resolution |---      |FIXED
Status     |ASSIGNED |RESOLVED
--- Comment #6 from Hongtao Liu ---
Fixed in GCC12.5, GCC13.4, GCC14.2 and main trunk.
[Bug target/115889] [15 Regression] FAIL: gcc.dg/vect/vect-vfa-03.c execution test with -march=znver4 --param vect-partial-vector-usage=1 since r15-1368-g6d0b7b69d14302
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115889
Hongtao Liu changed:
What       |Removed     |Added
Status     |UNCONFIRMED |RESOLVED
Resolution |---         |FIXED
CC         |            |liuhongt at gcc dot gnu.org
--- Comment #9 from Hongtao Liu ---
Fixed in GCC15.
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 115889, which changed state.
Bug 115889 Summary: [15 Regression] FAIL: gcc.dg/vect/vect-vfa-03.c execution test with -march=znver4 --param vect-partial-vector-usage=1 since r15-1368-g6d0b7b69d14302
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115889
What       |Removed     |Added
Status     |UNCONFIRMED |RESOLVED
Resolution |---         |FIXED
[Bug target/115842] [15 Regression] 6.5% slowdown of 548.exchange2_r on Intel Ice Lake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115842
--- Comment #3 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #2)
> Bisected to r15-1673-gb8153b5417bed0, the commit fixed wrong rtx_cost of
> r15-882-g1d6199e5f8c1c0 which happened to improve 548.exchange2_r.

It looks like the wrong rtx_cost of mem somehow led to better RA, with fewer spills in the hot loop.
[Bug target/115842] [15 Regression] 6.5% slowdown of 548.exchange2_r on Intel Ice Lake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115842
Hongtao Liu changed:
What             |Removed                     |Added
Status           |ASSIGNED                    |UNCONFIRMED
Ever confirmed   |1                           |0
Last reconfirmed |2024-07-11 00:00:00         |
Assignee         |liuhongt at gcc dot gnu.org |unassigned at gcc dot gnu.org
--- Comment #2 from Hongtao Liu ---
Bisected to r15-1673-gb8153b5417bed0. That commit fixed the wrong rtx_cost introduced by r15-882-g1d6199e5f8c1c0, which happened to improve 548.exchange2_r.
[Bug tree-optimization/115872] [12/13/14/15 regression] ICE in fab pass (error: missing definition with -g & -O3)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115872
Hongtao Liu changed:
What     |Removed                       |Added
Assignee |unassigned at gcc dot gnu.org |liuhongt at gcc dot gnu.org
Status   |NEW                           |ASSIGNED
--- Comment #2 from Hongtao Liu ---
Mine.
[Bug target/115842] [15 Regression] 6.5% slowdown of 548.exchange2_r on Intel Ice Lake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115842
Hongtao Liu changed:
What             |Removed                       |Added
Last reconfirmed |                              |2024-07-11
Status           |UNCONFIRMED                   |ASSIGNED
Assignee         |unassigned at gcc dot gnu.org |liuhongt at gcc dot gnu.org
Ever confirmed   |0                             |1
--- Comment #1 from Hongtao Liu ---
I'll take a look.
[Bug tree-optimization/115833] SLP of signed short multiply goes wrong
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833
Hongtao Liu changed:
What |Removed |Added
CC   |        |lin1.hu at intel dot com
--- Comment #4 from Hongtao Liu ---
> is a bit odd for the packing. Possibly the target lacks a truncv4siv4hi
> operation (thus the explicit zero vector). Possibly x86 lacks a
> pack-lowpart/pack-highpart insn.

We support truncv4siv4hi2 under AVX2; without AVX512, it generates pshufb.

(define_expand "trunc2"
  [(set (match_operand: 0 "register_operand")
        (truncate:
          (match_operand:PMOV_SRC_MODE_4 1 "register_operand")))]
  "TARGET_AVX2"
{

bar(unsigned int __vector(4)):
        vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
        ret

Without AVX2, it's lowered to

  _12 = VEC_PACK_TRUNC_EXPR <_9, { 0, 0, 0, 0 }>;
  _13 = BIT_FIELD_REF <_12, 64, 0>;

vec_pack_trunc_expr uses packusdw with the upper 16 bits cleared. The optab can be extended to TARGET_SSSE3, which supports pshufb.
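The reason the upper 16 bits have to be cleared before packusdw is that the instruction saturates rather than truncates. A scalar sketch of one lane makes this visible; the function names are illustrative and the model covers a single 32-bit element.

```c
#include <stdint.h>

/* One lane of packusdw: a signed 32-bit input is converted to an unsigned
   16-bit result with unsigned saturation. */
static uint16_t packus_lane(int32_t x)
{
    if (x < 0)
        return 0;
    if (x > 0xFFFF)
        return 0xFFFF;
    return (uint16_t)x;
}

/* Plain truncation built on the saturating pack: masking to the low 16 bits
   first keeps the input in [0, 0xFFFF], so saturation never triggers. */
static uint16_t trunc_via_pack(uint32_t x)
{
    return packus_lane((int32_t)(x & 0xFFFFu));
}
```

Without the mask, an input such as 0x12345678 would come out as 0xFFFF instead of 0x5678.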
[Bug tree-optimization/115833] SLP of signed short multiply goes wrong
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833
Hongtao Liu changed:
What |Removed |Added
CC   |        |liuhongt at gcc dot gnu.org
--- Comment #3 from Hongtao Liu ---
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs. Possibly
> constructing a V4SImode vector and then packing to V4HImode would be
> better?

void
v4hi_contruct(signed short *t, signed short tt, short tt1)
{
  t[0] = tt;
  t[1] = tt1;
  t[2] = tt1;
  t[3] = tt1;
}

void
v4si_contruct(int *t, int tt, int tt2)
{
  t[0] = tt;
  t[1] = tt2;
  t[2] = tt2;
  t[3] = tt2;
}

v4hi_contruct(short*, short, short):
        movzx   eax, dx
        movzx   esi, si
        mov     rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rax
        sal     rdx, 16
        or      rdx, rsi
        mov     QWORD PTR [rdi], rdx
        ret
v4si_contruct(int*, int, int):
        vmovd   xmm2, edx
        vmovd   xmm3, esi
        vpinsrd xmm1, xmm2, edx, 1
        vpinsrd xmm0, xmm3, edx, 1
        vpunpcklqdq     xmm0, xmm0, xmm1
        vmovdqu XMMWORD PTR [rdi], xmm0
        ret

Both vmovd and vpinsrd are expensive, so v4hi_contruct is not necessarily worse than v4si_contruct, but v4hi_contruct can be optimized to be a little more parallel via GPRs.
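One way to read "a little more parallel via GPRs" is to build the two 32-bit halves independently and combine them once, instead of chaining three shift/or pairs. The sketch below only illustrates that idea in C; it is not the sequence GCC emits, the function names are hypothetical, and it assumes the little-endian lane layout of the example (t[0] in the lowest 16 bits).

```c
#include <stdint.h>

/* Serial construction, mirroring the quoted assembly: every shift/or
   depends on the result of the previous one. */
static uint64_t pack_v4hi_serial(uint16_t tt, uint16_t tt1)
{
    uint64_t r = tt1;
    r = (r << 16) | tt1;
    r = (r << 16) | tt1;
    r = (r << 16) | tt;
    return r;
}

/* Two independent halves combined at the end: lo and hi do not depend on
   each other, so the dependency chain is shorter. */
static uint64_t pack_v4hi_parallel(uint16_t tt, uint16_t tt1)
{
    uint64_t lo = (uint64_t)tt | ((uint64_t)tt1 << 16);
    uint64_t hi = (uint64_t)tt1 | ((uint64_t)tt1 << 16);
    return lo | (hi << 32);
}
```

Both functions return the same 64-bit value, which can then be stored with a single 8-byte move as in the original code.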
[Bug target/113312] Add __attribute__((no_callee_saved_registers)) for Intel FRED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113312
Hongtao Liu changed:
What       |Removed |Added
Status     |NEW     |RESOLVED
Resolution |---     |FIXED
--- Comment #29 from Hongtao Liu ---
.
[Bug target/113312] Add __attribute__((no_callee_saved_registers)) for Intel FRED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113312
--- Comment #28 from Hongtao Liu ---
__attribute__((no_callee_saved_registers)) is added in GCC14.
[Bug target/113733] Invalid APX TLS code squence
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113733
Hongtao Liu changed:
What       |Removed     |Added
Status     |UNCONFIRMED |RESOLVED
Resolution |---         |FIXED
CC         |            |liuhongt at gcc dot gnu.org
--- Comment #2 from Hongtao Liu ---
Fixed in GCC14.
[Bug target/115115] [12/13/14/15 Regression] highway-1.0.7 wrong _mm_cvttps_epi32() constant fold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115115
Hongtao Liu changed:
What       |Removed     |Added
Resolution |---         |FIXED
Status     |UNCONFIRMED |RESOLVED
--- Comment #16 from Hongtao Liu ---
Fixed in GCC15.
[Bug target/115796] [15 Regression] build failure since double_u -> __double_u change
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115796
Hongtao Liu changed:
What       |Removed |Added
Status     |NEW     |RESOLVED
Resolution |---     |FIXED
--- Comment #4 from Hongtao Liu ---
Fixed in GCC15.
[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Hongtao Liu changed:
What |Removed |Added
CC   |        |haochen.jiang at intel dot com,
     |        |liuhongt at gcc dot gnu.org
--- Comment #10 from Hongtao Liu ---
> One of the comments in PR 115756 was "I'd lean towards shift+add because for
> example Intel E-cores have a slow imul.". However, my benchmarks suggest
> that even on Intel Efficiency CPU cores the algorithm with 2 multiplication
> instructions is faster. (I used the Process Lasso tool on Windows 11 to
> force the benchmark to be run on an Efficiency CPU core).

@Haochen, could you try the benchmark on our Sierra Forest machine?
I'm OK with adjusting the rtx_cost of imulq from COSTS_N_INSNS (4) to COSTS_N_INSNS (3) if the performance test looks OK.
[Bug target/115755] mulx (with -mbmi2) does not show up with constant multiply
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115755
Hongtao Liu changed:
What |Removed |Added
CC   |        |liuhongt at gcc dot gnu.org
--- Comment #1 from Hongtao Liu ---
mulx doesn't support an immediate operand; a register is still needed to hold 123. That mulq is used in func/func1 should be OK.
[Bug target/115756] default tuning for x86_64 produces shifts for `*240`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756
--- Comment #3 from Hongtao Liu ---
The current rtx_cost for imulq in generic_cost is COSTS_N_INSNS (4); making it COSTS_N_INSNS (3) could generate imulq.

  {COSTS_N_INSNS (3),   /* cost of starting multiply for QI */
   COSTS_N_INSNS (4),   /*                               HI */
   COSTS_N_INSNS (3),   /*                               SI */
   COSTS_N_INSNS (4),   /*                               DI */
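For context, the trade-off being costed is between a single multiply and an equivalent shift-based sequence. Since 240 = 256 - 16, one possible decomposition is sketched below; this shows the arithmetic identity only and is not claimed to be the exact sequence GCC emits. The function names are illustrative.

```c
#include <stdint.h>

/* Multiply by 240 with a single multiplication (what imulq would do). */
static uint64_t mul240_imul(uint64_t x)
{
    return x * 240u;
}

/* Same result from shifts and a subtraction: x*256 - x*16.  Unsigned
   arithmetic wraps modulo 2^64, so the two forms agree for every input. */
static uint64_t mul240_shift(uint64_t x)
{
    return (x << 8) - (x << 4);
}
```

Which form wins depends on the multiply cost the compiler assumes, and that is what the COSTS_N_INSNS entry for DI controls.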
[Bug target/115748] [15 Regression] gcc.target/i386/avx512bw-pr70509.c SIGILL with -m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115748
Hongtao Liu changed:
What       |Removed  |Added
Status     |ASSIGNED |RESOLVED
Resolution |---      |FIXED
--- Comment #4 from Hongtao Liu ---
Fixed in GCC15.
[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812
--- Comment #23 from Hongtao Liu ---
(In reply to edison from comment #22)
> for 607.cactuBSSN_s,if use preENV_GOMP_CPU_AFFINITY = 0-23 in CPU2017 .cfg,
> all p-core(i9-13900k) usage will down to 15%(the e-core almost 100%), if
> comment out it all p-core usage will up to 60%.
>
> 607.cactuBSSN_s on i9-13900K
> gcc 14.1
>
> preENV_GOMP_CPU_AFFINITY = 0-23: 60.1 (-41.7 % slower)
> # preENV_GOMP_CPU_AFFINITY = 0-23: 103
>
> but for AMD Zen4(+) that maybe another story so far(AMD Zen4 need
> preENV_GOMP_CPU_AFFINITY to make the threads run on high performance core
> first).

Because E-cores run slower than P-cores, binding each thread to a core prevents threads from migrating from an E-core to a P-core.
[Bug target/115748] [15 Regression] gcc.target/i386/avx512bw-pr70509.c SIGILL with -m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115748
Hongtao Liu changed:
What             |Removed     |Added
Ever confirmed   |0           |1
Status           |UNCONFIRMED |ASSIGNED
Last reconfirmed |            |2024-07-02
--- Comment #2 from Hongtao Liu ---
We can move that part into a separate function and add a target attribute to it.
[Bug target/107432] __builtin_convertvector generates inefficient code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432
Hongtao Liu changed:
What       |Removed |Added
Resolution |---     |FIXED
CC         |        |liuhongt at gcc dot gnu.org
Status     |NEW     |RESOLVED
--- Comment #13 from Hongtao Liu ---
Fixed in GCC15.
[Bug target/114189] Target implements obsolete vcond{,u,eq} expanders
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114189
Bug 114189 depends on bug 115517, which changed state.
Bug 115517 Summary: Fix x86 regressions after dropping uses of vcond{,u,eq}_optab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
What       |Removed     |Added
Status     |UNCONFIRMED |RESOLVED
Resolution |---         |FIXED
[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
Hongtao Liu changed:
What       |Removed     |Added
Resolution |---         |FIXED
Status     |UNCONFIRMED |RESOLVED
--- Comment #15 from Hongtao Liu ---
Fixed.
[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
--- Comment #14 from Hongtao Liu ---
Regressions with SSE4.1 and above are fixed in GCC15; SSE2 regressions are tracked in PR115683.
[Bug target/115610] -flate-combine disabled by default for x86 port
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115610
Hongtao Liu changed:
What       |Removed  |Added
Resolution |---      |FIXED
Status     |ASSIGNED |RESOLVED
--- Comment #4 from Hongtao Liu ---
Fixed by r15-1735-ge62ea4fb8ffcab
[Bug tree-optimization/115693] 8 std::byte std::array comparison potential missed optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115693 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #6 from Hongtao Liu ---
> >
> > So it makes more sense to fix this in the optimization passes, instead of
> > ad-hoc hack in libstdc++.
> >
> > But I'm not sure if there already exists a dup.
>
> Let's keep this bug for the above testcase(s). For test() the issue is
> that even with SSE4.1 we don't seem to support ptest for V8QImode?

With SSE4.1 and above, we can support cbranchv8qi (and other 32/64-bit vectors) with pmovzxv8qiv8hi + cbranchv8hi.
[Bug middle-end/115675] [15 Regression] truncv4hiv4qi affect r14-1402-gd8545fb2c71683's optimization.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115675 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #2 from Hongtao Liu --- (In reply to Richard Biener from comment #1) > so it's now SLP vectorized? Yes, and the vectorization does not look reasonable. It used to be vectorized as v4qi vector CTOR + v4qi vector store. Now it's vectorized as v4hi vector CTOR + truncv4hiv4qi + v4qi vector store.
[Bug target/115683] New: SSE2 regressions after obselete of vcond{,u,eq}.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115683

Bug ID: 115683
Summary: SSE2 regressions after obselete of vcond{,u,eq}.
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: liuhongt at gcc dot gnu.org
Target Milestone: ---

Whole failure list.

g++: g++.target/i386/pr100637-1b.C -std=gnu++14 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C -std=gnu++17 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C -std=gnu++20 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C -std=gnu++98 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1w.C -std=gnu++14 scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C -std=gnu++17 scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C -std=gnu++20 scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C -std=gnu++98 scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr103861-1.C -std=gnu++14 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C -std=gnu++17 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C -std=gnu++20 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C -std=gnu++98 scan-assembler-times pcmpeqb 2
gcc: gcc.target/i386/pr88540.c scan-assembler minpd

One extra pcmpeq instruction is generated in the 3 testcases below for the GTU comparison. x86 doesn't support GTU comparison natively and uses psubusw + pcmpeq + pcmpeq instead. The second pcmpeq negates the mask, and in the vcond{,u,eq} expander that negation can be eliminated by just swapping if_true and if_else.

g++: g++.target/i386/pr100637-1b.C
g++: g++.target/i386/pr100637-1w.C
g++: g++.target/i386/pr103861-1.C

This one may be a little more difficult: it's the x86-specific floating point min/max{ps,pd}, which is an exact match of a > b ? a : b and is not IEEE-conformant.

gcc: gcc.target/i386/pr88540.c scan-assembler minpd
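The GTU emulation described in this report can be modelled one 16-bit lane at a time in plain C. This is only a sketch: the helper names are hypothetical, each function stands in for one lane of the named instruction, and the real expander works on whole vectors. It shows why the second pcmpeq (the mask negation) is redundant once the two select arms are swapped.

```c
#include <stdint.h>

/* One lane of psubusw: unsigned saturating subtraction.  */
static uint16_t us_minus(uint16_t a, uint16_t b)
{
    return a > b ? (uint16_t)(a - b) : 0;
}

/* One lane of pcmpeqw: all-ones when equal, zero otherwise.  */
static uint16_t eq_mask(uint16_t a, uint16_t b)
{
    return a == b ? 0xFFFF : 0;
}

/* One lane of the pand/pandn/por blend: mask ? t : f.  */
static uint16_t blend(uint16_t mask, uint16_t t, uint16_t f)
{
    return (uint16_t)((mask & t) | (~mask & f));
}

/* a >u b ? t : f with an explicit mask negation (the extra pcmpeq).  */
static uint16_t select_gtu_negate(uint16_t a, uint16_t b,
                                  uint16_t t, uint16_t f)
{
    uint16_t leu = eq_mask(us_minus(a, b), 0);  /* a <=u b */
    uint16_t gtu = eq_mask(leu, 0);             /* negate the mask */
    return blend(gtu, t, f);
}

/* Same result without the negation: keep the LEU mask, swap the arms.  */
static uint16_t select_gtu_swap(uint16_t a, uint16_t b,
                                uint16_t t, uint16_t f)
{
    uint16_t leu = eq_mask(us_minus(a, b), 0);  /* a <=u b */
    return blend(leu, f, t);
}
```

Both functions return the same value for every input; the second simply needs one comparison fewer.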
[Bug target/115462] [15 regression] 416.gamess regressed 4-6% on x86_64 since r15-882-g1d6199e5f8c1c0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115462 Hongtao Liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #6 from Hongtao Liu --- Fixed in GCC15.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163 Bug 26163 depends on bug 115462, which changed state. Bug 115462 Summary: [15 regression] 416.gamess regressed 4-6% on x86_64 since r15-882-g1d6199e5f8c1c0 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115462 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED
[Bug tree-optimization/115450] [15 Regression] cpu2017 502.gcc runtime miscompute on aarch64 with SVE since r15-1006-gd93353e6423eca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115450 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #5 from Hongtao Liu --- (In reply to Andrew Pinski from comment #1) > >[r15-1006-gd93353e6423eca] Do single-lane SLP discovery for reductions > > > Interesting because PR 115256 bisect it to an earlier patch. For PR 115256, the issue is fixed after adding -fno-strict-aliasing.
[Bug target/115610] -flate-combine disabled by default for x86 port
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115610 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org Last reconfirmed||2024-06-24 Status|UNCONFIRMED |ASSIGNED Ever confirmed|0 |1 --- Comment #1 from Hongtao Liu --- Thanks, I'll take a look.
[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406 --- Comment #7 from Hongtao Liu ---
>
> BTW, when assign -1 to vector(1) , should the upper bit be
> cleared? Look like only 1 element boolean vector is cleared, but not
> vector(2) .
> If the upper bits are not cleared, both 2 cases are equal.

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 710d697c021..0f045f851d1 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8077,7 +8077,7 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,
 {
   tree itype = TREE_TYPE (TREE_TYPE (expr));
   if (VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (expr))
-      && TYPE_PRECISION (itype) <= BITS_PER_UNIT)
+      && TYPE_PRECISION (itype) < BITS_PER_UNIT)
     {
       /* This is the only case in which elements can be smaller than a byte.
         Element 0 is always in the lsb of the containing byte.  */

This can fix it. It looks like this code is supposed to handle itype *less than*, not *less than or equal to*, BITS_PER_UNIT?
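To see why the boundary matters, here is a simplified model of a sub-byte packing path. It is an assumption-laden sketch, not GCC's code: it assumes that a true element sets only the lowest bit of its slot, with element 0 in the lsb of the byte as the comment in the hunk says, and `pack_bool_vector` is a hypothetical name. Under that assumption, an 8-bit boolean element holding -1 would be encoded as 0x01 rather than 0xFF if it took this path.

```c
#include <string.h>

/* Simplified model (not GCC's code): pack N boolean elements of
   ELT_BITS bits each into bytes, element 0 in the lsb.  A true
   element sets only the lowest bit of its ELT_BITS-wide slot.  */
static void pack_bool_vector(const int *elts, unsigned n, unsigned elt_bits,
                             unsigned char *out, unsigned out_len)
{
    memset(out, 0, out_len);
    for (unsigned i = 0; i < n; i++) {
        if (elts[i]) {
            unsigned bit = i * elt_bits;
            /* Skip elements that would fall outside the buffer.  */
            if (bit / 8 < out_len)
                out[bit / 8] |= (unsigned char)(1u << (bit % 8));
        }
    }
}
```

With 1-bit elements the packing is exactly what one wants; with 8-bit elements the upper seven bits of each true element are lost, which matches the "upper bits cleared" behaviour discussed in this bug.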
[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517 --- Comment #6 from Hongtao Liu --- (In reply to rguent...@suse.de from comment #5) > On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517 > > > > --- Comment #4 from Hongtao Liu --- > > (In reply to rguent...@suse.de from comment #3) > > > On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote: > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517 > > > > > > > > --- Comment #2 from Hongtao Liu --- > > > > (In reply to Richard Biener from comment #1) > > > > > Btw, I had opened PR115490 with my results for this already. Some > > > > > mitigation > > > > > should be from optimizing ISEL expansion to vcond_mask and I'd start > > > > > with > > > > > looking at some of the fallout from that side (note that might require > > > > > the backend reject not natively implemented vec_cmp via its operand 1 > > > > > predicate) > > > > > > > > w/o AVX512, vector integer comparison only supports EQ/GT, others > > > > comparison > > > > rtx_cost is transformed to that. (.i.e GTU is emulated with us_minus + > > > > eq + > > > > negative the vector mask) > > > > If we restrict the predicate of operand 1, would middle-end reject > > > > vectorization (or lower it to scalar version)? > > > > > > Richard suggests that we implement the "obvious" transforms like > > > inversion in the middle-end but if for example unsigned compares > > > are not supported the us_minus + eq + negative trick isn't on > > > that list. > > > > > > The main reason to restrict vec_cmp would be to avoid > > > a <= b ? c : d going with an unsupported vec_cmp but instead > > > do a > b ? d : c - the alternative is trying to fix this > > > on the RTL side via combine. I understand the non-native > > > > Yes, I have a patch which can fix most regressions via pattern match in > > combine. > > Still there is a situation that is difficult to deal with, mainly the > > optimization w/o sse4.1 . 
> > Because pblendvb/blendvps/blendvpd only exists under
> > sse4.1, w/o sse4.1, it takes 3 instructions (pand,pandn,por) to simulate the
> > vcond_mask, and the combine matches up to 4 instructions, which makes it
> > currently impossible to use the combine to recover those optimizations in
> > the vcond{,u,eq}.i.e min/max.
> > In the case of sse 4.1 and above, there is basically no regression anymore.
>
> Maybe it's possible to use a define_insn_and_split for blends w/o SSE 4.1?
> That would allow combine matching the high-level blend operation and
> we'd only lower it afterwards? The question is what we lose in
> combinations of/into the lowered pand/pandn/por of course.

I'd rather live with those regressions, since they only exist below sse4.1.

>
> Maybe it's possible to catch the higher-level optimization (min/max)
> on the GIMPLE level instead?

For the integral part, I believe the optimization is already there at the GIMPLE level. For the floating-point part, x86 {max,min}{ps,pd} is not IEEE-conformant; it's an exact match of cond_expr a < b ? a : b (w/ consideration of -0.0 and NAN).
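The non-IEEE behaviour can be illustrated with a scalar model of one lane. This is a sketch, with `x86_min` as a hypothetical name for "what one lane of minps/minpd computes", taken literally as a < b ? a : b. The result then depends on operand order whenever a NaN or a pair of differently signed zeros is involved, whereas C's fmin returns the non-NaN operand in both NaN cases.

```c
#include <math.h>

/* Scalar model of one lane of minps/minpd: exactly a < b ? a : b.
   Any comparison involving NaN is false, so the second operand wins.  */
static double x86_min(double a, double b)
{
    return a < b ? a : b;
}

/* Returns 1 when swapping the operands changes the result's NaN-ness
   or sign, i.e. when the operation is not symmetric for this pair.  */
static int x86_min_order_matters(double a, double b)
{
    double r1 = x86_min(a, b);
    double r2 = x86_min(b, a);
    if (isnan(r1) != isnan(r2))
        return 1;
    return (signbit(r1) != 0) != (signbit(r2) != 0);
}
```

For example, `x86_min(NAN, 1.0)` yields 1.0 while `x86_min(1.0, NAN)` yields NaN, and `x86_min(0.0, -0.0)` yields -0.0 while `x86_min(-0.0, 0.0)` yields +0.0. This is why the pattern can only be matched against the exact cond_expr form and not against a generic IEEE min/max.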
[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517 --- Comment #4 from Hongtao Liu --- (In reply to rguent...@suse.de from comment #3) > On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517 > > > > --- Comment #2 from Hongtao Liu --- > > (In reply to Richard Biener from comment #1) > > > Btw, I had opened PR115490 with my results for this already. Some > > > mitigation > > > should be from optimizing ISEL expansion to vcond_mask and I'd start with > > > looking at some of the fallout from that side (note that might require > > > the backend reject not natively implemented vec_cmp via its operand 1 > > > predicate) > > > > w/o AVX512, vector integer comparison only supports EQ/GT, others comparison > > rtx_cost is transformed to that. (.i.e GTU is emulated with us_minus + eq + > > negative the vector mask) > > If we restrict the predicate of operand 1, would middle-end reject > > vectorization (or lower it to scalar version)? > > Richard suggests that we implement the "obvious" transforms like > inversion in the middle-end but if for example unsigned compares > are not supported the us_minus + eq + negative trick isn't on > that list. > > The main reason to restrict vec_cmp would be to avoid > a <= b ? c : d going with an unsupported vec_cmp but instead > do a > b ? d : c - the alternative is trying to fix this > on the RTL side via combine. I understand the non-native Yes, I have a patch which can fix most regressions via pattern match in combine. Still there is a situation that is difficult to deal with, mainly the optimization w/o sse4.1 . Because pblendvb/blendvps/blendvpd only exists under sse4.1, w/o sse4.1, it takes 3 instructions (pand,pandn,por) to simulate the vcond_mask, and the combine matches up to 4 instructions, which makes it currently impossible to use the combine to recover those optimizations in the vcond{,u,eq}.i.e min/max. 
In the case of sse 4.1 and above, there is basically no regression anymore.

The regression testcases w/o sse4.1:

FAIL: g++.target/i386/pr100637-1b.C -std=gnu++14 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C -std=gnu++17 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C -std=gnu++20 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C -std=gnu++98 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1w.C -std=gnu++14 scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C -std=gnu++17 scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C -std=gnu++20 scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C -std=gnu++98 scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr103861-1.C -std=gnu++14 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C -std=gnu++17 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C -std=gnu++20 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C -std=gnu++98 scan-assembler-times pcmpeqb 2
FAIL: gcc.target/i386/pr88540.c scan-assembler minpd
[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517 --- Comment #2 from Hongtao Liu --- (In reply to Richard Biener from comment #1)
> Btw, I had opened PR115490 with my results for this already. Some mitigation
> should be from optimizing ISEL expansion to vcond_mask and I'd start with
> looking at some of the fallout from that side (note that might require
> the backend reject not natively implemented vec_cmp via its operand 1
> predicate)

w/o AVX512, vector integer comparison only supports EQ/GT; other comparison codes are transformed into those (i.e. GTU is emulated with us_minus + eq + negation of the vector mask). If we restrict the predicate of operand 1, would the middle-end reject vectorization (or lower it to a scalar version)?
[Bug target/115517] New: Fix regression after dropping uses of vcond{,u,eq}_optab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

Bug ID: 115517
Summary: Fix regression after dropping uses of vcond{,u,eq}_optab
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: liuhongt at gcc dot gnu.org
Depends on: 114189
Target Milestone: ---
Target: x86_64-*-* i?86-*-*

> I'd appreciate testing, I do not expect fallout for x86 or arm/aarch64.
> > I know riscv doesn't implement any of the legacy optabs. But less
> > maintained vector targets might need adjustments.
> >
> At GCC14, I tried to remove these expanders in the x86 backend, and it
> regressed some testcases, mainly because of the optimizations we did
> in ix86_expand_{int,fp}_vcond.
> I've started testing your patch, it's possible that we still need to
> move the ix86_expand_{int,fp}_vcond optimizations to the
> middle-end(isel or match.pd)or add extra patterns to handle it at the
> rtl pass_combine.

These are the new failures I got:

g++: g++.target/i386/avx-pr54700-1.C scan-assembler-not vpcmpgt[bdq]
g++: g++.target/i386/avx-pr54700-1.C scan-assembler-times vblendvpd 4
g++: g++.target/i386/avx-pr54700-1.C scan-assembler-times vblendvps 4
g++: g++.target/i386/avx-pr54700-1.C scan-assembler-times vpblendvb 2
g++: g++.target/i386/avx2-pr54700-1.C scan-assembler-not vpcmpgt[bdq]
g++: g++.target/i386/avx2-pr54700-1.C scan-assembler-times vblendvpd 4
g++: g++.target/i386/avx2-pr54700-1.C scan-assembler-times vblendvps 4
g++: g++.target/i386/avx2-pr54700-1.C scan-assembler-times vpblendvb 2
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C -std=gnu++14 scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C -std=gnu++14 scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C -std=gnu++17 scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C -std=gnu++17 scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C -std=gnu++20 scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C -std=gnu++20 scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C -std=gnu++98 scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C -std=gnu++98 scan-assembler-times vminph 3
g++: g++.target/i386/pr100637-1b.C -std=gnu++14 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C -std=gnu++17 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C -std=gnu++20 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C -std=gnu++98 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1w.C -std=gnu++14 scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C -std=gnu++17 scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C -std=gnu++20 scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C -std=gnu++98 scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100738-1.C -std=gnu++14 scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C -std=gnu++14 scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C -std=gnu++14 scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C -std=gnu++17 scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C -std=gnu++17 scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C -std=gnu++17 scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C -std=gnu++20 scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C -std=gnu++20 scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C -std=gnu++20 scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C -std=gnu++98 scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C -std=gnu++98 scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C -std=gnu++98 scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr103861-1.C -std=gnu++14 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C -std=gnu++17 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C -std=gnu++20 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C -std=gnu++98 scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr61747.C -std=gnu++14 scan-assembler-times max 4
g++: g++.target/i386/pr61747.C -std=gnu++14 scan-assembler-times min 4
g++: g++.target/i386/pr61747.C -std=gnu++17 scan-assembler-times max 4
g++: g++.target/i386/pr61747.C -std=gnu++17 scan-assembler-times min 4
g++: g++.target/i386/pr61747.C -std=g
[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021 --- Comment #5 from Hongtao Liu --- It's fixed by r15-1100-gec985bc97a0157
[Bug target/115463] [15 regression] 526.blender_r regressed 5% on Zen2 with -Ofast -flto -march=native since r15-1058-gc989e59fc99d99
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115463 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #4 from Hongtao Liu --- should be fixed by r15-1293-g83a765768510d1f329887116757d6818d7846717.
[Bug target/115462] [15 regression] 416.gamess regressed 4-6% on x86_64 since r15-882-g1d6199e5f8c1c0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115462 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #2 from Hongtao Liu --- (In reply to Richard Biener from comment #1) > it might possibly affect IVOPTs Probably, we're investigating.
[Bug target/115452] ICE when dump stv2 for gcc.target/i386/pr70322-2.c with -march=cascadelake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115452 Hongtao Liu changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #3 from Hongtao Liu --- Fixed in GCC15.
[Bug target/115452] New: ICE when dump stv2 for gcc.target/i386/pr70322-2.c with -march=cascadelake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115452

Bug ID: 115452
Summary: ICE when dump stv2 for gcc.target/i386/pr70322-2.c with -march=cascadelake
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: liuhongt at gcc dot gnu.org
Target Milestone: ---

gcc -m32 -march=cascadelake ./gcc/testsuite/gcc.target/i386/pr70322-2.c -mstv -mno-bmi -S -Os -fdump-rtl-stv2-details

./gcc/testsuite/gcc.target/i386/pr70322-2.c: In function ‘foo’:
./gcc/testsuite/gcc.target/i386/pr70322-2.c:12:1: internal compiler error: RTL check: expected code 'reg', have 'subreg' in rhs_regno, at rtl.h:1934
   12 | }
      | ^
0x88ef75 rtl_check_failed_code1(rtx_def const*, rtx_code, char const*, int, char const*)
        ./gcc/rtl.cc:770
0x96be78 rhs_regno(rtx_def const*)
        ./gcc/rtl.h:1934
0x96cd8d rhs_regno(rtx_def const*)
        ./genrtl.h:38
0x96cd8d convert_op
        ./gcc/config/i386/i386-features.cc:1056
0x1af7711 convert_insn
        ./gcc/config/i386/i386-features.cc:1468
0x1af9808 convert
        ./gcc/config/i386/i386-features.cc:1987
0x1af9808 convert_scalars_to_vector
        ./gcc/config/i386/i386-features.cc:2536
0x1af9808 execute
        ./gcc/config/i386/i386-features.cc:2750

cut from i386-features.cc:1056---
  if (dump_file)
    fprintf (dump_file, " Preloading operand for insn %d into r%d\n",
             INSN_UID (insn), REGNO (tmp));
--cut end---

Looks like tmp is a SUBREG.
[Bug rtl-optimization/115384] [15 Regression] ICE: RTL check: expected code 'const_int', have 'const_wide_int' in simplify_binary_operation_1, at simplify-rtx.cc:4088 since r15-1047-g7876cde25cbd2f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115384 Hongtao Liu changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #5 from Hongtao Liu --- Fixed.
[Bug testsuite/115365] New test case gcc.dg/pr100927.c from r15-1022-gb05288d1f1e4b6 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115365 --- Comment #7 from Hongtao Liu --- +/* { dg-final { scan-rtl-dump-times {(?n)^(?!.*REG_EQUIV)(?=.*\(fix:SI)} 3 "final" } } */ Does this fix the testcase on solaris2?
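The intent of that pattern, counting dump lines that contain "(fix:SI" but not "REG_EQUIV", can be spelled out without lookaheads. The sketch below is plain C with `strstr` and is only an illustration of the matching rule, not part of the testsuite; `count_fix_si` is a hypothetical name and `sample_dump` holds shortened stand-ins for the powerpc64le dump lines quoted in comment #1 of this bug.

```c
#include <string.h>

/* Shortened stand-ins for lines of the .final RTL dump.  */
static const char *const sample_dump[] = {
    "(fix:SI (reg:SF 32 0 [120]))) \"pr100927.c\":12:10 428 {*fix_truncsfsi2_p8}",
    " (expr_list:REG_EQUIV (fix:SI (const_double:SF 2.147483648e+9 [0x0.8p+32]))",
    "(fix:SI (reg:SF 32 0 [120]))) \"pr100927.c\":21:10 428 {*fix_truncsfsi2_p8}",
    " (expr_list:REG_EQUIV (fix:SI (const_double:SF -Inf [-Inf]))",
    "(fix:SI (reg:SF 32 0 [120]))) \"pr100927.c\":30:10 428 {*fix_truncsfsi2_p8}",
};

/* Count lines that mention "(fix:SI" and carry no REG_EQUIV note,
   mirroring (?!.*REG_EQUIV)(?=.*\(fix:SI) applied per line.  */
static int count_fix_si(const char *const *lines, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (strstr(lines[i], "(fix:SI") && !strstr(lines[i], "REG_EQUIV"))
            count++;
    return count;
}
```

On the five sample lines this counts the three real conversions and skips the two REG_EQUIV notes, which is the count of 3 the test expects.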
[Bug target/115418] Extra movapd emitted for MAX implementation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115418 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #3 from Hongtao Liu --- (In reply to Andrew Pinski from comment #2)
> Note the issue is ix86_expand_sse_fp_minmax only handles LT/UNGE but it
> should handle GT/UNLT with both parts swapped (comparison and true/false).
>

GT/UNLT is "canonicalized" to LT/UNGT in ix86_prepare_sse_fp_compare_args:

    case GE:
    case GT:
    case UNLE:
    case UNLT:
      /* These are not supported directly before AVX, and furthermore
         ix86_expand_sse_fp_minmax only optimizes LT/UNGE.  Swap the
         comparison operands to transform into something that is
         supported.  */
      std::swap (*pop0, *pop1);
      code = swap_condition (code);
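What the swap does to the comparison can be shown with a toy model. Everything here is hypothetical: `enum cmp_code` is a small stand-in for the RTL comparison codes, and `swap_cond` and `prepare_fp_compare` only imitate the shape of the quoted snippet on plain doubles. Swapping the operands turns GT into LT and UNLT into UNGT, while GE and UNLE become LE and UNGE.

```c
/* Toy subset of comparison codes; not GCC's enum rtx_code.  */
enum cmp_code {
    CMP_LT, CMP_LE, CMP_GT, CMP_GE,
    CMP_UNLT, CMP_UNLE, CMP_UNGT, CMP_UNGE
};

/* The code that gives the same result once the operands are exchanged.  */
static enum cmp_code swap_cond(enum cmp_code c)
{
    switch (c) {
    case CMP_LT:   return CMP_GT;
    case CMP_GT:   return CMP_LT;
    case CMP_LE:   return CMP_GE;
    case CMP_GE:   return CMP_LE;
    case CMP_UNLT: return CMP_UNGT;
    case CMP_UNGT: return CMP_UNLT;
    case CMP_UNLE: return CMP_UNGE;
    case CMP_UNGE: return CMP_UNLE;
    }
    return c;
}

/* Mirror of the quoted case list: for GE/GT/UNLE/UNLT swap both the
   operands and the condition.  Returns 1 when a swap happened.  */
static int prepare_fp_compare(enum cmp_code *code, double *op0, double *op1)
{
    switch (*code) {
    case CMP_GE:
    case CMP_GT:
    case CMP_UNLE:
    case CMP_UNLT: {
        double tmp = *op0;
        *op0 = *op1;
        *op1 = tmp;
        *code = swap_cond(*code);
        return 1;
    }
    default:
        return 0;
    }
}
```

After such a swap, a min/max matcher that only recognises LT/UNGE also has to account for the fact that the true/false arms now refer to exchanged operands.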
[Bug testsuite/115365] New test case gcc.dg/pr100927.c from r15-1022-gb05288d1f1e4b6 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115365 Hongtao Liu changed: What|Removed |Added Target|powerpc64le-linux-gnu, |powerpc64le-linux-gnu, |sparc-sun-solaris2.11 |sparc-sun-solaris2.11, ||arm-eabi, cortex-m0 --- Comment #6 from Hongtao Liu --- It also fails on arm-eabi cortex-m0.
[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406 --- Comment #6 from Hongtao Liu --- For a 1-element vector, when the backend doesn't support its vector mode, the scalar mode is used for the type, which makes expand_vec_cond_expr_p use QImode for the icode check (vcond_mask_qiqi). It could also be the case when both the data type and cmp_type are vector_boolean_type. It looks like vcond_mask_qiqi is dichotomous. For the former, it should be operands[3] == 1 ? operands[1] : operands[2], since the mask is a 1-element boolean vector. For the latter, it should be (operands[1] & operands[3]) | (operands[2] & ~operands[3]). BTW, when assigning -1 to vector(1) , should the upper bits be cleared? It looks like only the 1-element boolean vector is cleared, but not vector(2) . If the upper bits are not cleared, the 2 cases are equal.
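The two readings of vcond_mask_qiqi can be compared on plain bytes. This is a sketch with hypothetical helper names; operands[1], operands[2] and operands[3] from the comment become `a`, `b` and `mask`. The point is that the bitwise form only yields the full value when a true mask is all ones: once the mask has been reduced to 1, it returns just the lowest bit of `a`.

```c
#include <stdint.h>

/* First reading: the mask is a single boolean, true is the value 1.  */
static uint8_t select_bool(uint8_t mask, uint8_t a, uint8_t b)
{
    return mask == 1 ? a : b;
}

/* Second reading: the mask is a bit vector, select bit by bit.  */
static uint8_t select_bits(uint8_t mask, uint8_t a, uint8_t b)
{
    return (uint8_t)((a & mask) | (b & ~mask));
}
```

With `a = 0xFF` (that is, -1 as a char) and `b = 0`, `select_bits(0xFF, a, b)` gives 0xFF, but `select_bits(1, a, b)` gives 1, which is the value that trips the abort in the testcase from comment #3.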
[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406 --- Comment #5 from Hongtao Liu --- > _2 = VEC_COND_EXPR <_1, { -1 }, { 0 }>; Hmm, it should check vcond_mask_qiv1qi instead of vcond_mask_qiqi. I guess that since the backend doesn't support v1qi, TYPE_MODE of V is QImode, and so it wrongly checked vcond_mask_qiqi.
[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406 --- Comment #4 from Hongtao Liu --- > > and for _2 = VIEW_CONVERT_EXPR(_1); we explicitly > clear the upper bits due to PR113576, and then we get 1 hit the abort. It's not VIEW_CONVERT_EXPR that clears the upper bits, but _1 = { -1 };
[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406 --- Comment #3 from Hongtao Liu ---

typedef __attribute__((__vector_size__ (1))) char V;

char
foo (V v)
{
  return ((V) v == v)[0];
}

int
main ()
{
  char x = foo ((V) { });
  if (x != -1)
    __builtin_abort ();
}

w/ vcond_mask_qiqi, it's not lowered by veclower, and we get

char foo (V v)
{
  vector(1) signed char D.5142;
  char D.5141;
  vector(1) _1;
  vector(1) signed char _2;
  char _5;

  :
  _1 = { -1 };
  _2 = VEC_COND_EXPR <_1, { -1 }, { 0 }>;
  D.5142 = _2;
  _5 = VIEW_CONVERT_EXPR(D.5142);

  :
  :
  return _5;
}

But it's further simplified by isel to

char foo (V v)
{
  vector(1) signed char D.3765;
  char D.3764;
  vector(1) _1;
  vector(1) signed char _2;
  char _5;

  :
  _1 = { -1 };
  _2 = VIEW_CONVERT_EXPR(_1);
  D.3765 = _2;
  _5 = VIEW_CONVERT_EXPR(D.3765);

  :
  :
  return _5;
}

and for _2 = VIEW_CONVERT_EXPR(_1); we explicitly clear the upper bits due to PR113576, and then we get 1 and hit the abort.

It sounds to me that

  _1 = { -1 };
  _2 = VEC_COND_EXPR <_1, { -1 }, { 0 }>;

shouldn't be simplified to

  _2 = VIEW_CONVERT_EXPR(_1);

when nunits is less than the mode precision, since the upper bits will be cleared.
[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406 Hongtao Liu changed: What|Removed |Added Status|NEW |ASSIGNED --- Comment #2 from Hongtao Liu --- I'll take a look.
[Bug rtl-optimization/115384] [15 Regression] ICE: RTL check: expected code 'const_int', have 'const_wide_int' in simplify_binary_operation_1, at simplify-rtx.cc:4088 since r15-1047-g7876cde25cbd2f
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115384 Hongtao Liu changed: What|Removed |Added Status|NEW |ASSIGNED --- Comment #3 from Hongtao Liu --- Mine.
[Bug testsuite/115334] new test case gcc.dg/vect/pr112325.c from r15-919-gef27b91b62c3aa fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115334 Hongtao Liu changed: What|Removed |Added Resolution|--- |FIXED Status|UNCONFIRMED |RESOLVED --- Comment #3 from Hongtao Liu --- Should be fixed by r15-1088-gb24f2954dbc13d
[Bug testsuite/115365] New test case gcc.dg/pr100927.c from r15-1022-gb05288d1f1e4b6 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115365 --- Comment #5 from Hongtao Liu --- (In reply to Rainer Orth from comment #4)
> Unfortunately, the fix broke 32-bit Solaris/SPARC in exchange:
>
> FAIL: gcc.dg/pr100927.c scan-rtl-dump-times final "(?n)(fix:SI" 3
>

/* { dg-final { scan-rtl-dump-times {(?n)^[ \t]*\(fix:SI} 3 "final" } } */

The new fix checks that there are only spaces or tabs before (fix:SI, using "^[ \t]*". Does Solaris treat ^ as the start of a line?

I tried

grep "^[ \t]*(fix:SI" your.dump

(fix:SI (fix:SF (reg:SF 40 %f8 [111] "/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":13:1 213 {fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [111] "/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":22:1 213 {fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [111] "/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":31:1 213 {fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [112] "/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":12:10 213 {fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [112] "/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":21:10 213 {fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [112] "/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":30:10 213 {fix_truncsfsi2}

And it works on my x86-pc-linux-gnu machine.
[Bug target/115370] [15 regression] gcc.target/i386/pr77881.c FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115370 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #2 from Hongtao Liu --- We can add a target hook, targetm.support_ccmp_p, whose default implementation can be targetm.gen_ccmp_first != NULL.
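The shape of such a hook can be sketched outside of GCC. Everything below is hypothetical: `struct target_hooks` is a toy stand-in for the real target vector, the function-pointer types are invented, and the default assumes the intended meaning is "ccmp is supported when the target provides gen_ccmp_first", i.e. when that pointer is non-NULL.

```c
#include <stddef.h>

/* Toy stand-in for the target hook vector; only the two field names
   come from the comment, the types are made up for illustration.  */
struct target_hooks {
    void *(*gen_ccmp_first)(void);
    int (*support_ccmp_p)(const struct target_hooks *);
};

/* Assumed default: ccmp is usable iff gen_ccmp_first is provided.  */
static int default_support_ccmp_p(const struct target_hooks *t)
{
    return t->gen_ccmp_first != NULL;
}

/* Hypothetical backend implementation of gen_ccmp_first.  */
static void *dummy_gen_ccmp_first(void)
{
    return NULL;
}
```

A backend that defines gen_ccmp_first but wants ccmp off for some configurations could then install its own support_ccmp_p, and generic code (for example the test's expectations about ccmp expansion) would query the hook instead of checking gen_ccmp_first directly.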
[Bug rtl-optimization/115369] New: ifcvt failed to condition elimination for__builtin_mul_overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115369

Bug ID: 115369
Summary: ifcvt failed to condition elimination for__builtin_mul_overflow
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: liuhongt at gcc dot gnu.org
Target Milestone: ---

int
foo (unsigned a, unsigned b, unsigned d, unsigned e, int* p)
{
  unsigned int r;
  int c = __builtin_mul_overflow (a, b, &r);
  d += c;
  return c ? d : e;
}

(jump_insn 14 13 47 2 (set (pc)
        (if_then_else (eq (reg:CCO 17 flags)
                (const_int 0 [0]))
            (label_ref 17)
            (pc))) "/app/example.c":5:13 1212 {*jcc}
     (expr_list:REG_DEAD (reg:CCO 17 flags)
        (int_list:REG_BR_PROB 536868 (nil)))
 -> 17)
(note 47 14 17 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
; pc falls through to BB 5
(code_label 17 47 40 4 3 (nil) [1 uses])
(note 40 17 29 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 29 40 30 4 (parallel [
            (set (reg/v:SI 105 [ e ])
                (plus:SI (reg/v:SI 104 [ d ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "/app/example.c":6:7 272 {*addsi_1}
     (expr_list:REG_DEAD (reg/v:SI 104 [ d ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil
(code_label 30 29 31 5 4 (nil) [0 uses])
(note 31 30 36 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
(insn 36 31 37 5 (set (reg/i:SI 0 ax)
        (reg/v:SI 105 [ e ])) "/app/example.c":8:1 85 {*movsi_internal}
     (expr_list:REG_DEAD (reg/v:SI 105 [ e ])
        (nil)))
(insn 37 36 0 5 (use (reg/i:SI 0 ax)) "/app/example.c":8:1 -1
     (nil))

The ce2 dump looks quite simple; I'm not sure why it failed.
[Bug target/43618] Incorrect sse2_cvtX2Y pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43618 Hongtao Liu changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #5 from Hongtao Liu --- The pattern issue is fixed in GCC13.1 and later.
[Bug target/43618] Incorrect sse2_cvtX2Y pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43618 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #4 from Hongtao Liu --- The pattern issue is fixed in GCC13.1 and later.
[Bug other/115365] New test case gcc.dg/pr100927.c from r15-1022-gb05288d1f1e4b6 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115365 --- Comment #1 from Hongtao Liu ---

pr100927.c.349r.final:(fix:SI (reg:SF 32 0 [120]))) "../../gcc/intel-innersource/pr115365/gcc/testsuite/gcc.dg/pr100927.c":12:10 428 {*fix_truncsfsi2_p8}
pr100927.c.349r.final: (expr_list:REG_EQUIV (fix:SI (const_double:SF 2.147483648e+9 [0x0.8p+32]))
pr100927.c.349r.final:(fix:SI (reg:SF 32 0 [120]))) "../../gcc/intel-innersource/pr115365/gcc/testsuite/gcc.dg/pr100927.c":21:10 428 {*fix_truncsfsi2_p8}
pr100927.c.349r.final: (expr_list:REG_EQUIV (fix:SI (const_double:SF -Inf [-Inf]))
pr100927.c.349r.final:(fix:SI (reg:SF 32 0 [120]))) "../../gcc/intel-innersource/pr115365/gcc/testsuite/gcc.dg/pr100927.c":30:10 428 {*fix_truncsfsi2_p8}

There are 5 fix:SI in the final dump.
[Bug target/114428] [x86] psrad xmm, xmm, 16 and pand xmm, const_vector (0xffff x4) can be optimized to psrld
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114428 Hongtao Liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #3 from Hongtao Liu --- Fixed in GCC15.
[Bug rtl-optimization/115351] [14/15 regression] pointless movs when passing by value on x86-64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115351 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #2 from Hongtao Liu ---

There are

(insn 5 4 6 2 (set (reg:TI 110)
        (ior:TI (and:TI (reg:TI 110)
                (const_wide_int 0x))
            (zero_extend:TI (subreg:DI (reg:DF 111) 0 "/app/example.cpp":8:1 136 {*insvti_lowpart_1}
     (nil))
(insn 6 5 7 2 (set (reg:TI 110)
        (ior:TI (and:TI (reg:TI 110)
                (const_wide_int 0x0))
            (ashift:TI (zero_extend:TI (subreg:DI (reg:DF 112) 0))
                (const_int 64 [0x40] "/app/example.cpp":8:1 133 {*insvti_highpart_1}
     (nil))
(insn 7 6 8 2 (set (reg/v:TI 109 [ z ])

in GCC14's RTL dump; I guess it's related to r14-589-g1e3054d27c83ee?
[Bug target/115341] [15 regression] gcc.target/i386/apx-ndd-2.c etc. FAIL
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115341 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #1 from Hongtao Liu --- I think it's because binutils 2.42 only has initial support for Intel APX: 32 GPRs, NDD, PUSH2/POP2 and PUSHP/POPP. APX NF is only on the latest binutils trunk, and the apxf target only checks the initial support. I guess we need to add a separate target to check for the remaining APXF features (NF, CCMP/CTEST/CFCMOV).
[Bug other/115334] new test case gcc.dg/vect/pr112325.c from r15-919-gef27b91b62c3aa fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115334 --- Comment #2 from Hongtao Liu ---
diff --git a/gcc/testsuite/gcc.dg/vect/pr112325.c b/gcc/testsuite/gcc.dg/vect/pr112325.c
index dea6cca3b86..143903beab2 100644
--- a/gcc/testsuite/gcc.dg/vect/pr112325.c
+++ b/gcc/testsuite/gcc.dg/vect/pr112325.c
@@ -3,6 +3,7 @@
 /* { dg-require-effective-target vect_int } */
 /* { dg-require-effective-target vect_shift } */
 /* { dg-additional-options "-mavx2" { target x86_64-*-* i?86-*-* } } */
+/* { dg-additional-options "--param max-completely-peeled-insns=200" { target powerpc64*-*-* } } */
 
 typedef unsigned short ggml_fp16_t;
 static float table_f32_f16[1 << 16];
Does this patch work for you?
[Bug other/115334] new test case gcc.dg/vect/pr112325.c from r15-919-gef27b91b62c3aa fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115334 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment #1 from Hongtao Liu --- The power backend sets param_max_completely_peeled_insns to 400, so the inner loop is still completely unrolled. The testcase therefore needs an extra option for the power backend: --param max-completely-peeled-insns=200.
[Bug target/115299] [14/15 regression] pr86722.c failed to eliminate branch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115299 Hongtao Liu changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #4 from Hongtao Liu --- Fixed in GCC15
[Bug target/113609] EQ/NE comparison between avx512 kmask and -1 can be optimized with kxortest with checking CF.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113609 Hongtao Liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #4 from Hongtao Liu --- Fixed in GCC15
[Bug target/115299] [14/15 regression] pr86722.c failed to eliminate branch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115299 --- Comment #2 from Hongtao Liu --- > Maybe r14-53-g675b1a7f113adb . Probably; the current cost model may need adjustment.