[PATCH] Fix mismatch between constraint and predicate for ashl3_doubleword.

2024-07-29 Thread liuhongt
(insn 98 94 387 2 (parallel [ (set (reg:TI 337 [ _32 ]) (ashift:TI (reg:TI 329) (reg:QI 521))) (clobber (reg:CC 17 flags)) ]) "test.c":11:13 953 {ashlti3_doubleword} is reloaded into (insn 98 452 387 2 (parallel [

[PATCH] Fix mismatch between constraint and predicate for ashl3_doubleword.

2024-07-26 Thread liuhongt
(insn 98 94 387 2 (parallel [ (set (reg:TI 337 [ _32 ]) (ashift:TI (reg:TI 329) (reg:QI 521))) (clobber (reg:CC 17 flags)) ]) "test.c":11:13 953 {ashlti3_doubleword} is reloaded into (insn 98 452 387 2 (parallel [

[PATCH] [x86] Refine constraint "Bk" to define_special_memory_constraint.

2024-07-24 Thread liuhongt
For the below pattern, RA may still allocate r162 as a v/k register and try to reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi, which results in a linker error. (set (reg:DI 162) (mem/u/c:DI (const:DI (unspec:DI [(symbol_ref:DI ("a") [flags 0x60] )]

[PATCH] Relax ix86_hardreg_mov_ok after split1.

2024-07-22 Thread liuhongt
ix86_hardreg_mov_ok was added by r11-5066-gbe39636d9f68c4. >The solution proposed here is to have the x86 backend/recog prevent >early RTL passes composing instructions (that set likely_spilled hard >registers) that they (combine) can't simplify, until after reload. >We allow sets

[PATCH v2] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-17 Thread liuhongt
> Also, in case the insn is deleted, do: > > emit_note (NOTE_INSN_DELETED); > > DONE; > > instead of leaving (const_int 0) in the stream. > > So, the above insn preparation statements should read: > > --cut here-- > if (constm1_operand (operands[2], mode)) > emit_move_insn (operands[0],

[PATCH] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-16 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/115843 * config/i386/predicates.md (const0_or_m1_operand): New predicate. * config/i386/sse.md (*_store_mask_1): New pre_reload define_insn_and_split.

[PATCH] Fix SSA_NAME leak when def_stmt is removed before use_stmt.

2024-07-11 Thread liuhongt
>- _5 = __atomic_fetch_or_8 (_work_pending_p, 1, 0); >- # DEBUG old => (long int) _5 >+ _6 = .ATOMIC_BIT_TEST_AND_SET (_work_pending_p, 0, 1, 0, >__atomic_fetch_or_8); >+ # DEBUG old => NULL > # DEBUG BEGIN_STMT >- # DEBUG D#2 => _5 & 1 >+ # DEBUG D#2 => NULL >... >- _10 = ~_5; >- _8 =

[PATCH] Rename __{float, double}_u to __x86_{float, double}_u to avoid polluting the namespace.

2024-07-07 Thread liuhongt
I have a build failure on NetBSD as the namespace pollution avoidance causes a direct hit with the system /usr/include/math.h === In file included from /usr/src/local/gcc/obj/gcc/include/emmintrin.h:31, from

[PATCH] [committed] Use __builtin_cpu_supports instead of __get_cpuid_count.

2024-07-04 Thread liuhongt
>> Hmm, now all avx512 tests SIGILL when testing with -m32: >> >> Dump of assembler code for function __get_cpuid_count: >> => 0x08049500 <+0>:     kmovd  %eax,%k2 >>    0x08049504 <+4>:     kmovd  %edx,%k1 >>    0x08049508 <+8>:     pushf >>    0x08049509 <+9>:     pushf >>    0x0804950a <+10>:  

[PATCH V2] x86: Update branch hint for Redwood Cove.

2024-07-03 Thread liuhongt
From: "H.J. Lu" >The above reads like it would be worth splitting branc_prediction_hits >into branch_prediction_hints_taken and branch_prediction_hints_not_taken >given not-taken is the default and thus will just increase code size? >According to Intel® 64 and IA-32 Architectures Optimization

[PATCH][committed] Move runtime check into a separate function and guard it with target ("no-avx")

2024-07-03 Thread liuhongt
The patch avoids a SIGILL on non-AVX512 machines caused by kmovd being generated in the dynamic check. Committed as an obvious fix. gcc/testsuite/ChangeLog: PR target/115748 * gcc.target/i386/avx512-check.h: Move runtime check into a separate function and guard it with target
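
A minimal sketch of the shape of the fix (function names hypothetical, not the committed testsuite code): the dynamic ISA check lives in a function compiled with the no-avx target attribute, so the compiler cannot emit kmovd or other AVX512 encodings into code that runs before the check.

  __attribute__((target ("no-avx")))
  static int
  isa_check_supported (void)
  {
    /* CPUID/XGETBV-based detection goes here; this function is compiled
       without AVX, so no AVX512 instructions can leak into it.  */
    return 0;
  }

  int
  main (void)
  {
    if (!isa_check_supported ())
      return 0;  /* Silently skip on machines without AVX512.  */
    /* Run the actual AVX512 test body here.  */
    return 0;
  }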

[PATCH] x86: Update branch hint for Redwood Cove.

2024-07-01 Thread liuhongt
From: "H.J. Lu" According to Intel® 64 and IA-32 Architectures Optimization Reference Manual[1], Branch Hint is updated for Redwood Cove. cut from [1]- Starting with the Redwood Cove microarchitecture, if the predictor has no stored information about a branch,

[PATCH 2/3] Extend lshifrtsi3_1_zext to ?k alternative.

2024-06-27 Thread liuhongt
late_combine will combine lshift + zero into *lshifrtsi3_1_zext, which causes an extra mov between gpr and kmask; add ?k to the pattern. gcc/ChangeLog: PR target/115610 * config/i386/i386.md (<*insnsi3_zext): Add alternative ?k, enable it only for lshiftrt and under avx512bw.

[PATCH 3/3] [x86] Enable late-combine.

2024-06-27 Thread liuhongt
Move pass_stv2 and pass_rpad after the pre_reload pass_late_combine; also define target_insn_cost to prevent the post_reload pass_late_combine from reverting the optimization done in pass_rpad. Adjust testcases since pass_late_combine generates better code but breaks scan-assembly checks, i.e. under a 32-bit target,

[PATCH 0/3][x86] Enable pass_late_combine for x86.

2024-06-27 Thread liuhongt
operation. After enabling late_combine, they're combined into embedded broadcast operations. Tested with SPEC2017, late_combine reduces codesize by ~0.6%, which means there are lots of small improvements. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? liuhongt (3

[PATCH 1/3] [avx512 testsuite] Define mask as extern instead of uninitialized local variables.

2024-06-27 Thread liuhongt
The testcases are supposed to scan for vpopcnt{b,w,d,q} operations with a k mask, but the mask is defined as an uninitialized local variable, which is set to 0 at the RTL expand phase and then simplified away by late_combine, causing a scan-assembly failure. Move the definition of mask outside

[PATCH] Fix native_encode_vector_part for itype when TYPE_PRECISION (itype) == BITS_PER_UNIT

2024-06-27 Thread liuhongt
For the testcase in PR115406, here is part of the dump. char D.4882; vector(1) _1; vector(1) signed char _2; char _5; : _1 = { -1 }; When assigning { -1 } to a vector(1) <signed-boolean:8>, since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of the dest with each vector

[PATCH 7/7] Remove vcond{, u, eq} expanders since they will be obsolete.

2024-06-27 Thread liuhongt
gcc/ChangeLog: PR target/115517 * config/i386/mmx.md (vcondv2sf): Removed. (vcond): Ditto. (vcond): Ditto. (vcondu): Ditto. (vcondu): Ditto. * config/i386/sse.md (vcond): Ditto. (vcond): Ditto. (vcond): Ditto.

[PATCH 5/7] Adjust the testcases regressed after the obsoletion of vcond{, u, eq}.

2024-06-27 Thread liuhongt
> Richard suggests that we implement the "obvious" transforms like > inversion in the middle-end but if for example unsigned compares > are not supported the us_minus + eq + negative trick isn't on > that list. > > The main reason to restrict vec_cmp would be to avoid > a <= b ? c : d going with

[PATCH 2/7] Lower AVX512 kmask comparison back to AVX2 comparison when op_{true, false} is vector -1/0.

2024-06-27 Thread liuhongt
gcc/ChangeLog PR target/115517 * config/i386/sse.md (*_cvtmask2_not): New pre_reload splitter. (*_cvtmask2_not): Ditto. (*avx2_pcmp3_6): Ditto. (*avx2_pcmp3_7): Ditto. --- gcc/config/i386/sse.md | 97

[PATCH 4/7] Add more splitter for mskmov with avx512 comparison.

2024-06-27 Thread liuhongt
gcc/ChangeLog: PR target/115517 * config/i386/sse.md (*_movmsk_lt_avx512): New define_insn_and_split. (*_movmsk_ext_lt_avx512): Ditto. (*_pmovmskb_lt_avx512): Ditto. (*_pmovmskb_zext_lt_avx512): Ditto.

[PATCH 6/7] [x86] Optimize a < 0 ? -1 : 0 to (signed)a >> 31.

2024-06-27 Thread liuhongt
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31 and x < 0 ? 1 : 0 into (unsigned) x >> 31. Add define_insn_and_split for the optimization done in ix86_expand_int_vcond. gcc/ChangeLog: PR target/115517 * config/i386/sse.md ("*ashr3_1"): New define_insn_and_split.

[PATCH 1/7] [x86] Add more splitters to match (unspec [op1 op2 (gt op3 constm1_operand)] UNSPEC_BLENDV)

2024-06-27 Thread liuhongt
These define_insn_and_split patterns are needed after vcond{,u,eq} becomes obsolete. gcc/ChangeLog: PR target/115517 * config/i386/sse.md (*_blendv_gt): New define_insn_and_split. (*_blendv_gtint): Ditto. (*_blendv_not_gtint): Ditto.

[PATCH 0/7][x86] Remove vcond{,u,eq} expanders.

2024-06-27 Thread liuhongt
Tested on x86-64 with -O2 -march=sapphirerapids. Didn't observe obvious performance change, mostly same binaries. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Any comments? liuhongt (7): [x86] Add more splitters to match (unspec [op1 op2 (gt op3 constm1_operand)] UNSPEC_BLENDV) Lower AV

[PATCH 3/7] [x86] Match IEEE min/max with UNSPEC_IEEE_{MIN,MAX}.

2024-06-27 Thread liuhongt
These versions of the min/max patterns implement exactly the operations: min = (op1 < op2 ? op1 : op2), max = (!(op1 < op2) ? op1 : op2). gcc/ChangeLog: PR target/115517 * config/i386/sse.md (*minmax3_1): New pre_reload define_insn_and_split. (*minmax3_2):
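
A scalar sketch of why operand order matters in these patterns (illustrative only, assuming SSE-style semantics): when op2 is a NaN, op1 < op2 is false and both operations return op2, so this min/max is not commutative.

  #include <math.h>
  #include <stdio.h>

  static double
  ieee_min (double op1, double op2)
  {
    return op1 < op2 ? op1 : op2;  /* A false compare selects op2.  */
  }

  int
  main (void)
  {
    printf ("%f\n", ieee_min (1.0, NAN));  /* nan */
    printf ("%f\n", ieee_min (NAN, 1.0));  /* 1.000000 */
    return 0;
  }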

[PATCH V2] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread liuhongt
> But rtx_cost invokes targetm.rtx_cost which allows to avoid that > recursive processing at any level. You're dealing with MEM [addr] > here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always > the best way to deal with this? Since this is the MEM [addr] case > we know it's not LEA, no?

[PATCH] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread liuhongt
416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0. The commit adjusted rtx_cost of mem to reduce the cost of (add op0 disp), but the cost of the address could be cheaper than that of XEXP (addr, 0) when it's a lea. That is the case in the PR; the patch uses the lower cost to enable more simplification and fix

[PATCH V3 Committed] [x86] Optimize a < 0 ? -1 : 0 to (signed)a >> 31.

2024-06-25 Thread liuhongt
Here's the patch committed. Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31 and x < 0 ? 1 : 0 into (unsigned) x >> 31. Move the optimization done in ix86_expand_int_vcond to match.pd. gcc/ChangeLog: PR target/114189 * match.pd: Simplify a < 0 ? -1 : 0 to (signed) a >> 31 and a

[PATCH V2] [x86] Optimize a < 0 ? -1 : 0 to (signed)a >> 31.

2024-06-23 Thread liuhongt
> I think the check for TYPE_UNSIGNED should be of TREE_TYPE (@0) rather > than type here. Changed > Or maybe you need `types_match (type, TREE_TYPE (@0))` too. And use tree_nop_conversion_p (type, TREE_TYPE (@0)) and add view_convert to rshift. Bootstrapped and regtested on

[PATCH] [match.pd] Optimize a < 0 ? -1 : 0 to (signed)a >> 31.

2024-06-20 Thread liuhongt
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31 and x < 0 ? 1 : 0 into (unsigned) x >> 31. Move the optimization done in ix86_expand_int_vcond to match.pd. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}, aarch64-linux-gnu. Ok for trunk? gcc/ChangeLog: PR target/114189
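
A scalar sketch of the transform (assuming 32-bit int; right shift of a negative value is implementation-defined in ISO C but arithmetic for GCC targets):

  int      to_mask (int x) { return x < 0 ? -1 : 0; }  /* becomes x >> 31 */
  unsigned to_bool (int x) { return x < 0 ?  1 : 0; }  /* becomes (unsigned) x >> 31 */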

[PATCH] Remove one_if_conv for latest Intel processors and Generic.

2024-06-13 Thread liuhongt
The tune was added by PR79390 for SciMark2 on Broadwell. With the latest GCC, with or without -mtune-ctrl=^one_if_conv_insn, GCC generates the same binary for SciMark2. And for SPEC2017, there's no big impact for SKX/CLX/ICX, and small improvements on SPR and later. gcc/ChangeLog: *

[PATCH Committed] Fix ICE due to REGNO of a SUBREG.

2024-06-12 Thread liuhongt
Use reg_or_subregno instead. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Committed as an obvious patch. gcc/ChangeLog: PR target/115452 * config/i386/i386-features.cc (scalar_chain::convert_op): Use reg_or_subregno instead of REGNO to avoid ICE.

[PATCH] Adjust ix86_rtx_costs for pternlog_operand_p.

2024-06-12 Thread liuhongt
r15-1100-gec985bc97a0157 improves handling of ternlog instructions; now GCC can recognize lots of pternlog_operand variants. The patch adjusts rtx_costs for that, so pass_combine can reasonably generate more optimal vpternlog instructions, i.e. for avx512f-vpternlog-3.c, with the

[PATCH V2] Fix ICE in rtl check due to CONST_WIDE_INT in CONST_VECTOR_DUPLICATE_P

2024-06-11 Thread liuhongt
> > I think if you only handle CONST_INT_P, you should check just for that, and > in both places where you check for CONST_VECTOR_DUPLICATE_P (there is one > spot 2 lines above this). > So add > && CONST_INT_P (XVECEXP (XEXP (op0, 1), 0, 0)) > and > && CONST_INT_P (XVECEXP (op1, 0, 0)) > tests

[PATCH] Fix ICE in rtl check due to CONST_WIDE_INT in CONST_VECTOR_DUPLICATE_P

2024-06-10 Thread liuhongt
In theory, const_wide_int can also be handled with an extra check for each component of the HOST_WIDE_INT array, and the check is needed for both shift and bit_and operands. I assume the optimization opportunity is rare, so the patch just adds an extra check to make sure GET_MODE_INNER (mode) can fit

[PATCH committed] Add additional option --param max-completely-peeled-insns=200 for power64*-*-*

2024-06-06 Thread liuhongt
gcc/testsuite/ChangeLog: * gcc.dg/vect/pr112325.c: Add additional option --param max-completely-peeled-insns=200 for power64*-*-*. --- gcc/testsuite/gcc.dg/vect/pr112325.c | 1 + 1 file changed, 1 insertion(+) diff --git a/gcc/testsuite/gcc.dg/vect/pr112325.c

[PATCH Committed] Refine testcase for power10.

2024-06-05 Thread liuhongt
For power10, there are 3 extra REG_EQUIV notes with (fix:SI. To avoid the failure, check that (fix:SI comes from the pattern, not the NOTE. gcc/testsuite/ChangeLog: PR target/115365 * gcc.dg/pr100927.c: Don't scan fix:SI from the note. --- gcc/testsuite/gcc.dg/pr100927.c | 2 +- 1 file

[V2 PATCH] Simplify (AND (ASHIFTRT A imm) mask) to (LSHIFTRT A imm) for vector mode.

2024-06-04 Thread liuhongt
> Can you add a testcase for this? I don't mind if it's x86 specific and > does a bit of asm scanning. > > Also note that the context for this patch has changed, so it won't > automatically apply. So be extra careful when updating so that it goes > into the right place (all the more reason to

[PATCH] [x86] Adjust testcase for -march=cascadelake

2024-06-03 Thread liuhongt
Committed as an obvious patch. gcc/testsuite/ChangeLog: PR target/115299 * gcc.target/i386/pr86722.c: Also scan for blendvpd. --- gcc/testsuite/gcc.target/i386/pr86722.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.target/i386/pr86722.c

[PATCH] [x86] Add some preference for floating point rtl ifcvt when sse4.1 is not available

2024-06-02 Thread liuhongt
W/o TARGET_SSE4_1, it takes 3 instructions (pand, pandn and por) for movdfcc/movsfcc, which could possibly fail the cost comparison. Increasing branch cost could hurt performance for other modes, so instead add some preference specifically for floating point ifcvt. Bootstrapped and regtested on
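
For reference, the three-instruction sequence is the classic bitwise select; a scalar sketch of what pand/pandn/por compute (illustrative, not the backend code), where mask is all-ones when the condition holds:

  #include <stdint.h>

  static inline uint64_t
  bit_select (uint64_t mask, uint64_t a, uint64_t b)
  {
    return (mask & a) | (~mask & b);  /* pand + pandn + por */
  }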

[committed] [x86] Rename double_u to __double_u to avoid polluting the namespace.

2024-05-30 Thread liuhongt
Committed as an obvious patch. gcc/ChangeLog: * config/i386/emmintrin.h (__double_u): Rename from double_u. (_mm_load_sd): Replace double_u with __double_u. (_mm_store_sd): Ditto. (_mm_loadh_pd): Ditto. (_mm_loadl_pd): Ditto. *

[PATCH] [x86] Support vcond_mask_qiqi and friends.

2024-05-28 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/sse.md (vcond_mask_): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/pr114125.c: New test. --- gcc/config/i386/sse.md | 20

[PATCH V2] Reduce cost of MEM (A + imm).

2024-05-28 Thread liuhongt
> IMO, there is no need for CONST_INT_P condition, we should also allow > symbol_ref, label_ref and const (all allowed by > x86_64_immediate_operand predicate), these all decay to an immediate > value. Changed. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? For MEM,

[PATCH][committed] [avx512] Fix predicate mismatch between vfcmaddcph's define_insn and define_expand.

2024-05-27 Thread liuhongt
When I applied Roger's patch [1], there was an ICE due to a latent bug; this patch fixes it. [1] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651365.html Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Pushed to trunk. gcc/ChangeLog: * config/i386/sse.md (___mask):

[PATCH] Reduce cost of MEM (A + imm).

2024-05-27 Thread liuhongt
For MEM, rtx_cost iterates over each subrtx and adds up the costs, so for MEM (reg) and MEM (reg + 4), the former costs 5 and the latter 9, which is not accurate for x86. Ideally address_cost should be used, but it reduces the cost too much. So the current solution is to make a constant disp as cheap as possible.

[PATCH] Don't simplify NAN/INF or out-of-range constant for FIX/UNSIGNED_FIX.

2024-05-26 Thread liuhongt
Update in V2: Guard constant folding for overflow value in fold_convert_const_int_from_real with flag_trapping_math. Add -fno-trapping-math to related testcases which warn for overflow in conversion from floating point to integer. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for

[PATCH] Fix typo in the testcase.

2024-05-24 Thread liuhongt
Committed as an obvious patch. gcc/testsuite/ChangeLog: PR target/114148 * gcc.target/i386/pr106010-7b.c: Refine testcase. --- gcc/testsuite/gcc.target/i386/pr106010-7b.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git

[V3 PATCH] Don't reduce estimated unrolled size for innermost loop.

2024-05-24 Thread liuhongt
Update in V3: > Since this was about vectorization can you instead add a testcase to > gcc.dg/vect/ and check for > vectorization to happen? Move to vect/pr112325.c. > > I believe the if (unr_insn <= 0) check can go as well. Removed. > as said, you want to do > > curolli = false; > >

[V2 PATCH] Don't reduce estimated unrolled size for innermost loop at cunrolli.

2024-05-21 Thread liuhongt
>> Hard to find a default value satisfying all testcases. >> some require loop unroll with 7 insns increment, some don't want loop >> unroll w/ 5 insn increment. >> The original 2/3 reduction happened to meet all those testcases(or the >> testcases are constructed based on the old 2/3). >> Can we

[PATCH] Don't simplify NAN/INF or out-of-range constant for FIX/UNSIGNED_FIX.

2024-05-21 Thread liuhongt
According to the IEEE standard, for conversions from floating point to integer: when a NaN or infinite operand cannot be represented in the destination format and this cannot otherwise be indicated, the invalid operation exception shall be signaled. When a numeric operand would convert to an integer
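
A small demonstration of the exception the fold must not hide (hypothetical test; the conversion is undefined in ISO C, but on x86 cvttsd2si signals the invalid operation exception; link with -lm):

  #include <fenv.h>
  #include <math.h>
  #include <stdio.h>

  int
  main (void)
  {
    volatile double d = INFINITY;
    feclearexcept (FE_ALL_EXCEPT);
    volatile int i = (int) d;  /* Not representable: raises FE_INVALID.  */
    printf ("invalid raised: %d\n", fetestexcept (FE_INVALID) != 0);
    (void) i;
    return 0;
  }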

[PATCH 2/2] [x86] Adjust rtx_cost for MEM to enable more simplication

2024-05-20 Thread liuhongt
For CONST_VECTOR_DUPLICATE_P in constant_pool, it is just broadcast or variants in ix86_vector_duplicate_simode_const. Adjust the cost to COSTS_N_INSNS (2) + speed which should be a little bit larger than broadcast. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk?

[PATCH 1/2] Simplify (AND (ASHIFTRT A imm) mask) to (LSHIFTRT A imm) for vector mode.

2024-05-20 Thread liuhongt
When the mask is ((1 << (prec - imm)) - 1), which is used to clear the upper bits of A, it can be simplified to LSHIFTRT, i.e. simplify (and:v8hi (ashiftrt:v8hi A 8) (const_vector 0xff x8)) to (lshiftrt:v8hi A 8). Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog:
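
A one-lane illustration with prec = 16 and imm = 8, so mask = (1 << (16 - 8)) - 1 = 0xff (scalar sketch): the AND discards exactly the replicated sign bits, so both forms agree for every input.

  #include <stdint.h>

  static inline uint16_t
  and_of_ashiftrt (int16_t a) { return (uint16_t) (a >> 8) & 0xff; }

  static inline uint16_t
  lshiftrt (int16_t a) { return (uint16_t) a >> 8; }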

[PATCH] Use pblendw instead of pand to clear upper 16 bits.

2024-05-16 Thread liuhongt
For vec_pack_truncv8si/v4si w/o AVX512, (const_vector:v4si (const_int 0xffff) x4) is used as a mask to clear the upper 16 bits, but vpblendw with a zero vector can also be used, and the zero vector is cheaper than (const_vector:v4si (const_int 0xffff) x4). Bootstrapped and regtested on
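
A sketch of the two equivalent forms using SSE4.1 intrinsics (illustrative; the patch works on the RTL patterns, not on intrinsics; compile with -msse4.1):

  #include <smmintrin.h>

  static inline __m128i
  clear_upper16_pand (__m128i x)
  {
    return _mm_and_si128 (x, _mm_set1_epi32 (0xffff));  /* pand, constant-pool mask */
  }

  static inline __m128i
  clear_upper16_pblendw (__m128i x)
  {
    /* Select words 1,3,5,7 from the zero vector: pblendw + pxor.  */
    return _mm_blend_epi16 (x, _mm_setzero_si128 (), 0xaa);
  }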

[PATCH] [x86] Set d.one_operand_p to true when TARGET_SSSE3 in ix86_expand_vecop_qihi_partial.

2024-05-15 Thread liuhongt
pshufb is available under TARGET_SSSE3, so ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3. W/o TARGET_SSSE3, if we set one_operand_p to true, ix86_expand_vec_perm_const_1 could return false. With the patch, under -march=x86-64-v2: v8qi foo (v8qi a) { return a >> 5; } <

[PATCH] [x86] Optimize ashift >> 7 to vpcmpgtb for vector int8.

2024-05-14 Thread liuhongt
Since there is no corresponding instruction, the shift operation for vector int8 is implemented using the instructions for vector int16, but for some special shift counts it can be transformed into vpcmpgtb. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk.
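
The special case in scalar form (illustrative; right shift of a negative value is arithmetic on this target): a shift count of 7 smears the sign bit of each byte, which is just a signed compare against zero.

  #include <stdint.h>

  static inline int8_t
  sar7 (int8_t a) { return a >> 7; }  /* normally needs the int16 emulation */

  static inline int8_t
  gt_zero (int8_t a) { return 0 > a ? -1 : 0; }  /* one vpcmpgtb against zero */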

[PATCH] Don't reduce estimated unrolled size for innermost loop.

2024-05-12 Thread liuhongt
As the testcase in the PR shows, at O3 cunrolli may prevent vectorization of the innermost loop and increase register pressure. The patch removes the 1/3 reduction of unr_insn for the innermost loop for UL_ALL. ul != UL_ALL is needed since some small-loop complete unrolling at O2 relies on the reduction.

[PATCH] Don't assert for IFN_COND_{MIN, MAX} in vect_transform_reduction

2024-04-29 Thread liuhongt
The Fortran standard does not specify what the result of the MAX and MIN intrinsics is if one of the arguments is a NaN. So it should be ok to transform the reduction for IFN_COND_MIN with vectorized COND_MIN and REDUC_MIN. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk and

[PATCH] [x86] Optimize 64-bit vector permutation with punpcklqdq + 128-bit vector pshuf.

2024-04-28 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/113090 * config/i386/i386-expand.cc (expand_vec_perm_punpckldq_pshuf): New function. (ix86_expand_vec_perm_const_1): Try expand_vec_perm_punpckldq_pshuf

[PATCH 1/2] [x86] Support dot_prod optabs for 64-bit vector.

2024-04-28 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/113079 * config/i386/mmx.md (usdot_prodv8qi): New expander. (sdot_prodv8qi): Ditto. (udot_prodv8qi): Ditto. (usdot_prodv4hi): Ditto.

[PATCH 2/2] Extend usdot_prodv*qi with vpmaddwd when AVXVNNI/AVX512VNNI is not available.

2024-04-28 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/sse.md (usdot_prodv*qi): Extend to VI1_AVX512 with vpmaddwd when avxvnni/avx512vnni is not available. --- gcc/config/i386/sse.md | 55

[PATCH] Update libbid according to the latest Intel Decimal Floating-Point Math Library.

2024-04-27 Thread liuhongt
The Intel Decimal Floating-Point Math Library is available as open-source on Netlib[1]. [1] https://www.netlib.org/misc/intel/. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? libgcc/config/libbid/ChangeLog: * bid128_fma.c (add_and_round): Fix bug: the result

[PATCH] [x86] Adjust alternative *k to ?k for avx512 mask in zero_extend patterns

2024-04-27 Thread liuhongt
So when both the source and dest operands require avx512 MASK_REGS, RA can allocate a MASK_REGS register instead of a GPR to avoid reloading from GPR to MASK_REGS. It's similar to what was done for the logic patterns. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog:

[PATCH V2] sanitizer: [PR110027] Align asan_vec[0] to MAX (BIGGEST_ALIGNMENT / BITS_PER_UNIT, ASAN_RED_ZONE_SIZE)

2024-03-26 Thread liuhongt
> > So, try to add some other variable with larger size and smaller alignment > > to the frame (and make sure it isn't optimized away). > > > > alignb above is the alignment of the first partition's var, if > > align_frame_offset really needs to depend on the var alignment, it probably > > should

[PATCH] Move pr114396.c from gcc.target/i386 to gcc.c-torture/execute.

2024-03-21 Thread liuhongt
Also fixed a typo in the testcase. Committed as an obvious fix. gcc/testsuite/ChangeLog: PR tree-optimization/114396 * gcc.target/i386/pr114396.c: Move to... * gcc.c-torture/execute/pr114396.c: ...here. --- .../{gcc.target/i386 => gcc.c-torture/execute}/pr114396.c | 6

[PATCH] Fix runtime error for nonlinear iv vectorization (step_mult).

2024-03-21 Thread liuhongt
wi::from_mpz doesn't take a sign argument; we want wrapping instead of saturation, so pass utype and true to it, which fixes the bug. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk and backport to gcc13? gcc/ChangeLog: PR tree-optimization/114396
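
The wrap-versus-saturate distinction in scalar form (illustrative analogy; per the description, passing true to wi::from_mpz selects the wrapping behavior):

  #include <stdint.h>

  static inline uint8_t
  wrapped (unsigned v) { return (uint8_t) v; }  /* 300 -> 44, i.e. 300 mod 256 */

  static inline uint8_t
  saturated (unsigned v) { return v > 255 ? 255 : v; }  /* 300 -> 255 */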

[PATCH V2] Document -fexcess-precision=16.

2024-03-19 Thread liuhongt
gcc/ChangeLog: * doc/invoke.texi: Document -fexcess-precision=16. --- gcc/doc/invoke.texi | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 85c938d4a14..6bc1ebf9721 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -14930,6

[PATCH] Document -fexcess-precision=16.

2024-03-18 Thread liuhongt
Ok for trunk? gcc/ChangeLog: * doc/invoke.texi: Document -fexcess-precision=16. --- gcc/doc/invoke.texi | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 85c938d4a14..673420fdd3e 100644 --- a/gcc/doc/invoke.texi +++

[PATCH] i386 [stv]: Handle REG_EH_REGION note [pr111822].

2024-03-18 Thread liuhongt
Commit r14-9459-g618e34d56cc38e only handles general_scalar_chain::convert_op. The patch also handles timode_scalar_chain::convert_op to avoid a potential similar bug. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk and backport to releases/gcc-13 branch? gcc/ChangeLog:

[PATCH] Add missing hf/bf patterns.

2024-03-17 Thread liuhongt
It fixes an ICE on an unrecognized logic-operation insn generated by the lroundmn2 expanders. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/114334 * config/i386/i386.md (mode): Add new number V8BF,V16BF,V32BF.

[PATCH] i386[stv]: Handle REG_EH_REGION note

2024-03-13 Thread liuhongt
When we split (insn 37 36 38 10 (set (reg:DI 104 [ _18 ]) (mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) [6 MEM[(struct SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])) "test.C":22:42 84 {*movdi_internal} (expr_list:REG_EH_REGION (const_int -11 [0xfff5])

[PATCH] sanitizer: [PR110027] Align asan_vec[0] to MAX (alignb, ASAN_RED_ZONE_SIZE)

2024-03-12 Thread liuhongt
If alignb > ASAN_RED_ZONE_SIZE and offset[0] is not a multiple of alignb, (base_align_bias - base_offset) may not be aligned to alignb, which caused a segmentation fault. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk and backport to GCC13? gcc/ChangeLog: PR sanitizer/110027

[PATCH] Fix testcase for platform without gnu/stubs-x32.h

2024-02-18 Thread liuhongt
Target maybe_x32 doesn't check whether the platform has gnu/stubs-x32.h, but it's included via stdint.h in the testcase. Adjust the testcase: remove stdint.h and use 'typedef long long int64_t' instead. Committed as an obvious patch. gcc/testsuite/ChangeLog: PR target/113711 *

[PATCH wwwdoc] Hardware-assisted AddressSanitizer now works for x86_64 with LAM_U57

2024-02-08 Thread liuhongt
--- htdocs/gcc-14/changes.html | 5 + 1 file changed, 5 insertions(+) diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index 6d917535..a022357a 100644 --- a/htdocs/gcc-14/changes.html +++ b/htdocs/gcc-14/changes.html @@ -499,6 +499,11 @@ a work-in-progress.

[PATCH 1/2] Adjust hwasan testcase for x86 target.

2024-01-22 Thread liuhongt
There are 2 cases: 1. hwasan-poison-optimisation.c is supposed to scan for a call to __hwasan_tag_mismatch4, and x86 has a different mnemonic (call) from aarch64 (bl), so adjust the testcase to scan for either call or bl. 2. alloca-outside-caught.c/vararray-outside-caught.c are supposed to scan mismatched tags and

[PATCH 2/2] [x86] Enable -mlam=u57 by default when compiled with -fsanitize=hwaddress.

2024-01-22 Thread liuhongt
Ready to push to trunk. gcc/ChangeLog: * config/i386/i386-options.cc (ix86_option_override_internal): Enable -mlam=u57 by default when compiled with -fsanitize=hwaddress. --- gcc/config/i386/i386-options.cc | 9 + 1 file changed, 9 insertions(+) diff --git

[PATCH] Adjust testcase gcc.target/i386/part-vect-copysignhf.c.

2024-01-18 Thread liuhongt
After vect_early_break is supported, more vectorization is enabled (3 COPYSIGN), so adjust the testcase for that. Committed as an obvious fix. gcc/testsuite/ChangeLog: * gcc.target/i386/part-vect-copysignhf.c: Remove -ftree-vectorize from dg-options. ---

[PATCH] Fix testcase failure on many platforms which don't support vect_int_max.

2024-01-18 Thread liuhongt
After r14-7124-g6686e16fda4190, the testcase can be optimized to MAX_EXPR if the backend supports that. So I adjusted the testcase to scan for MAX_EXPR, but it failed on many platforms that don't support it. As pinski mentioned, target vect_no_int_min_max is only available in the vect directory, so

[PATCH] Document refactoring of the option -fcf-protection=x.

2024-01-09 Thread liuhongt
To override -fcf-protection, -fcf-protection=none needs to be added first and then followed by -fcf-protection=xxx. --- htdocs/gcc-14/changes.html | 6 ++ 1 file changed, 6 insertions(+) diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index e3a68998..72b0d291 100644 ---

[PATCH] Update documents for fcf-protection=

2024-01-09 Thread liuhongt
After r14-2692-g1c6231c05bdcca, the option is defined as EnumSet and -fcf-protection=branch won't unset any other bits since they're in different groups. So to override -fcf-protection, an explicit -fcf-protection=none needs to be added first, followed by -fcf-protection=XXX. Bootstrapped and
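
Illustrative invocations (hypothetical file name):

  gcc -c test.c -fcf-protection=full                         # IBT + shadow stack
  gcc -c test.c -fcf-protection=none -fcf-protection=branch  # reset, then IBT only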

[PATCH] Optimize A < B ? A : B to MIN_EXPR.

2024-01-09 Thread liuhongt
> I wonder if you can amend the existing patterns instead by iterating > over cond/vec_cond.  There are quite some (look for uses of > minmax_from_comparison) that could be adapted to vectors. > > The ones matching the simple form you match are > > #if GIMPLE > /* A >= B ? A : B -> max (A, B) and

[PATCH] Optimize A < B ? A : B to MIN_EXPR.

2023-12-18 Thread liuhongt
Similarly for A < B ? B : A to MAX_EXPR. There is code in the frontend to optimize such patterns, but it fails to handle the testcase in the PR since the pattern is only exposed at the gimple level when folding backend builtins. pr95906 can now be optimized to MAX_EXPR, as commented in the testcase. // FIXME: this
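
The shapes involved, as a scalar sketch:

  int imin (int a, int b) { return a < b ? a : b; }  /* -> MIN_EXPR <a, b> */
  int imax (int a, int b) { return a < b ? b : a; }  /* -> MAX_EXPR <a, b> */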

[PATCH] Force broadcast constant to mem for vec_dup{v4di, v8si, v4df, v8df} when TARGET_AVX2 is not available.

2023-12-12 Thread liuhongt
vpbroadcastd/vpbroadcastq is available under TARGET_AVX2, but the vec_dup{v4di,v8si} pattern is available under AVX with a memory operand, and it will cause LRA/reload to generate a spill and reload if we put the constant in a register. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to

[PATCH] Adjust vectorized cost for reduction.

2023-12-11 Thread liuhongt
x86 doesn't support horizontal reduction instructions; reduc_op_scal_m is emulated with vec_extract_half + op (half vector length). Take that into account when calculating the cost for vectorization. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. No big performance impact on SPEC2017 as

[v3 PATCH] Simplify vector ((VCE (a cmp b ? -1 : 0)) < 0) ? c : d to just (VCE ((a cmp b) ? (VCE c) : (VCE d))).

2023-12-10 Thread liuhongt
> since you are looking at TYPE_PRECISION below you want > VECTOR_INTEGER_TYPE_P here as well? The alternative > would be to compare TYPE_SIZE. > > Some of the checks feel redundant but are probably good for > documentation purposes. > > OK with using VECTOR_INTEGER_TYPE_P Actually, the data

[PATCH] [ICE] Support vpcmov for V4HF/V4BF/V2HF/V2BF under TARGET_XOP.

2023-12-07 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/112904 * config/i386/mmx.md (*xop_pcmov_): New define_insn. gcc/testsuite/ChangeLog: * g++.target/i386/pr112904.C: New test. --- gcc/config/i386/mmx.md

[PATCH] Don't assume it's AVX_U128_CLEAN after a call_insn whose abi.mode_clobber(V4DImode) doesn't contain all SSE_REGS.

2023-12-07 Thread liuhongt
If the function doesn't clobber any SSE registers or only clobbers the 128-bit part, then vzeroupper isn't issued before the function exit; the status is not CLEAN but ANY after the function. Also for sibling_call, it's safe to issue a vzeroupper. Also there could be a missing vzeroupper since there's no

[PATCH] Support udot_prodv*qi with emulation via sdot_prodv*hi

2023-12-03 Thread liuhongt
Like r14-5990-gb4a7c1c8c59d19, but the patch optimizes udot_prod. Since (zero_extend) (unsigned char) -> int is equal to (zero_extend) (unsigned char) -> short + (sign_extend) (short) -> int. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. It should be safe to
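
The value-range argument in scalar form (illustrative): every unsigned char zero-extends into the non-negative range of short, so a further sign-extend to int preserves the value, and the unsigned dot product can reuse the signed v*hi sequence.

  #include <stdint.h>

  static inline int direct (uint8_t a) { return (int) a; }             /* zero_extend u8 -> int */
  static inline int two_step (uint8_t a) { return (int) (int16_t) a; } /* u8 -> s16 -> int */
  /* direct (x) == two_step (x) for all x in 0..255.  */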

[PATCH] Don't vectorize when vector stmts are only vec_construct and stores

2023-12-03 Thread liuhongt
I.e. for the cases below: a[0] = b1; a[1] = b2; .. a[n] = bn; There are extra dependences when constructing the vector, but not for the scalar stores. According to experiments, it's generally worse. The patch adds a cut-off heuristic when the vec_stmts are just vec_construct and vector store. It

[PATCH] Take register pressure into account for vec_construct/scalar_to_vec when the components are not loaded from memory.

2023-11-30 Thread liuhongt
> Hmm, I would suggest you put reg_needed into the class and accumulate > over all vec_construct, with your patch you pessimize a single v32qi > over two separate v16qi for example. Also currently the whole block is > gated with INTEGRAL_TYPE_P but register pressure would be also > a concern for

[PATCH] Use vec_extract_lo instead of subreg in reduc__scal_m.

2023-11-29 Thread liuhongt
The loop vectorizer will use vec_perm to select the lower part of a vector; there could be some redundancy when using subreg in reduc__scal_m, because RTL cse can't figure out that selecting the lower part is just a subreg. I'm trying to canonicalize vec_select to subreg like aarch64 did, but there are so many

[PATCH] [x86] Support sdot_prodv*qi with emulation of sdot_prodv*hi.

2023-11-28 Thread liuhongt
Currently sdot_prodv*qi is available under TARGET_AVXVNNIINT8, but it can be emulated by vec_unpacks_lo_v32qi vec_unpacks_lo_v32qi vec_unpacks_hi_v32qi vec_unpacks_hi_v32qi sdot_prodv16hi sdot_prodv16hi add3v8si which is faster than the original vect_patt_39.11_48 = WIDEN_MULT_LO_EXPR ;

[PATCH] Take register pressure into account for vec_construct when the components are not loaded from memory.

2023-11-27 Thread liuhongt
For vec_construct, the components must be live at the same time if they're not loaded from memory; when the number of those components exceeds the available registers, spills happen. Try to account for that with a rough estimation. ??? Ideally, we should have an overall estimation of register pressure if

[PATCH] Set AVOID_256FMA_CHAINS to m_GENERIC as it's generally good for new platforms

2023-11-21 Thread liuhongt
From: "Zhang, Annita" Avoid_fma_chain was enabled in m_SAPPHIRERAPIDS, m_ALDERLAKE and m_CORE_HYBRID. It can also be enabled in m_GENERIC to improve the performance of -march=x86-64-v3/v4 with -mtune=generic set by default. One SPEC2017 benchmark 510.parest_r can improve greatly due to it. From

[PATCH] [x86] Support reduc_{and, ior, xor}_scal_m for V4HI/V8QI/V4QImode

2023-11-19 Thread liuhongt
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/112325 * config/i386/i386-expand.cc (emit_reduc_half): Handle V8QImode. * config/i386/mmx.md (reduc__scal_): New expander. (reduc__scal_v4qi): Ditto.

[PATCH] Support cbranchm for Vector HI/QImode.

2023-11-16 Thread liuhongt
The missing cbranchv*{hi,qi}4 may be needed by early break vectorization. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/sse.md (cbranch4): Extend to Vector HI/QImode. --- gcc/config/i386/sse.md | 10 -- 1 file

[PATCH 1/2] Support reduc_{plus, xor, and, ior}_scal_m for vector integer mode.

2023-11-16 Thread liuhongt
The BB vectorizer relies on backend support for .REDUC_{PLUS,IOR,XOR,AND} to vectorize reductions. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/112325 * config/i386/sse.md (reduc__scal_): New expander.

[PATCH 2/2] Add i?86-*-* and x86_64-*-* to vect_logical_reduc

2023-11-16 Thread liuhongt
The x86 backend supports reduc_{and,ior,xor}_scal_m for vector integer modes. Ok for trunk? gcc/testsuite/ChangeLog: * lib/target-supports.exp (vect_logical_reduc): Add i?86-*-* and x86_64-*-*. --- gcc/testsuite/lib/target-supports.exp | 3 ++- 1 file changed, 2 insertions(+), 1

[V2 PATCH] Simplify vector ((VCE (a cmp b ? -1 : 0)) < 0) ? c : d to just (VCE ((a cmp b) ? (VCE c) : (VCE d))).

2023-11-16 Thread liuhongt
Update in V2: 1) Add some comments before the pattern. 2) Remove ? from view_convert. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? While working on PR112443, I noticed some misoptimizations: after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend

[PATCH] Fix ICE of unrecognizable insn.

2023-11-15 Thread liuhongt
The newly added splitter will generate (insn 58 56 59 2 (set (reg:V4HI 20 xmm0 [129]) (vec_duplicate:V4HI (reg:HI 22 xmm2 [123]))) "testcase.c":16:21 -1 But we only have (define_insn "*vec_dupv4hi" [(set (match_operand:V4HI 0 "register_operand" "=y,Yw") (vec_duplicate:V4HI

[PATCH] Fix ICE in vectorizable_nonlinear_induction with bitfield.

2023-11-13 Thread liuhongt
if (TREE_CODE (init_expr) == INTEGER_CST) init_expr = fold_convert (TREE_TYPE (vectype), init_expr); else gcc_assert (tree_nop_conversion_p (TREE_TYPE (vectype), TREE_TYPE (init_expr))); and init_expr is a 24 bit integer type while vectype has
