[PATCH] Remove constraint modifier % for fcmaddcph/fmaddcph/fcmulcph since they're not commutative.

2023-09-10 Thread liuhongt via Gcc-patches
Here's the patch I've committed. The patch also removes % for vfmaddcph. gcc/ChangeLog: PR target/111306 PR target/111335 * config/i386/sse.md (int_comm): New int_attr. (fma__): Remove % for Complex conjugate operations since they're not commutative.
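A scalar sketch of why a conjugate multiply cannot carry the commutative `%` modifier. The `cplx` type and the helper names are hypothetical, and which operand the hardware conjugates is an assumption here; the point is only that swapping the operands changes the result.

```c
#include <stdbool.h>

/* Hypothetical helper type, not taken from GCC or the patch. */
typedef struct { float re, im; } cplx;

/* Plain complex multiply: swapping a and b gives the same result. */
static cplx cmul(cplx a, cplx b) {
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Multiply a by the conjugate of b. Conjugating the second operand is an
   assumption made for illustration; the asymmetry is what matters. */
static cplx cmul_conj(cplx a, cplx b) {
    cplx r = { a.re * b.re + a.im * b.im, a.im * b.re - a.re * b.im };
    return r;
}

static bool cplx_eq(cplx a, cplx b) {
    return a.re == b.re && a.im == b.im;
}
```

With a = 1+2i and b = 3+4i, `cmul_conj(a, b)` is 11+2i while `cmul_conj(b, a)` is 11-2i, so letting the register allocator swap the inputs would flip the sign of the imaginary part.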

[PATCH] Remove constraint modifier % for fcmaddcph/fcmulcph since they're not commutative.

2023-09-07 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} on SPR. Ready to push to trunk and backport to GCC13/GCC12. gcc/ChangeLog: PR target/111306 * config/i386/sse.md (int_comm): New int_attr. (fma__): Remove % for Complex conjugate operations since they're not

[PATCH] Support vpermw/vpermi2w/vpermt2w instructions for vector HF/BFmodes.

2023-09-06 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/sse.md (_vpermt2var3): New define_insn. (VHFBF_AVX512VL): New mode iterator. (VI2HFBF_AVX512VL): New mode iterator. --- gcc/config/i386/sse.md | 32

[PATCH] Generate vmovsh instead of vpblendw for specific vec_merge.

2023-09-04 Thread liuhongt via Gcc-patches
On SPR, vmovsh can be executed on 3 ports, vpblendw can only be executed on 2 ports. On znver4, vpblendw can be executed on 4 ports, if vmovsh is similar to vmovss, then it can also be executed on 4 ports. So there's no difference for znver? but vmovsh is more optimized on SPR. Bootstrapped and

[PATCH] Adjust costing of emulated vectorized gather/scatter

2023-08-30 Thread liuhongt via Gcc-patches
r14-332-g24905a4bd1375c adjusts costing of emulated vectorized gather/scatter. commit 24905a4bd1375ccd99c02510b9f9529015a48315 Author: Richard Biener Date: Wed Jan 18 11:04:49 2023 +0100 Adjust costing of emulated vectorized gather/scatter Emulated gather/scatter behave similar

[PATCH] Refactor vector HF/BF mode iterators and patterns.

2023-08-30 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: * config/i386/sse.md (_blendm): Merge VF_AVX512HFBFVL into VI12HFBF_AVX512VL. (VF_AVX512HFBF16): Renamed to VHFBF. (VF_AVX512FP16VL): Renamed to VHF_AVX512VL.

[PATCH] Use vmaskmov{ps, pd} for VI48_128_256 when TARGET_AVX2 is not available.

2023-08-24 Thread liuhongt via Gcc-patches
vpmaskmov{d,q} is available for TARGET_AVX2, vmaskmov{ps,pd} is available for TARGET_AVX, w/o TARGET_AVX2, we can use vmaskmov{ps,pd} for VI48_128_256. Bootstrapped and regtested on x86_64-pc-linux{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/19 * config/i386/sse.md

[PATCH] [x86] Refactor mode iterator V_128 and V_128H, V_256 and V_256H

2023-08-24 Thread liuhongt via Gcc-patches
Merge V_128H and V_256H into V_128 and V_256, adjust related patterns. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: * config/i386/sse.md (vec_set): Removed. (V_128H): Merge into .. (V_128): .. this. (V_256H): Merge

[PATCH] Fix target_clone ("arch=graniterapids-d") and target_clone ("arch=arrowlake-s")

2023-08-22 Thread liuhongt via Gcc-patches
Both "graniterapids-d" and "graniterapids" are attached to PROCESSOR_GRANITERAPIDS in processor_alias_table but mapped to different __cpu_subtype in get_intel_cpu. And get_builtin_code_for_version will try to match the first PROCESSOR_GRANITERAPIDS in processor_alias_table which maps to

[PATCH] [x86] Testcase fix.

2023-08-21 Thread liuhongt via Gcc-patches
Committed as an obvious fix. gcc/testsuite/ChangeLog: * gcc.target/i386/invariant-ternlog-1.c: Only scan %rdx under TARGET_64BIT. --- gcc/testsuite/gcc.target/i386/invariant-ternlog-1.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git

[PATCH] Adjust testcase for Intel GDS.

2023-08-21 Thread liuhongt via Gcc-patches
gcc/testsuite/ChangeLog: * gcc.target/i386/avx512f-pr88464-2.c: Add -mgather to options. * gcc.target/i386/avx512f-pr88464-3.c: Ditto. * gcc.target/i386/avx512f-pr88464-4.c: Ditto. * gcc.target/i386/avx512f-pr88464-6.c: Ditto. *

[PATCH] Mention Intel -march=gracemont for Alderlake-N.

2023-08-20 Thread liuhongt via Gcc-patches
--- htdocs/gcc-14/changes.html | 4 1 file changed, 4 insertions(+) diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index eae25f1a..2c888660 100644 --- a/htdocs/gcc-14/changes.html +++ b/htdocs/gcc-14/changes.html @@ -151,6 +151,10 @@ a work-in-progress.

[PATCH] Support -march=gracemont

2023-08-18 Thread liuhongt via Gcc-patches
Alderlake-N is E-core only, add it as an alias of Alderlake. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Any comments? gcc/ChangeLog: * common/config/i386/cpuinfo.h (get_intel_cpu): Detect Alderlake-N. * common/config/i386/i386-common.cc (alias_table):

[PATCH] Generate vmovapd instead of vmovsd for moving DFmode between SSE_REGS.

2023-08-13 Thread liuhongt via Gcc-patches
vmovapd can enable register renaming and has the same code size as vmovsd. Similar for vmovsh vs vmovaps, vmovaps is 1 byte less than vmovsh. When TARGET_AVX512VL is not available, still generate vmovsd/vmovss/vmovsh to avoid vmovapd/vmovaps zmm16-31. Bootstrapped and regtested on

[PATCH V2] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions

2023-08-11 Thread liuhongt via Gcc-patches
Rename original use_gather to use_gather_8parts, Support -mtune-ctrl={,^}use_gather to set/clear tune features use_gather_{2parts, 4parts, 8parts}. Support the new option -mgather as alias of -mtune-ctrl=, use_gather, ^use_gather. Similar for use_scatter. How about this version? gcc/ChangeLog:

[PATCH] Software mitigation: Disable gather generation in vectorization for GDS affected Intel Processors.

2023-08-10 Thread liuhongt via Gcc-patches
For more details of GDS (Gather Data Sampling), refer to https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html After microcode update, there's performance regression. To avoid that, the patch disables gather

[PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-09 Thread liuhongt via Gcc-patches
Currently we have 3 different independent tunes for gather "use_gather,use_gather_2parts,use_gather_4parts", similar for scatter, there're "use_scatter,use_scatter_2parts,use_scatter_4parts" The patch supports 2 standardizing options to enable/disable vectorization for all gather/scatter

[PATCH] i386: Do not sanitize upper part of V2HFmode and V4HFmode reg with -fno-trapping-math [PR110832]

2023-08-09 Thread liuhongt via Gcc-patches
Also add ix86_partial_vec_fp_math to the condition of V2HF/V4HF named patterns in order to avoid generation of partial vector V8HFmode trapping instructions. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ok for trunk? gcc/ChangeLog: PR target/110832 *

[PATCH] Rename local variable subleaf_level to max_subleaf_level.

2023-08-09 Thread liuhongt via Gcc-patches
This minor fix is preapproved in [1]. Committed to trunk. [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626758.html gcc/ChangeLog: * common/config/i386/cpuinfo.h (get_available_features): Rename local variable subleaf_level to max_subleaf_level. ---

[PATCH V2] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-08 Thread liuhongt via Gcc-patches
> Please rather do it in a more self-descriptive way, as proposed in the > attached patch. You won't need a comment then. > Adjusted in V2 patch. Don't access leaf 7 subleaf 1 unless subleaf 0 says it is supported via EAX. Intel documentation says invalid subleaves return 0. We had been relying

[PATCH] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-08 Thread liuhongt via Gcc-patches
Don't access leaf 7 subleaf 1 unless subleaf 0 says it is supported via EAX. Intel documentation says invalid subleaves return 0. We had been relying on that behavior instead of checking the max subleaf number. It appears that some Sandy Bridge CPUs return at least the subleaf 0 EDX value for
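The check described above can be sketched without touching real hardware. This is a minimal model, not the code in cpuinfo.h: `cpuid_fn`, `leaf7_subleaf1_eax` and the two fake CPUs are hypothetical names, and the register values they return are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical callback standing in for a real CPUID query, so the logic
   can be exercised on any host. regs[] holds EAX, EBX, ECX, EDX. */
typedef void (*cpuid_fn)(uint32_t leaf, uint32_t subleaf, uint32_t regs[4]);

/* Return EAX of leaf 7 subleaf 1, or 0 when subleaf 0 does not advertise
   subleaf 1 through its EAX (the maximum subleaf number). */
static uint32_t leaf7_subleaf1_eax(cpuid_fn cpuid) {
    uint32_t regs[4] = {0, 0, 0, 0};
    cpuid(7, 0, regs);
    uint32_t max_subleaf_level = regs[0];
    if (max_subleaf_level < 1)
        return 0;            /* do not trust whatever subleaf 1 would return */
    cpuid(7, 1, regs);
    return regs[0];
}

/* Hypothetical CPU: only subleaf 0 is valid, yet an invalid subleaf
   still comes back with non-zero data instead of zeros. */
static void fake_buggy_cpu(uint32_t leaf, uint32_t subleaf, uint32_t regs[4]) {
    (void)leaf;
    regs[0] = (subleaf == 0) ? 0 : 0x5a5a5a5au;
    regs[1] = 0;
    regs[2] = 0;
    regs[3] = 0x1c000000u;   /* same EDX regardless of subleaf */
}

/* Hypothetical CPU that really implements subleaf 1. */
static void fake_newer_cpu(uint32_t leaf, uint32_t subleaf, uint32_t regs[4]) {
    (void)leaf;
    regs[0] = (subleaf == 0) ? 1 : 0x00000010u;
    regs[1] = 0;
    regs[2] = 0;
    regs[3] = 0;
}
```

On the buggy model the function returns 0 because subleaf 0 reports a maximum subleaf of 0, so the stray data is never interpreted as feature bits.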

[PATCH] i386: Clear upper bits of XMM register for V4HFmode/V2HFmode operations [PR110762]

2023-08-07 Thread liuhongt via Gcc-patches
Similar to r14-2786-gade30fad6669e5, the patch is for V4HF/V2HFmode. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/110762 * config/i386/mmx.md (3): Changed from define_insn to define_expand and break into ..

[PATCH] Fix ICE in rtl check when bootstrap.

2023-08-07 Thread liuhongt via Gcc-patches
/var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c: In function ‘matmul_i1_avx512f’: /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:1781:1: internal compiler error: RTL check: expected

[PATCH] Optimize vlddqu + inserti128 to vbroadcasti128

2023-08-01 Thread liuhongt via Gcc-patches
In [1], I proposed a patch to generate vmovdqu for all vlddqu intrinsics after AVX2, it was rejected as > The instruction is reachable only as __builtin_ia32_lddqu* (aka > _mm_lddqu_si*), so it was chosen by the programmer for a reason. I > think that in this case, the compiler should not be too

[PATCH] Support vec_fmaddsub/vec_fmsubadd for vector HFmode.

2023-08-01 Thread liuhongt via Gcc-patches
AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph. Also remove scalar mode from fmaddsub/fmsubadd pattern since there's no scalar instruction for that. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/81904 *

[PATCH] Adjust testcase for more optimal codegen.

2023-07-31 Thread liuhongt via Gcc-patches
After b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb is the first bad commit commit b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb Author: Jan Hubicka Date: Fri Jul 28 09:16:09 2023 +0200 loop-split improvements, part 1 Now we have vpbroadcastd %ecx, %xmm0 vpaddd .LC3(%rip), %xmm0, %xmm0

[PATCH] [x86] Add UNSPEC_MASKOP to vpbroadcastm pattern.

2023-07-27 Thread liuhongt via Gcc-patches
Prevent rtl optimization of vec_duplicate + zero_extend to vpbroadcastm since there could be an extra kmov after RA. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ready to push to trunk. gcc/ChangeLog: PR target/110788 * config/i386/sse.md (avx512cd_maskb_vec_dup):

[PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX

2023-07-20 Thread liuhongt via Gcc-patches
For Intel processors, after TARGET_AVX, vmovdqu is optimized as fast as vlddqu, UNSPEC_LDDQU can be removed to enable more optimizations. Can someone confirm this with AMD folks? If AMD doesn't like such optimization, I'll put my optimization under micro-architecture tuning. Bootstrapped and

[PATCH] Fix fp16 related testcase failure for i686.

2023-07-19 Thread liuhongt via Gcc-patches
> I see some regressions most likely with this change on i686-linux, > in particular: > +FAIL: gcc.dg/pr107547.c (test for excess errors) > +FAIL: gcc.dg/torture/floatn-convert.c -O0 (test for excess errors) > +UNRESOLVED: gcc.dg/torture/floatn-convert.c -O0 compilation failed to > produce

[PATCH] Remove # from one_cmpl2 assemble output.

2023-07-17 Thread liuhongt via Gcc-patches
optimize_insn_for_speed () in assemble output is not aligned with splitter condition, and it causes an ICE when building SPEC2017 blender_r. Not sure if ctrl is supposed to be reliable in assemble output, the patch just removes that as a workaround. Bootstrapped and regtested on

[PATCH] Fix typo in the testcase.

2023-07-11 Thread liuhongt via Gcc-patches
Antony Polukhin 2023-07-11 09:51:58 UTC There's a typo at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/testsuite/g%2B%2B.target/i386/pr110170.C;h=e638b12a5ee2264ecef77acca86432a9f24b103b;hb=d41a57c46df6f8f7dae0c0a8b349e734806a837b#l87 It should be `|| !test3() || !test3r()` rather than `||

[PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-11 Thread liuhongt via Gcc-patches
Similar to what we did for CMPXCHG, but extended to all ix86_comparison_int_operator since CMPCCXADD sets EFLAGS exactly the same as CMP. When operand order in CMP insn is same as that in CMPCCXADD, CMP insn can be eliminated directly. When operand order is swapped in CMP insn, only optimize cmpccxadd +

[PATCH v2] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-10 Thread liuhongt via Gcc-patches
Here's the updated patch. 1. use optimize_insn_for_speed_p instead of using optimize_function_for_speed_p. 2. explicitly move memory to dest register to avoid false dependence in one_cmpl pattern. False dependency happens when destination is only updated by pternlog. There is no false dependency

[PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-10 Thread liuhongt via Gcc-patches
Similar to what we did for cmpxchg, but extended to all ix86_comparison_int_operator since cmpccxadd sets EFLAGS exactly the same as CMP. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}, Ok for trunk? gcc/ChangeLog: PR target/110591 * config/i386/sync.md (cmpccxadd_): Add a new

[PATCH] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-09 Thread liuhongt via Gcc-patches
False dependency happens when destination is only updated by pternlog. There is no false dependency when destination is also used in source. So either a pxor should be inserted, or input operand should be set with constraint '0'. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to

[PATCH V2] [x86] Add pre_reload splitter to detect fp min/max pattern.

2023-07-06 Thread liuhongt via Gcc-patches
> Please split the above pattern into two, one emitting UNSPEC_IEEE_MAX > and the other emitting UNSPEC_IEEE_MIN. Split. > The test involves blendv instruction, which is SSE4.1, so it is > pointless to test it without -msse4.1. Please add -msse4.1 instead of > -march=x86_64 and use

[PATCH 1/2] [x86] Add pre_reload splitter to detect fp min/max pattern.

2023-07-05 Thread liuhongt via Gcc-patches
We have ix86_expand_sse_fp_minmax to detect min/max semantics, but it requires rtx_equal_p for cmp_op0/cmp_op1 and if_true/if_false, for the testcase in the PR, there's an extra move from cmp_op0 to if_true, and it failed ix86_expand_sse_fp_minmax. This patch adds pre_reload splitter to detect the

[PATCH 2/2] Adjust rtx_cost for DF/SFmode AND/IOR/XOR/ANDN operations.

2023-07-05 Thread liuhongt via Gcc-patches
They should have the same cost as vector mode since both generate pand/pandn/pxor/por instruction. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: * config/i386/i386.cc (ix86_rtx_costs): Adjust rtx_cost for DF/SFmode AND/IOR/XOR/ANDN operations.

[PATCH] Disparage slightly for the alternative which move DFmode between SSE_REGS and GENERAL_REGS.

2023-07-05 Thread liuhongt via Gcc-patches
For testcase void __cond_swap(double* __x, double* __y) { bool __r = (*__x < *__y); auto __tmp = __r ? *__x : *__y; *__y = __r ? *__y : *__x; *__x = __tmp; } GCC-14 with -O2 and -march=x86-64 options generates the following code: __cond_swap(double*, double*): movsd xmm1,

[PATCH] Break false dependence for vpternlog by inserting vpxor.

2023-07-03 Thread liuhongt via Gcc-patches
vpternlog is also used for optimization which doesn't need any valid input operand, in that case, the destination is used as input in the instruction and that creates a false dependence. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR

[PATCH 1/2] Don't issue vzeroupper for vzeroupper call_insn.

2023-06-26 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/82735 * config/i386/i386.cc (ix86_avx_u127_mode_needed): Don't emit vzeroupper for vzeroupper call_insn. gcc/testsuite/ChangeLog: *

[PATCH 2/2] Make option mvzeroupper independent of optimization level.

2023-06-26 Thread liuhongt via Gcc-patches
pass_insert_vzeroupper is under condition TARGET_AVX && TARGET_VZEROUPPER && flag_expensive_optimizations && !optimize_size But the document of mvzeroupper doesn't mention that the insertion requires -O2 and above, it may confuse users when they explicitly use -Os -mvzeroupper.

[PATCH] [x86] Refine maskstore patterns with UNSPEC_MASKMOV.

2023-06-26 Thread liuhongt via Gcc-patches
At the rtl level, we cannot guarantee that the maskstore is not optimized to other full-memory accesses, as the current implementations are equivalent in terms of pattern, to solve this potential problem, this patch refines the pattern of the maskstore and the intrinsics with unspec. One thing

[PATCH] Issue a warning for conversion between short and __bf16 under TARGET_AVX512BF16.

2023-06-26 Thread liuhongt via Gcc-patches
__bfloat16 is redefined from typedef short to real __bf16 since GCC V13. The patch issues a warning for potential silent implicit conversion between __bf16 and short where users may only expect a data movement. To avoid too many false positives, the warning is only issued under TARGET_AVX512BF16.
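The difference between a value conversion and a plain data movement can be modelled in portable C without the `__bf16` type itself. In this sketch a bfloat16 is represented by its 16 raw bits (the upper half of an IEEE single); the helper names are hypothetical, and truncation is used instead of proper rounding, which is a simplification that does not matter for small integers.

```c
#include <stdint.h>
#include <string.h>

/* Raw bfloat16 bits of a float: keep the upper 16 bits (truncating). */
static uint16_t bf16_bits_from_float(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (uint16_t)(u >> 16);
}

/* What an implicit short -> bf16 conversion computes: the numeric value
   of the short, re-encoded as a floating-point number. */
static uint16_t bf16_bits_from_short_value(short s) {
    return bf16_bits_from_float((float)s);
}

/* What code written against the old "typedef short" expected:
   the 16 bits are moved unchanged. */
static uint16_t bf16_bits_from_short_bits(short s) {
    uint16_t u;
    memcpy(&u, &s, sizeof u);
    return u;
}
```

For the short value 1, the value conversion yields the bits 0x3F80 (1.0), while the data movement keeps 0x0001, which is the kind of silent change the warning is meant to flag.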

[PATCH 1/3] Use cvt_op to save intermediate type operand instead of "subtle" vec_dest.

2023-06-25 Thread liuhongt via Gcc-patches
When there're multiple operands in vec_oprnds0, vec_dest will be overwritten to vectype_out, but in multi_step_cvt case, cvt_type is expected. It caused an ICE in verify_gimple_in_cfg. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} and aarch64-linux-gnu. Ok for trunk? gcc/ChangeLog:

[PATCH 3/3] [aarch64] Adjust testcase to match assembly output after r14-2007.

2023-06-25 Thread liuhongt via Gcc-patches
The new assembly looks better than original one, so I adjust those testcases. Ok for trunk? gcc/testsuite/ChangeLog: PR tree-optimization/110371 PR tree-optimization/110018 * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Scan scvt + sxtw instead of scvt + zip1 +

[PATCH 2/3] Don't use intermediate type for FIX_TRUNC_EXPR when ftrapping-math.

2023-06-25 Thread liuhongt via Gcc-patches
> > Hmm, good question. GENERIC has a direct truncation to unsigned char > > for example, the C standard generally says if the integral part cannot > > be represented then the behavior is undefined. So I think we should be > > safe here (0x1.0p32 doesn't fit an int). > > We should be following

[PATCH] Refine maskloadmn pattern with UNSPEC_MASKLOAD.

2023-06-20 Thread liuhongt via Gcc-patches
If mem_addr points to a memory region with less than whole vector size bytes of accessible memory and k is a mask that would prevent reading the inaccessible bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it from being transformed to vpblendd. Bootstrapped and regtested on

[PATCH] [vect]Use intermediate integer type for float_expr/fix_trunc_expr when direct optab does not exist.

2023-06-20 Thread liuhongt via Gcc-patches
I notice there's some refactoring in vectorizable_conversion for code_helper, so I've adjusted my patch to that. Here's the patch I'm going to commit. We already use an intermediate type in case WIDEN, but not for NONE, this patch extends that. gcc/ChangeLog: PR target/110018 *

[PATCH 2/2] Refined 256/512-bit vpacksswb/vpackssdw patterns.

2023-06-15 Thread liuhongt via Gcc-patches
The packing in vpacksswb/vpackssdw is not a simple concat, it's an interweave from src1 and src2 for every 128 bits (or 64 bits for the ss_truncate result), i.e. dst[192-255] = ss_truncate (src2[128-255]) dst[128-191] = ss_truncate (src1[128-255]) dst[64-127] = ss_truncate (src2[0-127]) dst[0-63]
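A scalar model of the 256-bit vpackssdw layout makes the per-lane interleave concrete. The function names are hypothetical; the element mapping follows the dst/src ranges listed in the message, with each 128-bit lane taking four saturated words from src1 followed by four from src2.

```c
#include <stdint.h>

/* Signed saturation from 32 bits to 16 bits. */
static int16_t ss_truncate_32_to_16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of 256-bit vpackssdw: 8 dwords per source, 16 words out.
   Within each 128-bit lane the low 64 bits come from src1 and the high
   64 bits from src2, so the result is not concat(src1, src2). */
static void packssdw_256(const int32_t src1[8], const int32_t src2[8],
                         int16_t dst[16]) {
    for (int lane = 0; lane < 2; lane++) {
        for (int i = 0; i < 4; i++) {
            dst[lane * 8 + i]     = ss_truncate_32_to_16(src1[lane * 4 + i]);
            dst[lane * 8 + 4 + i] = ss_truncate_32_to_16(src2[lane * 4 + i]);
        }
    }
}
```

Modelling the operation as a truncate of a simple concatenation would place src1's upper four elements at dst[4..7], whereas the instruction places them at dst[8..11].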

[PATCH 1/2] Reimplement packuswb/packusdw with UNSPEC_US_TRUNCATE instead of original us_truncate.

2023-06-15 Thread liuhongt via Gcc-patches
packuswb/packusdw does unsigned saturation for signed source, but rtl us_truncate means unsigned saturation for unsigned source. So for value -1, packuswb will produce 0, but us_truncate produces 255. The patch reimplements those related patterns and functions with UNSPEC_US_TRUNCATE instead
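The mismatch for the value -1 can be reproduced with two one-element models. Both helper names are hypothetical; the first treats the 16-bit source as signed, the second treats the same bits as unsigned.

```c
#include <stdint.h>

/* packuswb-style: the 16-bit source is signed, clamp into [0, 255]. */
static uint8_t pack_signed_to_u8(int16_t v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* us_truncate-style: the source is unsigned, clamp to at most 255. */
static uint8_t us_truncate_u16_to_u8(uint16_t v) {
    return v > 255 ? 255 : (uint8_t)v;
}
```

The bit pattern 0xFFFF is -1 for the first helper and 65535 for the second, so they return 0 and 255 respectively; for inputs in [0, 255] the two agree.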

[PATCH] [x86] Use x instead of v for alternative 2 (v, BH) in mov_internal.

2023-06-13 Thread liuhongt via Gcc-patches
Since there's no evex version for vpcmpeq ymm, ymm, ymm. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk and backport to GCC13. gcc/ChangeLog: PR target/110227 * config/i386/sse.md (mov_internal>): Use x instead of v for alternative 2

[PATCH 1/2] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABSU_EXPR + VCE.

2023-06-06 Thread liuhongt via Gcc-patches
r14-1145 folds the intrinsics into gimple ABS_EXPR which has UB for TYPE_MIN, but PABSB will store unsigned result into dst. The patch uses ABSU_EXPR + VCE instead of ABS_EXPR. Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit vector absm2 is guarded with TARGET_MMX_WITH_SSE.
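A one-element sketch of the ABSU_EXPR + VCE idea: compute the absolute value into an unsigned type, then reinterpret the bits as signed. The helper names are hypothetical and this only models a single byte lane.

```c
#include <stdint.h>
#include <string.h>

/* ABSU-style absolute value: the result type is unsigned, so the minimum
   signed value maps to 128 instead of overflowing. */
static uint8_t absu8(int8_t v) {
    return v < 0 ? (uint8_t)(-(int)v) : (uint8_t)v;
}

/* The view-convert step: reinterpret the unsigned result as a signed
   byte without changing any bits. */
static int8_t absu8_as_signed_bits(int8_t v) {
    uint8_t u = absu8(v);
    int8_t s;
    memcpy(&s, &u, sizeof s);
    return s;
}
```

For -128 the unsigned result is 128 (bits 0x80), and viewing those bits as signed gives -128 again, which is well defined here, whereas a signed absolute value of the minimum would overflow.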

[PATCH v2] Explicitly view_convert_expr mask to signed type when folding pblendvb builtins.

2023-06-06 Thread liuhongt via Gcc-patches
> I think this is a better patch and will always be correct and still > get folded at the gimple level (correctly): > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index d4ff56ee8dd..02bf5ba93a5 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -18561,8

[PATCH] Don't fold _mm{, 256}_blendv_epi8 into (mask < 0 ? src1 : src2) when -funsigned-char.

2023-06-05 Thread liuhongt via Gcc-patches
Since mask < 0 will always be false when -funsigned-char, but vpblendvb needs to check the most significant bit. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk and backport to GCC12/GCC13 release branch? gcc/ChangeLog: PR target/110108 *
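A per-byte sketch of the problem, using hypothetical helper names and neutral operand names (`if_set`, `if_clear`) rather than the intrinsic's argument order. The signed-view variant assumes the usual two's complement wraparound when converting 0x80 to `int8_t`, which is what GCC implements.

```c
#include <stdint.h>

/* Folding as "mask < 0 ? ... : ..." on an unsigned element type:
   the comparison can never be true. */
static uint8_t blend_by_sign_unsigned(uint8_t mask, uint8_t if_set,
                                      uint8_t if_clear) {
    return mask < 0 ? if_set : if_clear;
}

/* What the instruction actually tests: the most significant bit. */
static uint8_t blend_by_msb(uint8_t mask, uint8_t if_set, uint8_t if_clear) {
    return (mask & 0x80) ? if_set : if_clear;
}

/* Viewing the mask as signed first restores the intended meaning. */
static uint8_t blend_by_signed_view(uint8_t mask, uint8_t if_set,
                                    uint8_t if_clear) {
    return (int8_t)mask < 0 ? if_set : if_clear;
}
```

With mask 0x80 the first helper picks `if_clear` while the other two pick `if_set`, which is the miscompile the patch avoids.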

[PATCH] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABSU_EXPR + VCE.

2023-06-05 Thread liuhongt via Gcc-patches
r14-1145 folds the intrinsics into gimple ABS_EXPR which has UB for TYPE_MIN, but PABSB will store unsigned result into dst. The patch uses ABSU_EXPR + VCE instead of ABS_EXPR. Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit vector absm2 is guarded with TARGET_MMX_WITH_SSE.

[PATCH] [x86] Add missing vec_pack/unpacks patterns for _Float16 <-> int/float conversion.

2023-06-04 Thread liuhongt via Gcc-patches
This patch only supports vec_pack/unpacks optabs for vector modes whose length >= 128. For 32/64-bit vector, they're more handled by BB vectorizer with truncmn2/extendmn2/fix{,uns}_truncmn2. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog:

[PATCH] [vect]Use intermediate integer type for float_expr/fix_trunc_expr when direct optab does not exist.

2023-06-01 Thread liuhongt via Gcc-patches
We already use an intermediate type in case WIDEN, but not for NONE, this patch extends that. I didn't do that in pattern recog since we need to know whether the stmt belongs to any slp_node to decide the vectype, the related optabs are checked according to vectype_in and vectype_out. For

[PATCH] i386: Add missing vector truncate patterns [PR92658].

2023-06-01 Thread liuhongt via Gcc-patches
Add missing insn patterns for v2si -> v2hi/v2qi and v2hi-> v2qi vector truncate. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/92658 * config/i386/mmx.md (truncv2hiv2qi2): New define_insn. (truncv2si2): Ditto.

[PATCH] Don't try bswap + rotate when TYPE_PRECISION(n->type) > n->range.

2023-06-01 Thread liuhongt via Gcc-patches
For the testcase in the PR, we have br64 = br; br64 = ((br64 << 16) & 0x00ffull) | (br64 & 0xff00ull); n->n: 0x300200. n->range: 32. n->type: uint64. The original code assumes n->range is same as TYPE PRECISION(n->type), and tries to rotate the mask from

[PATCH] Disable avoid_false_dep_for_bmi for atom and icelake(and later) core processors.

2023-05-25 Thread liuhongt via Gcc-patches
lzcnt/tzcnt has been fixed since skylake, popcnt has been fixed since icelake. At least for icelake and later intel Core processors, the errata tune is not needed. And the tune isn't needed for ATOM either. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk.

[PATCH] [x86] Split notl + pbroadcast + pand to pbroadcast + pandn more modes.

2023-05-25 Thread liuhongt via Gcc-patches
r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 adds 2 splitters to transform notl + pbroadcast + pand to pbroadcast + pandn for VI124_AVX2 which leaves out all DI-element-size ones as well as all 512-bit ones. This patch extends the splitter to VI_AVX2 which will handle DImode for AVX2, and

[PATCH] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABS_EXPR.

2023-05-22 Thread liuhongt via Gcc-patches
Also for 64-bit vector abs intrinsics _mm_abs_{pi8,pi16,pi32}. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/109900 * config/i386/i386.cc (ix86_gimple_fold_builtin): Fold _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} and

[PATCH] Only use NO_REGS in cost calculation when !hard_regno_mode_ok for GENERAL_REGS and mode.

2023-05-17 Thread liuhongt via Gcc-patches
r14-172-g0368d169492017 replaces GENERAL_REGS with NO_REGS in cost calculation when the preferred register class are not known yet. It regressed powerpc PR109610 and PR109858, it looks too aggressive to use NO_REGS when mode can be allocated with GENERAL_REGS. The patch takes a step back, still

[PATCH V2] Provide -fcf-protection=branch,return.

2023-05-13 Thread liuhongt via Gcc-patches
> I think this could be simplified if you use either EnumSet or > EnumBitSet instead in common.opt for `-fcf-protection=`. Use EnumSet instead of EnumBitSet since CF_FULL is not a power of 2. It is a bit tricky for sets classification, cf_branch and cf_return should be in different sets, but they
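The remark that CF_FULL is not a power of 2 can be illustrated with a small flag enum and a toy parser. This is only a sketch of the idea: the enumerator values are assumed for illustration, `parse_cf_protection` is a hypothetical function, and GCC itself handles the option through the common.opt machinery rather than hand-written parsing like this.

```c
#include <string.h>

/* Illustrative values: branch and return are single bits, full is their
   union, so it cannot be described as one bit of a bit set. */
enum cf_level {
    CF_NONE   = 0,
    CF_BRANCH = 1 << 0,
    CF_RETURN = 1 << 1,
    CF_FULL   = CF_BRANCH | CF_RETURN
};

/* Parse "none", "branch", "return", "full" or a comma-separated mix.
   Returns the combined level, or -1 for an unknown keyword. */
static int parse_cf_protection(const char *arg) {
    int level = CF_NONE;
    while (*arg) {
        size_t len = strcspn(arg, ",");
        if (len == 6 && strncmp(arg, "branch", 6) == 0)
            level |= CF_BRANCH;
        else if (len == 6 && strncmp(arg, "return", 6) == 0)
            level |= CF_RETURN;
        else if (len == 4 && strncmp(arg, "full", 4) == 0)
            level |= CF_FULL;
        else if (len == 4 && strncmp(arg, "none", 4) == 0)
            level = CF_NONE;
        else
            return -1;
        arg += len;
        if (*arg == ',')
            arg++;
    }
    return level;
}
```

Under these assumed values, "branch,return" and "full" both evaluate to 3, which is why the combined form needs set semantics rather than one bit per keyword.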

[PATCH] Provide -fcf-protection=branch,return.

2023-05-11 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/89701 * common.opt: Refactor -fcf-protection= to support combination of param. * lto-wrapper.c (merge_and_complain): Adjusted. * opts.c

[PATCH] x86: Add a new option -mdaz-ftz to enable FTZ and DAZ flags in MXCSR.

2023-05-10 Thread liuhongt via Gcc-patches
> The quoted patch shows -shared in context and you didn't post a > backport version > to look at. But yes, we shouldn't change -shared behavior on a > branch, even less so make it > inconsistent between targets. Here's the patch. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for

[PATCH] Detect bswap + rotate for byte permutation in pass_bswap.

2023-05-09 Thread liuhongt via Gcc-patches
The patch doesn't handle: 1. cast64_to_32, 2. memory source with rsize < range. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR middle-end/108938 * gimple-ssa-store-merging.cc (is_bswap_or_nop_p): New function, cut from
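As an illustration of a byte permutation that equals a byte swap followed by a rotate, consider swapping the two bytes inside each 16-bit half of a 32-bit word. This example is not the testcase from the PR; the helper names are hypothetical and `bswap32` is written out by hand so the sketch does not depend on compiler builtins.

```c
#include <stdint.h>

/* Full 32-bit byte reversal. */
static uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Rotate left by n bits, with n reduced modulo 32. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* The open-coded permutation: swap the bytes inside each 16-bit half. */
static uint32_t swap_bytes_in_halves(uint32_t x) {
    return ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
}
```

For 0x11223344 both `swap_bytes_in_halves(x)` and `rotl32(bswap32(x), 16)` give 0x22114433, so the shift-and-mask form could in principle be emitted as one bswap plus one rotate.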

[PATCH V2] [vect]Enhance NARROW FLOAT_EXPR vectorization by truncating integer to lower precision.

2023-05-07 Thread liuhongt via Gcc-patches
> > @@ -4799,7 +4800,8 @@ vect_create_vectorized_demotion_stmts (vec_info > > *vinfo, vec *vec_oprnds, > >stmt_vec_info stmt_info, > >vec _dsts, > >gimple_stmt_iterator *gsi, >

[PATCH] [powerpc] Add a peephole2 to eliminate redundant move from VSX_REGS to GENERAL_REGS when it's from memory.

2023-05-03 Thread liuhongt via Gcc-patches
r14-172-g0368d169492017 uses NO_REGS instead of GENERAL_REGS in memory cost calculation when preferred register class is unknown. + /* Costs for NO_REGS are used in cost calculation on the +1st pass when the preferred register classes are not +known yet. In this case we take

[PATCH v2] Canonicalize vec_merge when mask is constant.

2023-05-03 Thread liuhongt via Gcc-patches
Here's the updated patch with documents in md.texi. Ok for trunk? -- Use swap_commutative_operands_p for canonicalization. When both values have the same operand precedence value, then first bit in the mask should select first operand. The canonicalization should help backends for pattern
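The equivalence behind the canonicalization can be checked with a scalar model of vec_merge: a set mask bit takes the element from the first operand, a clear bit takes it from the second. `NELT` and the function name are hypothetical and chosen only for this sketch.

```c
#define NELT 4

/* Scalar model of (vec_merge a b mask): bit i of mask set selects a[i],
   otherwise b[i]. */
static void vec_merge(const int a[NELT], const int b[NELT], unsigned mask,
                      int out[NELT]) {
    for (int i = 0; i < NELT; i++)
        out[i] = ((mask >> i) & 1) ? a[i] : b[i];
}
```

Swapping the operands and inverting the mask gives the same result, so for a constant mask only one of the two spellings needs to be matched; with mask 0x5 the first bit selects the first operand, while the equivalent form with operands swapped and mask 0xA does not.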

[PATCH] [vect]Enhance NARROW FLOAT_EXPR vectorization by truncating integer to lower precision.

2023-04-26 Thread liuhongt via Gcc-patches
Similar to WIDEN FLOAT_EXPR, when direct_optab does not exist, try intermediate integer type whenever gimple ranger can tell it's safe, i.e. when there's no direct optab for vector long long -> vector float, but the value range of integer can be represented as int, try vector int -> vector
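In scalar terms the transformation looks like the two loops below. The function names are hypothetical; the second loop is only valid under the stated precondition that every input fits in `int32_t`, which is the fact the value-range information is meant to supply.

```c
#include <stdint.h>

/* Direct conversion: needs a long long -> float operation. */
static void cvt_direct(const int64_t *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (float)in[i];
}

/* Conversion through a narrower integer type. Assumes every in[i] is
   within the int32_t range, so the narrowing cast loses nothing. */
static void cvt_via_int32(const int64_t *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (float)(int32_t)in[i];
}
```

When the precondition holds, both loops convert the same integer value to float and therefore produce identical results, while the second form only needs int -> float conversions.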

[PATCH] Add testcases for ffs/ctz vectorization.

2023-04-22 Thread liuhongt via Gcc-patches
Ready to push to trunk. gcc/testsuite/ChangeLog: PR tree-optimization/109011 * gcc.target/i386/pr109011-b1.c: New test. * gcc.target/i386/pr109011-b2.c: New test. * gcc.target/i386/pr109011-d1.c: New test. * gcc.target/i386/pr109011-d2.c: New test. *

[PATCH 1/2] [i386] Support type _Float16/__bf16 independent of SSE2.

2023-04-21 Thread liuhongt via Gcc-patches
> > + if (!TARGET_SSE2) > > +{ > > + if (c_dialect_cxx () > > + && cxx_dialect > cxx20) > > Formatting, both conditions are short, so just put them on one line. Changed. > But for the C++23 macros, more importantly I think we really should > also in ix86_target_macros_internal add

[PATCH 2/2] [i386] def_or_undef __STDCPP_FLOAT16_T__ and __STDCPP_BFLOAT16_T__ for target attribute/pragmas.

2023-04-21 Thread liuhongt via Gcc-patches
> But for the C++23 macros, more importantly I think we really should > also in ix86_target_macros_internal add > if (c_dialect_cxx () > && cxx_dialect > cxx20 > && (isa_flag & OPTION_MASK_ISA_SSE2)) > { > def_or_undef (parse_in, "__STDCPP_FLOAT16_T__"); >

[PATCH] Canonicalize vec_merge when mask is constant.

2023-04-19 Thread liuhongt via Gcc-patches
Use swap_commutative_operands_p for canonicalization. When both values have the same operand precedence value, then first bit in the mask should select first operand. The canonicalization should help backends for pattern match, i.e. x86 backend has lots of vec_merge patterns, combine will create any

[PATCH 2/2] Adjust testcases after better RA decision.

2023-04-19 Thread liuhongt via Gcc-patches
After optimization for RA, memory op is not propagated into instructions(>1), and it makes testcases not generate vxorps since the memory is loaded into the dest, and the dest is never unused now. So rewrite testcases to make the codegen more stable. gcc/testsuite/ChangeLog: *

[PATCH 1/2] Use NO_REGS in cost calculation when the preferred register class are not known yet.

2023-04-19 Thread liuhongt via Gcc-patches
1547 /* If this insn loads a parameter from its stack slot, then it 1548 represents a savings, rather than a cost, if the parameter is 1549 stored in memory. Record this fact. 1550 1551 Similarly if we're loading other constants from memory (constant 1552 pool, TOC references,

[PATCH] [i386] Support type _Float16/__bf16 independent of SSE2.

2023-04-19 Thread liuhongt via Gcc-patches
-Jakub's comments-- That said, these fundamental types whose presence/absence depends on ISA flags are quite problematic IMHO, as they are incompatible with the target attribute/pragmas. Whether they are available or not available depends on whether in this case SSE2 is enabled during

[PATCH] Check hard_regno_mode_ok before setting lowest memory move cost for the mode with different reg classes.

2023-04-03 Thread liuhongt via Gcc-patches
There's a potential performance issue when backend returns some unreasonable value for the mode which can never be allocated with the reg class. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk(or GCC14 stage1)? gcc/ChangeLog: PR rtl-optimization/109351 *

[PATCH] Document signbitm2.

2023-03-31 Thread liuhongt via Gcc-patches
Look through all backends which defined signbitm2. 1. When m is a scalar mode, the dest is SImode. 2. When m is a vector mode, the dest mode is the vector integer mode has the same size and elements number as m. Ok for trunk? gcc/ChangeLog: * doc/md.texi: Document signbitm2. ---

[PATCH] Adjust memory_move_cost for MASK_REGS when MODE_SIZE > 8.

2023-03-30 Thread liuhongt via Gcc-patches
RA will sometimes use the lowest cost of the mode across all the different regclasses w/o checking if it's hard_regno_mode_ok. It's impossible to put modes whose size > 8 into MASK_REGS, so adjust the cost to avoid a potential performance issue. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for

[PATCH V2] Rename ufix_trunc/ufloat* patterns to fixuns_trunc/floatuns* to align with standard pattern name.

2023-03-30 Thread liuhongt via Gcc-patches
> > Just rename the instruction and fix all its call sites. The name of > > the insn pattern is internal to the compiler and can be renamed at > > will. > > Ideally, we should standardize all the names to a standard name, so > e.g. ufix_ -> fixuns_ and ufloat -> floatuns. Updated. There's some

[PATCH] Support vector conversion for AVX512 vcvtudq2pd/vcvttps2udq/vcvttpd2udq.

2023-03-29 Thread liuhongt via Gcc-patches
There's a typo in the standard pattern names for unsigned_{float,fix}: they should be floatunsmn2/fixuns_truncmn2, not ufloatmn2/ufix_truncmn2 as in current trunk. The patch fixes the typo. Also vcvttps2udq is available under AVX512VL, so it can be generated directly instead of being emulated via

[PATCH] Generate vpblendd instead of vpblendw for V4SI under AVX2.

2023-03-29 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ok for GCC14 stage-1(or maybe trunk)? gcc/ChangeLog: * config/i386/i386-expand.cc (expand_vec_perm_blend): Generate vpblendd instead of vpblendw for V4SI under avx2. gcc/testsuite/ChangeLog: *

[PATCH] Remove TARGET_GEN_MEMSET_SCRATCH_RTX since it's not used anymore.

2023-03-21 Thread liuhongt via Gcc-patches
The target hook is only used by i386, and the current definition is same as default gen_reg_rtx. So there's no need for this target hook. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk(or GCC14)? gcc/ChangeLog: * builtins.cc (builtin_memset_read_str): Replace

[PATCH] [vect] Don't peel nonlinear iv(mult or shift) for epilog when vf is not constant.

2023-02-01 Thread liuhongt via Gcc-patches
Normally when vf is not constant, it will be prevented by vectorizable_nonlinear_inductions, but for this case, it failed going into if (STMT_VINFO_RELEVANT_P (stmt_info)) { need_to_vectorize = true; if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def &&

[PATCH] Change AVX512FP16 to AVX512-FP16 which is official name.

2023-01-28 Thread liuhongt via Gcc-patches
Ready to push to trunk. --- htdocs/gcc-12/changes.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html index 30fa4d6e..49055ffe 100644 --- a/htdocs/gcc-12/changes.html +++ b/htdocs/gcc-12/changes.html @@ -754,7 +754,7 @@

[PATCH] Change AVX512FP16 to AVX512-FP16 in the document.

2023-01-28 Thread liuhongt via Gcc-patches
The official name is AVX512-FP16. Ready to push to trunk. gcc/ChangeLog: * config/i386/i386.opt: Change AVX512FP16 to AVX512-FP16. * doc/invoke.texi: Ditto. --- gcc/config/i386/i386.opt | 2 +- gcc/doc/invoke.texi | 6 +++--- 2 files changed, 4 insertions(+), 4

[PATCH] Don't add crtfastmath.o for -shared.

2023-01-13 Thread liuhongt via Gcc-patches
Patches [1] and [2] fixed PR55522 for x86-linux but left all other x86 targets unfixed (x86-cygwin, x86-darwin and x86-mingw32). This patch applies a similar change to other specs using crtfastmath.o. Ok for trunk? [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608528.html [2]

[PATCH V2 2/2] [x86] x86: Add a new option -mdaz-ftz to enable FTZ and DAZ flags in MXCSR.

2022-12-14 Thread liuhongt via Gcc-patches
Update in v2: 1. Support -mno-daz-ftz, and make the option effectively three state as: if (mdaz-ftz) link crtfastmath.o else if ((Ofast || ffast-math || funsafe-math-optimizations) && !shared && !mno-daz-ftz) link crtfastmath.o else Don't link crtfastmath.o 2. Still make the
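The three-state rule in item 1 can be written out as a small predicate. This is a sketch of the decision only, not the actual driver spec; the enum, struct, and function names are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state: option not given, -mdaz-ftz, or -mno-daz-ftz. */
enum daz_ftz { DAZ_FTZ_UNSET, DAZ_FTZ_ON, DAZ_FTZ_OFF };

/* Hypothetical summary of the command-line flags that matter here. */
struct link_flags {
  enum daz_ftz daz_ftz;
  bool fast_math;  /* any of -Ofast, -ffast-math, -funsafe-math-optimizations */
  bool shared;     /* -shared */
};

/* Returns true when crtfastmath.o would be linked under the v2 rule. */
static bool links_crtfastmath(struct link_flags f)
{
  if (f.daz_ftz == DAZ_FTZ_ON)
    return true;                      /* explicit request always wins */
  if (f.fast_math && !f.shared && f.daz_ftz != DAZ_FTZ_OFF)
    return true;                      /* implied by fast-math for executables */
  return false;                       /* otherwise don't link it */
}
```

Note that under this rule an explicit -mdaz-ftz links crtfastmath.o even with -shared, while the fast-math options alone do not.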

[PATCH V2 1/2] x86: Don't add crtfastmath.o for -shared

2022-12-14 Thread liuhongt via Gcc-patches
Update in V2: Split -shared change into a separate commit and add some documentation for it. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? Don't add crtfastmath.o for -shared to avoid changing the MXCSR register when loading a shared library. crtfastmath.o will be used

[PATCH] [x86] x86: Don't add crtfastmath.o for -shared and add a new option -mdaz-ftz to enable FTZ and DAZ flags in MXCSR.

2022-12-13 Thread liuhongt via Gcc-patches
Don't add crtfastmath.o for -shared to avoid changing the MXCSR register when loading a shared library. crtfastmath.o will be used only when building executables. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/55522 PR

[PATCH] [x86] Fix ICE due to condition mismatch between expander and define_insn.

2022-12-06 Thread liuhongt via Gcc-patches
ice.i:7:1: error: unrecognizable insn: 7 | } | ^ (insn 7 6 8 2 (set (reg:V2SF 84 [ vect__3.8 ]) (unspec:V2SF [ (reg:V2SF 86 [ vect__1.7 ]) (const_int 11 [0xb]) ] UNSPEC_ROUND)) "ice.i":5:14 -1 (nil)) during RTL pass: vregs

[PATCH] [x86] Improve ix86_expand_fast_convert_bf_to_sf with new extendbfsf2_1.

2022-12-01 Thread liuhongt via Gcc-patches
After supporting extendbfsf2_1, ix86_expand_fast_convert_bf_to_sf can be improved with pslld as well. CONST_INT_P is not handled since a constant shift can be optimized off. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: * config/i386/i386-expand.cc

[PATCH] [x86] Fix ICE due to incorrect insn type.

2022-11-30 Thread liuhongt via Gcc-patches
;; if reg/mem op (define_insn_reservation "slm_sseishft_3" 2 (and (eq_attr "cpu" "slm") (and (eq_attr "type" "sseishft") (not (match_operand 2 "immediate_operand" "slm-complex, slm-all-eu") in slm.md it will check operands[2] for type sseishft, but for extendbfsf2_1

[PATCH 1/2 V2] Implement hwasan target_hook.

2022-11-29 Thread liuhongt via Gcc-patches
Update in V2: Add documentation for -mlam={none,u48,u57} to x86 options in invoke.texi. gcc/ChangeLog: * doc/invoke.texi (x86 options): Document -mlam={none,u48,u57}. * config/i386/i386-opts.h (enum lam_type): New enum. * config/i386/i386.c

[PATCH] [x86] Fix unrecognizable insn due to illegal immediate_operand (const_int 255) of QImode.

2022-11-28 Thread liuhongt via Gcc-patches
For __builtin_ia32_vec_set_v16qi (a, -1, 2) with !flag_signed_char, it's transformed to __builtin_ia32_vec_set_v16qi (_4, 255, 2) in the gimple, and expanded to (const_int 255) in the rtl. But immediate_operand expects (const_int 255) to be sign-extended to (const_int -1). The mismatch
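The 255 versus -1 mismatch comes down to sign-extending the low 8 bits of the value. The sketch below shows that arithmetic in plain C; the function name `canonical_qi` is hypothetical and is not a GCC helper, it only illustrates the form a QImode constant is expected to take.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 8 bits of V, mimicking the form in which an
   8-bit (QImode) constant is expected to be stored.  */
static int64_t canonical_qi(int64_t v)
{
  uint8_t low = (uint8_t) v;          /* keep only the low 8 bits */
  return low >= 0x80 ? (int64_t) low - 256 : (int64_t) low;
}
```

With this mapping both 255 and -1 end up as -1, so a constant that arrives as 255 has to be converted before it can match.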

[PATCH V3] [x86] Fix incorrect _mm_cvtsbh_ss.

2022-11-24 Thread liuhongt via Gcc-patches
Update in V3: Remove !flag_signaling_nans since there's already HONOR_NANS (BFmode). Here's the patch: After supporting real __bf16, the implementation of _mm_cvtsbh_ss went wrong. The patch adds a builtin to generate pslld for the intrinsic, also extendbfsf2 is supported with pslld when

[PATCH v2] [x86] Fix incorrect _mm_cvtsbh_ss.

2022-11-23 Thread liuhongt via Gcc-patches
After supporting real __bf16, the implementation of _mm_cvtsbh_ss went wrong. The patch adds a builtin to generate pslld for the intrinsic, also extendbfsf2 is supported with pslld when !flag_signaling_nans && !HONOR_NANS (BFmode). truncsfbf2 is supported with vcvtneps2bf16 when
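The reason a shift such as pslld can do the widening is that a bfloat16 value is the upper 16 bits of the corresponding IEEE single-precision value. A minimal scalar sketch in portable C, assuming the bf16 value is available as raw bits (the function name is hypothetical, and this is not the code the patch generates):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen raw bfloat16 bits to float by placing them in the upper half
   of a 32-bit word, the scalar equivalent of a left shift by 16.  */
static float bf16_bits_to_float(uint16_t bits)
{
  uint32_t wide = (uint32_t) bits << 16;
  float f;
  memcpy(&f, &wide, sizeof f);        /* reinterpret the bits as a float */
  return f;
}
```

A plain shift leaves a signaling NaN's bits unchanged rather than quieting it, which is presumably why the snippet ties the pslld path to the NaN-related conditions.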
