(insn 98 94 387 2 (parallel [
(set (reg:TI 337 [ _32 ])
(ashift:TI (reg:TI 329)
(reg:QI 521)))
(clobber (reg:CC 17 flags))
]) "test.c":11:13 953 {ashlti3_doubleword}
is reloaded into
(insn 98 452 387 2 (parallel [
For the pattern below, the RA may still allocate r162 as a v/k register and
try to reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi,
which results in a linker error.
(set (reg:DI 162)
(mem/u/c:DI
(const:DI (unspec:DI
[(symbol_ref:DI ("a") [flags 0x60] )]
ix86_hardreg_mov_ok was added by r11-5066-gbe39636d9f68c4
>The solution proposed here is to have the x86 backend/recog prevent
>early RTL passes composing instructions (that set likely_spilled hard
>registers) that they (combine) can't simplify, until after reload.
>We allow sets
> Also, in case the insn is deleted, do:
>
> emit_note (NOTE_INSN_DELETED);
>
> DONE;
>
> instead of leaving (const_int 0) in the stream.
>
> So, the above insn preparation statements should read:
>
> --cut here--
> if (constm1_operand (operands[2], mode))
> emit_move_insn (operands[0],
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/115843
* config/i386/predicates.md (const0_or_m1_operand): New
predicate.
* config/i386/sse.md (*_store_mask_1): New
pre_reload define_insn_and_split.
>- _5 = __atomic_fetch_or_8 (_work_pending_p, 1, 0);
>- # DEBUG old => (long int) _5
>+ _6 = .ATOMIC_BIT_TEST_AND_SET (_work_pending_p, 0, 1, 0,
>__atomic_fetch_or_8);
>+ # DEBUG old => NULL
> # DEBUG BEGIN_STMT
>- # DEBUG D#2 => _5 & 1
>+ # DEBUG D#2 => NULL
>...
>- _10 = ~_5;
>- _8 =
I have a build failure on NetBSD as the namespace pollution avoidance causes
a direct hit with the system /usr/include/math.h
===
In file included from /usr/src/local/gcc/obj/gcc/include/emmintrin.h:31,
from
>> Hmm, now all avx512 tests SIGILL when testing with -m32:
>>
>> Dump of assembler code for function __get_cpuid_count:
>> => 0x08049500 <+0>: kmovd %eax,%k2
>> 0x08049504 <+4>: kmovd %edx,%k1
>> 0x08049508 <+8>: pushf
>> 0x08049509 <+9>: pushf
>> 0x0804950a <+10>:
From: "H.J. Lu"
>The above reads like it would be worth splitting branch_prediction_hints
>into branch_prediction_hints_taken and branch_prediction_hints_not_taken
>given not-taken is the default and thus will just increase code size?
>According to Intel® 64 and IA-32 Architectures Optimization
The patch avoids the SIGILL on non-AVX512 machines caused by the kmovd
generated in the dynamic check.
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
PR target/115748
* gcc.target/i386/avx512-check.h: Move runtime check into a
separate function and guard it with target
From: "H.J. Lu"
According to the Intel® 64 and IA-32 Architectures Optimization Reference
Manual [1], Branch Hint is updated for Redwood Cove.
Cut from [1]:
Starting with the Redwood Cove microarchitecture, if the predictor has
no stored information about a branch,
late_combine will combine lshift + zero into *lshifrtsi3_1_zext, which
causes an extra mov between gpr and kmask; add ?k to the pattern.
gcc/ChangeLog:
PR target/115610
* config/i386/i386.md (<*insnsi3_zext): Add alternative ?k,
enable it only for lshiftrt and under avx512bw.
Move pass_stv2 and pass_rpad after the pre_reload pass_late_combine, and
define target_insn_cost to prevent the post_reload pass_late_combine from
reverting the optimization done in pass_rpad.
Adjust testcases since pass_late_combine generates better code but
breaks scan-assembly.
I.e.,
Under 32-bit target,
operation.
After enabling late_combine, they're combined into embedded broadcast
operations.
Tested with SPEC2017, late_combine reduces codesize by ~0.6%, which means
there are lots of small improvements.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
liuhongt (3
The testcases are supposed to scan for vpopcnt{b,w,d,q} operations
with a k mask, but mask is defined as an uninitialized local variable,
which is set to 0 at the RTL expand phase.
It is then further simplified away by late_combine, which caused the
scan-assembly failure.
Move the definition of mask outside
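A minimal sketch of the fix's shape (hypothetical testcase; the intrinsic
and names are illustrative, compile with -mavx512bitalg):

#include <immintrin.h>

__m512i src;
__mmask64 mask; /* moved out of the function so it is no longer a known 0 */

__m512i
foo (void)
{
  /* With an uninitialized local "__mmask64 mask;", expand materializes
     mask = 0 and late_combine deletes the masked vpopcntb, breaking the
     scan-assembler for vpopcntb with a k mask.  */
  return _mm512_maskz_popcnt_epi8 (mask, src);
}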
For the testcase in PR115406, here is part of the dump.
char D.4882;
vector(1) _1;
vector(1) signed char _2;
char _5;
:
_1 = { -1 };
When assigning { -1 } to vector(1) <signed-boolean:8>,
since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of the dest
with each vector
gcc/ChangeLog:
PR target/115517
* config/i386/mmx.md (vcondv2sf): Removed.
(vcond): Ditto.
(vcond): Ditto.
(vcondu): Ditto.
(vcondu): Ditto.
* config/i386/sse.md (vcond): Ditto.
(vcond): Ditto.
(vcond): Ditto.
> Richard suggests that we implement the "obvious" transforms like
> inversion in the middle-end but if for example unsigned compares
> are not supported the us_minus + eq + negative trick isn't on
> that list.
>
> The main reason to restrict vec_cmp would be to avoid
> a <= b ? c : d going with
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*_cvtmask2_not): New pre_reload
splitter.
(*_cvtmask2_not): Ditto.
(*avx2_pcmp3_6): Ditto.
(*avx2_pcmp3_7): Ditto.
---
gcc/config/i386/sse.md | 97
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*_movmsk_lt_avx512): New
define_insn_and_split.
(*_movmsk_ext_lt_avx512):
Ditto.
(*_pmovmskb_lt_avx512): Ditto.
(*_pmovmskb_zext_lt_avx512): Ditto.
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Add define_insn_and_split for the optimization done in
ix86_expand_int_vcond.
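For instance, in GNU C vector extensions (a sketch, not the actual
testcase), each lane of a vector comparison already is x < 0 ? -1 : 0,
which is exactly an arithmetic shift right by 31:

typedef int v4si __attribute__((vector_size (16)));

v4si
f (v4si x)
{
  return x < 0; /* per lane: x < 0 ? -1 : 0, i.e. x >> 31 (vpsrad) */
}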
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md ("*ashr3_1"): New
define_insn_and_split.
These define_insn_and_split are needed after vcond{,u,eq} is obsolete.
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*_blendv_gt): New
define_insn_and_split.
(*_blendv_gtint):
Ditto.
(*_blendv_not_gtint):
Ditto.
x86-64 -O2
-march=sapphirerapids -O2
Didn't observe an obvious performance change; mostly the same binaries.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Any comments?
liuhongt (7):
[x86] Add more splitters to match (unspec [op1 op2 (gt op3
constm1_operand)] UNSPEC_BLENDV)
Lower AV
These versions of the min/max patterns implement exactly the operations
min = (op1 < op2 ? op1 : op2)
max = (!(op1 < op2) ? op1 : op2)
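A scalar sketch of these exact semantics (illustrative; the real patterns
are vector define_insn_and_splits). Note the NaN asymmetry they encode:
when either operand is a NaN, min yields op2 while max yields op1:

static inline float
min_ (float op1, float op2) { return op1 < op2 ? op1 : op2; }

static inline float
max_ (float op1, float op2) { return !(op1 < op2) ? op1 : op2; }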
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md (*minmax3_1): New pre_reload
define_insn_and_split.
(*minmax3_2):
> But rtx_cost invokes targetm.rtx_cost which allows to avoid that
> recursive processing at any level. You're dealing with MEM [addr]
> here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
> the best way to deal with this? Since this is the MEM [addr] case
> we know it's not LEA, no?
416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
The commit adjusted rtx_cost of MEM to reduce the cost of (add op0 disp).
But the cost of ADDR can be cheaper than that of XEXP (addr, 0) when it's a
lea. That is the case in the PR; the patch uses the lower cost to enable more
simplification and fix
Here's the patch committed.
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Move the optimization done in ix86_expand_int_vcond to match.pd
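A scalar illustration of the two folds (for 32-bit int; in general the
shift amount is precision - 1):

int f1 (int a) { return a < 0 ? -1 : 0; }       /* -> a >> 31 */
unsigned f2 (int a) { return a < 0 ? 1 : 0; }   /* -> (unsigned) a >> 31 */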
gcc/ChangeLog:
PR target/114189
* match.pd: Simplify a < 0 ? -1 : 0 to (signed) >> 31 and a
> I think the check for TYPE_UNSIGNED should be of TREE_TYPE (@0) rather
> than type here.
Changed
> Or maybe you need `types_match (type, TREE_TYPE (@0))` too.
And use tree_nop_conversion_p (type, TREE_TYPE (@0)) and add a view_convert
to the rshift.
Bootstrapped and regtested on
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Move the optimization done in ix86_expand_int_vcond to match.pd
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}, aarch64-linux-gnu.
Ok for trunk?
gcc/ChangeLog:
PR target/114189
The tune was added by PR79390 for SciMark2 on Broadwell.
For the latest GCC, with or without -mtune-ctrl=^one_if_conv_insn,
GCC generates the same binary for SciMark2. And for SPEC2017,
there's no big impact for SKX/CLX/ICX, and small improvements on SPR
and later.
gcc/ChangeLog:
*
Use reg_or_subregno instead.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Committed as an obvious patch.
gcc/ChangeLog:
PR target/115452
* config/i386/i386-features.cc (scalar_chain::convert_op): Use
reg_or_subregno instead of REGNO to avoid ICE.
r15-1100-gec985bc97a0157 improves the handling of ternlog instructions;
now GCC can recognize lots of pternlog_operand with different
variants.
The patch adjusts rtx_costs for that, so pass_combine can
reasonably generate more optimal vpternlog instructions.
I.e.,
for avx512f-vpternlog-3.c, with the
>
> I think if you only handle CONST_INT_P, you should check just for that, and
> in both places where you check for CONST_VECTOR_DUPLICATE_P (there is one
> spot 2 lines above this).
> So add
> && CONST_INT_P (XVECEXP (XEXP (op0, 1), 0, 0))
> and
> && CONST_INT_P (XVECEXP (op1, 0, 0))
> tests
In theory, const_wide_int can also be handled with an extra check for each
component of the HOST_WIDE_INT array, and the check is needed for both
the shift and the bit_and operands.
I assume the optimization opportunity is rare, so the patch just adds an
extra check to make sure GET_MODE_INNER (mode) can fix
gcc/testsuite/ChangeLog:
* gcc.dg/vect/pr112325.c: Add the additional option --param
max-completely-peeled-insns=200 for powerpc64*-*-*.
---
gcc/testsuite/gcc.dg/vect/pr112325.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/gcc/testsuite/gcc.dg/vect/pr112325.c
For power10, there are 3 extra REG_EQUIV notes with (fix:SI ...). To avoid
the failure, check that (fix:SI comes from the pattern, not from a NOTE.
gcc/testsuite/ChangeLog:
PR target/115365
* gcc.dg/pr100927.c: Don't scan fix:SI from the note.
---
gcc/testsuite/gcc.dg/pr100927.c | 2 +-
1 file
> Can you add a testcase for this? I don't mind if it's x86 specific and
> does a bit of asm scanning.
>
> Also note that the context for this patch has changed, so it won't
> automatically apply. So be extra careful when updating so that it goes
> into the right place (all the more reason to
Committed as an obvious patch.
gcc/testsuite/ChangeLog:
PR target/115299
* gcc.target/i386/pr86722.c: Also scan for blendvpd.
---
gcc/testsuite/gcc.target/i386/pr86722.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gcc/testsuite/gcc.target/i386/pr86722.c
W/o TARGET_SSE4_1, it takes 3 instructions (pand, pandn and por) for
movdfcc/movsfcc, which can fail the cost comparison. Increasing the
branch cost could hurt performance for other modes, so instead add
some preference for floating-point ifcvt.
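An illustrative shape that goes through movdfcc ifcvt (assumed testcase
form, not from the patch):

double
f (double a, double b, double c, double d)
{
  /* W/o SSE4.1 the conditional move needs the 3-instruction
     pand/pandn/por sequence instead of a single blendv.  */
  return a < b ? c : d;
}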
Bootstrapped and regtested on
Committed as an obvious patch.
gcc/ChangeLog:
* config/i386/emmintrin.h (__double_u): Rename from double_u.
(_mm_load_sd): Replace double_u with __double_u.
(_mm_store_sd): Ditto.
(_mm_loadh_pd): Ditto.
(_mm_loadl_pd): Ditto.
*
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (vcond_mask_): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr114125.c: New test.
---
gcc/config/i386/sse.md | 20
> IMO, there is no need for CONST_INT_P condition, we should also allow
> symbol_ref, label_ref and const (all allowed by
> x86_64_immediate_operand predicate), these all decay to an immediate
> value.
Changed.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
For MEM,
When I applied Roger's patch [1], there was an ICE due to it.
The patch fixes the latent bug.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651365.html
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Pushed to trunk.
gcc/ChangeLog:
* config/i386/sse.md
(___mask):
For MEM, rtx_cost iterates over each subrtx and adds up the costs,
so MEM (reg) costs 5 while MEM (reg + 4) costs 9, which is not
accurate for x86. Ideally address_cost should be used, but it reduces
the cost too much.
So the current solution is to make a constant disp as cheap as possible.
Update in V2:
Guard constant folding for overflow values in
fold_convert_const_int_from_real with flag_trapping_math.
Add -fno-trapping-math to related testcases which warn for overflow
in conversion from floating point to integer.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for
Committed as an obvious patch.
gcc/testsuite/ChangeLog:
PR target/114148
* gcc.target/i386/pr106010-7b.c: Refine testcase.
---
gcc/testsuite/gcc.target/i386/pr106010-7b.c | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git
Update in V3:
> Since this was about vectorization can you instead add a testcase to
> gcc.dg/vect/ and check for
> vectorization to happen?
Move to vect/pr112325.c.
>
> I believe the if (unr_insn <= 0) check can go as well.
Removed.
> as said, you want to do
>
> cunrolli = false;
>
>
>> Hard to find a default value satisfying all testcases.
>> some require loop unroll with 7 insns increment, some don't want loop
>> unroll w/ 5 insn increment.
>> The original 2/3 reduction happened to meet all those testcases(or the
>> testcases are constructed based on the old 2/3).
>> Can we
According to the IEEE standard, for conversions from floating point to
integer: when a NaN or infinite operand cannot be represented in the
destination format and this cannot otherwise be indicated, the invalid
operation exception shall be signaled. When a numeric operand would
convert to an integer
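A small example of the conversions affected (sketch; NaN behaves the same
way as infinity here):

int
f (void)
{
  /* Inf cannot be represented in int, so IEEE requires the invalid
     operation exception; with -ftrapping-math this must not be
     constant-folded away.  */
  return (int) __builtin_inf ();
}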
For a CONST_VECTOR_DUPLICATE_P in the constant pool, it is just a broadcast
or one of the variants in ix86_vector_duplicate_simode_const.
Adjust the cost to COSTS_N_INSNS (2) + speed, which should be a little
bit larger than a broadcast.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
When the mask is ((1 << (prec - imm)) - 1), which is used to clear the
upper bits of A, the AND can be simplified to LSHIFTRT.
I.e., simplify
(and:v8hi
(ashiftrt:v8hi A 8)
(const_vector 0xff x8))
to
(lshiftrt:v8hi A 8)
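In GNU C vector extensions the simplified shape looks like (illustrative):

typedef short v8hi __attribute__((vector_size (16)));

v8hi
f (v8hi a)
{
  /* The & 0xff masks off exactly the bits the arithmetic shift
     sign-extended, so this is a logical shift right by 8.  */
  return (a >> 8) & 0xff;
}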
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
For vec_pack_truncv8si/v4si w/o AVX512,
(const_vector:v4si (const_int 0x) x4) is used as the mask to clear the
upper 16 bits, but vpblendw with a zero vector can also be used, and the
zero vector is cheaper than (const_vector:v4si (const_int 0x) x4).
Bootstrapped and regtested on
pshufb is available under TARGET_SSSE3, so
ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3.
W/o TARGET_SSSE3, if we set one_operand_p to true, ix86_expand_vec_perm_const_1
could return false.
With the patch under -march=x86-64-v2
v8qi
foo (v8qi a)
{
return a >> 5;
}
Since there is no corresponding instruction, the shift operation for
vector int8 is implemented using the instructions for vector int16,
but for some special shift counts it can be transformed into vpcmpgtb.
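For example (a sketch; the count prec - 1 is one such special case):

typedef signed char v16qi __attribute__((vector_size (16)));

v16qi
f (v16qi a)
{
  /* a >> 7 only propagates the sign bit, i.e. a < 0 ? -1 : 0,
     which is vpcmpgtb of zero against a.  */
  return a >> 7;
}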
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
As the testcase in the PR shows, at O3 cunrolli may prevent vectorization
of the innermost loop and increase register pressure.
The patch removes the 1/3 reduction of unr_insn for the innermost loop for
UL_ALL. ul != UL_ALL is needed since some small-loop complete unrolling at
O2 relies on the reduction.
The Fortran standard does not specify what the result of the MAX
and MIN intrinsics is if one of the arguments is a NaN. So it
should be ok to transform the reduction for IFN_COND_MIN with vectorized
COND_MIN and REDUC_MIN.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ready to push to trunk.
gcc/ChangeLog:
PR target/113090
* config/i386/i386-expand.cc
(expand_vec_perm_punpckldq_pshuf): New function.
(ix86_expand_vec_perm_const_1): Try
expand_vec_perm_punpckldq_pshuf
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/113079
* config/i386/mmx.md (usdot_prodv8qi): New expander.
(sdot_prodv8qi): Ditto.
(udot_prodv8qi): Ditto.
(usdot_prodv4hi): Ditto.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (usdot_prodv*qi): Extend to VI1_AVX512
with vpmaddwd when avxvnni/avx512vnni is not available.
---
gcc/config/i386/sse.md | 55
The Intel Decimal Floating-Point Math Library is available as open-source on
Netlib[1].
[1] https://www.netlib.org/misc/intel/.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
libgcc/config/libbid/ChangeLog:
* bid128_fma.c (add_and_round): Fix bug: the result
So when both the source operand and the dest operand require AVX512
MASK_REGS, the RA can allocate a MASK_REGS register instead of a GPR to
avoid reloading it from GPR to MASK_REGS.
It's similar to what was done for the logic patterns.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
> > So, try to add some other variable with larger size and smaller alignment
> > to the frame (and make sure it isn't optimized away).
> >
> > alignb above is the alignment of the first partition's var, if
> > align_frame_offset really needs to depend on the var alignment, it probably
> > should
Also fixed a typo in the testcase.
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
PR tree-optimization/114396
* gcc.target/i386/pr114396.c: Move to...
* gcc.c-torture/execute/pr114396.c: ...here.
---
.../{gcc.target/i386 => gcc.c-torture/execute}/pr114396.c | 6
wi::from_mpz doesn't take a sign argument; we want the value to be wrapped
instead of saturated, so pass utype and true to it, which fixes the
bug.
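The call shape after the fix (a sketch; res is the mpz_t value computed by
the hypothetical caller, and utype the unsigned variant of the type):

/* wi::from_mpz (type, x, wrap) has no sign parameter; wrapping
   semantics come from the unsigned type plus wrap = true.  */
wide_int w = wi::from_mpz (utype, res, true);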
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and backport to gcc13?
gcc/ChangeLog:
PR tree-optimization/114396
gcc/ChangeLog:
* doc/invoke.texi: Document -fexcess-precision=16.
---
gcc/doc/invoke.texi | 3 +++
1 file changed, 3 insertions(+)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 85c938d4a14..6bc1ebf9721 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -14930,6
Ok for trunk?
gcc/ChangeLog:
* doc/invoke.texi: Document -fexcess-precision=16.
---
gcc/doc/invoke.texi | 3 +++
1 file changed, 3 insertions(+)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 85c938d4a14..673420fdd3e 100644
--- a/gcc/doc/invoke.texi
+++
Commit r14-9459-g618e34d56cc38e only handles
general_scalar_chain::convert_op. The patch also handles
timode_scalar_chain::convert_op to avoid a potential similar bug.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and backport to releases/gcc-13 branch?
gcc/ChangeLog:
It fixes an ICE on an unrecognized logic-operation insn which is generated
by the lroundmn2 expanders.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/114334
* config/i386/i386.md (mode): Add new number V8BF,V16BF,V32BF.
When we split
(insn 37 36 38 10 (set (reg:DI 104 [ _18 ])
(mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) [6 MEM[(struct
SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])) "test.C":22:42 84
{*movdi_internal}
(expr_list:REG_EH_REGION (const_int -11 [0xfff5])
If alignb > ASAN_RED_ZONE_SIZE and offset[0] is not a multiple of
alignb, (base_align_bias - base_offset) may not be aligned to alignb, which
caused a segmentation fault.
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk and backport to GCC13?
gcc/ChangeLog:
PR sanitizer/110027
Target maybe_x32 doesn't check whether the platform has gnu/stubs-x32.h,
but it's included by stdint.h in the testcase.
Adjust the testcase: remove stdint.h and use 'typedef long long int64_t'
instead.
Committed as an obvious patch.
gcc/testsuite/ChangeLog:
PR target/113711
*
---
htdocs/gcc-14/changes.html | 5 +
1 file changed, 5 insertions(+)
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index 6d917535..a022357a 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -499,6 +499,11 @@ a work-in-progress.
There are 2 cases:
1. hwasan-poison-optimisation.c is supposed to scan for the call to
__hwasan_tag_mismatch4, and x86 has a different mnemonic (call) from
aarch64 (bl), so adjust the testcase to scan for either call or bl.
2. alloca-outside-caught.c/vararray-outside-caught.c are supposed to
scan for mismatched tags and
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/i386-options.cc (ix86_option_override_internal):
Enable -mlam=u57 by default when compiled with
-fsanitize=hwaddress.
---
gcc/config/i386/i386-options.cc | 9 +
1 file changed, 9 insertions(+)
diff --git
After vect_early_break is supported, more vectorization is enabled (3
COPYSIGNs), so adjust the testcase for that.
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/part-vect-copysignhf.c: Remove
-ftree-vectorize from dg-options.
---
After r14-7124-g6686e16fda4190, the testcase can be optimized to
MAX_EXPR if the backend supports that. So I adjusted the testcase to
scan for MAX_EXPR, but it failed on many platforms which don't support
that.
As pinski mentioned, target vect_no_int_min_max is only available
under the vect directory, so
To override -fcf-protection, -fcf-protection=none needs to be added
first and then followed by -fcf-protection=xxx.
---
htdocs/gcc-14/changes.html | 6 ++
1 file changed, 6 insertions(+)
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index e3a68998..72b0d291 100644
---
After r14-2692-g1c6231c05bdcca, the option is defined as an EnumSet and
-fcf-protection=branch won't unset any other bits since they're in
different groups. So to override -fcf-protection, an explicit
-fcf-protection=none needs to be added and then followed by
-fcf-protection=XXX
Bootstrapped and
> I wonder if you can amend the existing patterns instead by iterating
> over cond/vec_cond. There are quite some (look for uses of
> minmax_from_comparison) that could be adapted to vectors.
>
> The ones matching the simple form you match are
>
> #if GIMPLE
> /* A >= B ? A : B -> max (A, B) and
Similarly for A < B ? B : A to MAX_EXPR.
There is code in the frontend to optimize such patterns, but it failed to
handle the testcase in the PR since the pattern is only exposed at the
gimple level when folding backend builtins.
pr95906 can now be optimized to MAX_EXPR, as is commented in the
testcase.
// FIXME: this
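A scalar illustration of the fold (sketch; the PR's testcase only reaches
this shape after backend builtins are folded into gimple):

int
max_ (int a, int b)
{
  return a < b ? b : a; /* folded to MAX_EXPR <a, b> */
}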
vpbroadcastd/vpbroadcastq is available under TARGET_AVX2, but the
vec_dup{v4di,v8si} pattern is available under AVX with a memory operand.
And it will cause LRA/Reload to generate a spill and reload if we put the
constant in a register.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to
x86 doesn't support horizontal reduction instructions; reduc_op_scal_m
is emulated with vec_extract_half + op (at half vector length).
Take that into account when calculating the cost for vectorization.
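A sketch of the emulation in GNU C (needs GCC 12+ for
__builtin_shufflevector; illustrative, not the backend expander):

typedef int v4si __attribute__((vector_size (16)));

int
reduc_plus_v4si (v4si v)
{
  /* vec_extract_half + add, log2(4) = 2 times, rather than one
     horizontal instruction.  */
  v4si t = v + __builtin_shufflevector (v, v, 2, 3, 2, 3);
  t = t + __builtin_shufflevector (t, t, 1, 0, 1, 0);
  return t[0];
}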
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
No big performance impact on SPEC2017 as
> since you are looking at TYPE_PRECISION below you want
> VECTOR_INTEGER_TYPE_P here as well? The alternative
> would be to compare TYPE_SIZE.
>
> Some of the checks feel redundant but are probably good for
> documentation purposes.
>
> OK with using VECTOR_INTEGER_TYPE_P
Actually, the data
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/112904
* config/i386/mmx.md (*xop_pcmov_): New define_insn.
gcc/testsuite/ChangeLog:
* g++.target/i386/pr112904.C: New test.
---
gcc/config/i386/mmx.md
If the function doesn't clobber any sse registers, or only clobbers the
128-bit part, then vzeroupper isn't issued before the function exit;
the status is not CLEAN but ANY after the function.
Also for a sibling_call, it's safe to issue a vzeroupper. Also there
could be a missing vzeroupper since there's no
Like r14-5990-gb4a7c1c8c59d19, but this patch optimizes udot_prod.
Since (zero_extend) (unsigned char) -> int is equal
to (zero_extend) (unsigned char) -> short
+ (sign_extend) (short) -> int.
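A scalar sketch of the equivalence (both functions return the same value
for every c in [0, 255]):

int ext_direct (unsigned char c)    { return c; }         /* zero_extend u8 -> s32 */
int ext_via_short (unsigned char c) { return (short) c; } /* zero_extend u8 -> 16,
                                                             then sign_extend -> s32 */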
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
It should be safe to
i.e., for the cases below:
a[0] = b1;
a[1] = b2;
..
a[n] = bn;
There are extra dependences when constructing the vector, but not for the
scalar stores. According to experiments, it's generally worse.
The patch adds a cut-off heuristic for when the vec_stmt is just
vec_construct and vector store. It
> Hmm, I would suggest you put reg_needed into the class and accumulate
> over all vec_construct, with your patch you pessimize a single v32qi
> over two separate v16qi for example. Also currently the whole block is
> gated with INTEGRAL_TYPE_P but register pressure would be also
> a concern for
The loop vectorizer will use vec_perm to select the lower part of a vector;
there can be some redundancy when using subreg in
reduc__scal_m, because rtl cse can't figure out that a vec_select of the
lower part is just a subreg.
I'm trying to canonicalize vec_select to subreg like aarch64 did, but
there are so many
Currently sdot_prodv*qi is available under TARGET_AVXVNNIINT8, but it
can be emulated by
vec_unpacks_lo_v32qi
vec_unpacks_lo_v32qi
vec_unpacks_hi_v32qi
vec_unpacks_hi_v32qi
sdot_prodv16hi
sdot_prodv16hi
add3v8si
which is faster than the original
vect_patt_39.11_48 = WIDEN_MULT_LO_EXPR ;
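An illustrative loop that exercises this expander (assumed testcase shape;
compile with e.g. -O3 -mavx2, i.e. without AVXVNNIINT8):

int
sdot (signed char *a, signed char *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}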
For vec_construct, the components must be live at the same time if
they're not loaded from memory; when the number of those components
exceeds the number of available registers, spills happen. Try to account
for that with a rough estimation.
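A small illustration of the liveness issue (sketch): all scalar components
must be live at once at the constructor, so wider cases (e.g. v64qi built
from 64 chars) can exceed the register file and spill:

typedef int v4si __attribute__((vector_size (16)));

v4si
build (int a, int b, int c, int d)
{
  return (v4si) { a, b, c, d };
}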
??? Ideally, we should have an overall estimation of register pressure
if
From: "Zhang, Annita"
Avoid_fma_chain was enabled in m_SAPPHIRERAPIDS, m_ALDERLAKE and
m_CORE_HYBRID. It can also be enabled in m_GENERIC to improve the
performance of -march=x86-64-v3/v4 with -mtune=generic set by
default. One SPEC2017 benchmark 510.parest_r can improve greatly due
to it. From
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/112325
* config/i386/i386-expand.cc (emit_reduc_half): Handle
V8QImode.
* config/i386/mmx.md (reduc__scal_): New expander.
(reduc__scal_v4qi): Ditto.
The missing cbranchv*{hi,qi}4 may be needed by early-break vectorization.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (cbranch4): Extend to Vector
HI/QImode.
---
gcc/config/i386/sse.md | 10 --
1 file
The BB vectorizer relies on the backend support of
.REDUC_{PLUS,IOR,XOR,AND} to vectorize reduction.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/112325
* config/i386/sse.md (reduc__scal_): New expander.
The x86 backend supports reduc_{and,ior,xor}_scal_m for vector integer
modes.
Ok for trunk?
gcc/testsuite/ChangeLog:
* lib/target-supports.exp (vect_logical_reduc): Add i?86-*-*
and x86_64-*-*.
---
gcc/testsuite/lib/target-supports.exp | 3 ++-
1 file changed, 2 insertions(+), 1
Update in V2:
1) Add some comments before the pattern.
2) Remove ? from view_convert.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
While working on PR112443, I noticed some misoptimizations:
after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend
The new added splitter will generate
(insn 58 56 59 2 (set (reg:V4HI 20 xmm0 [129])
(vec_duplicate:V4HI (reg:HI 22 xmm2 [123]))) "testcase.c":16:21 -1
But we only have
(define_insn "*vec_dupv4hi"
[(set (match_operand:V4HI 0 "register_operand" "=y,Yw")
(vec_duplicate:V4HI
if (TREE_CODE (init_expr) == INTEGER_CST)
init_expr = fold_convert (TREE_TYPE (vectype), init_expr);
else
gcc_assert (tree_nop_conversion_p (TREE_TYPE (vectype),
TREE_TYPE (init_expr)));
and init_expr is a 24-bit integer type while vectype has