Disable the tune for Zhaoxin/CLX/SKX since it could hurt performance
for the inner loop.
According to the latest test, align_loop helps SPEC2017 performance on EMR and
Znver4, so I'll still keep the tune for the generic part.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Any comment?
gcc/
force_operand issues an ICE when the input
is (subreg:DI (us_truncate:V8QI)); that is probably an invalid rtx,
so refine the backend patterns to avoid generating it.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/117318
* config/i386/s
The conversion is supported by a vector permutation with a zero vector.
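For illustration only (a scalar sketch, not the patch itself): a bfloat16
value is just the high 16 bits of an IEEE binary32, so widening it to float
amounts to pairing each element with a zero low half, which is the
interleave-with-a-zero-vector permutation the expander emits.

#include <stdint.h>
#include <string.h>

static float
bf16_to_float (uint16_t bf)
{
  uint32_t bits = (uint32_t) bf << 16;  /* low half comes from the zero vector */
  float f;
  memcpy (&f, &bits, sizeof f);         /* reinterpret the 32-bit pattern */
  return f;
}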
gcc/ChangeLog:
* config/i386/i386-expand.cc
(ix86_expand_vector_bf2sf_with_vec_perm): New function.
* config/i386/i386-protos.h
(ix86_expand_vector_bf2sf_with_vec_perm): New Declare.
* config/i386/mmx.m
Generate native instruction whenever possible, otherwise use vector
permutation with odd indices.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/i386-expand.cc
(ix86_expand_vector_sf2bf_with_vec_perm): New function.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk and backport to the release branch.
gcc/ChangeLog:
PR target/117240
* config/i386/i386-builtin.def: Add avx/avx512f to vaes
ymm/zmm builtins.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr11
r15-974-gbf7745f887c765e06f2e75508f263debb60aeb2e has optimized for
jcc/setcc, but missed movcc.
The patch supports movcc.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/117232
* config/i386/sse.md (*kortest_cmp_movqicc):
The optimization relies on other patterns which are only available in
GCC 14 and above, so restore the xfail for the GCC 13/12 branches.
Pushed as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512bw-pr103750-2.c: Add xfail for ia32.
---
gcc/testsuite/gcc.target/i386/avx512bw-pr1
r12-6103-g1a7ce8570997eb combines vpcmpuw + zero_extend to vpcmpuw
with the pre_reload splitter, but the splitter transforms the
zero_extend into a subreg, which makes reload think the upper part is
garbage; that is not correct.
The patch adjusts the zero_extend define_insn_and_split to
define_insn to
Also add the hard_float target requirement to avoid failures on arm-eabi (cortex-m0).
Verified with cross compilers for powerpc64le-linux-gnu and sparc-sun-solaris2.11.
Ready to push to trunk.
gcc/testsuite/ChangeLog:
PR testsuite/115365
* gcc.dg/pr100927.c: Adjust testcase to avoid scan FIX in REG_EQUIV.
-
---
htdocs/gcc-15/changes.html | 10 ++
1 file changed, 10 insertions(+)
diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html
index 6dc46a52..8a238256 100644
--- a/htdocs/gcc-15/changes.html
+++ b/htdocs/gcc-15/changes.html
@@ -36,6 +36,16 @@ a work-in-progress.
General
For masked FMA, there are 2 forms of RTL representation:
1) (vec_merge (fma op2 op1 op3) op1 mask)
2) (vec_merge (fma op1 op2 op3) op1 mask)
This is because op1 and op2 are commutative in RTL (the second op1 is
written as (match_dup 1)).
We once tried to replace (match_dup 1)
with (match_operand:VFH_AV
For x86 masked fma, there're 2 rtl representations
1) (vec_merge (fma op2 op1 op3) op1 mask)
2) (vec_merge (fma op1 op2 op3) op1 mask).
(define_insn "<avx512>_fmadd_<mode>_mask"
  [(set (match_operand:VFH_AVX512VL 0 "register_operand" "=v,v")
	(vec_merge:VFH_AVX512VL
	  (fma:VF
diate_operand" "0")) to enable more flexibility for pattern
matching and recog, but it triggered an ICE in reload (reload can handle
at most one operand with a "0" constraint).
So we need to either add 2 patterns in the backend or just do the
canonicalization in the middle-end.
The
Update in V3.
>The testcase looks bogus:
>
> b[i+k] = b[i+k-5] + 2;
>
>accesses b[-3], can you instead adjust the inner loop to start with k == 4?
Changed; also adjusted b[100] to b[200] to avoid an out-of-bounds array access.
>Please remove this testcase - even with fully masking we'd need alias
>versi
>We'd also need to update the documentation:
>... The @samp{very-cheap} model only
>allows vectorization if the vector code would entirely replace the
>scalar code that is being vectorized. For example, if each iteration
>of a vectorized loop would only be able to handle exactly four iterations
>
r15-1737-gb06a108f0fbffe lowers AVX512 kmask comparisons to AVX2 ones,
but wrongly lowered unsigned comparisons to signed ones; for unsigned
comparisons, only EQ/NE can be lowered.
The commit fixes that.
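A small illustration of why only equality survives (GNU vector extensions;
a hypothetical example, not from the patch): signed and unsigned compares
disagree once the sign bit is set, while EQ/NE are sign-agnostic.

typedef unsigned char uv16qi __attribute__ ((vector_size (16)));

int
must_stay_unsigned (void)
{
  uv16qi a = { 0x80 }, b = { 0x10 };
  /* 0x80 > 0x10 as unsigned, but a signed byte compare (vpcmpgtb) sees
     -128 > 16, which is false, so GTU must not be lowered to GT.  */
  return (a > b)[0] != 0;   /* 1; a signed lowering would give 0 */
}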
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
For Crestmont, 4-operand vex blendv instructions come from MSROM and
are slower than the 3-instruction sequence (op1 & mask) | (op2 & ~mask).
The legacy blendv instruction can still be handled by the decoder.
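For illustration (an intrinsics sketch, not the patch; assumes each mask byte
is 0x00 or 0xff and compilation with -mavx2): the byte select that vpblendvb
computes, written as the cheaper and/andn/or sequence the new tune prefers.

#include <immintrin.h>

static __m256i
select_bytes (__m256i op1, __m256i op2, __m256i mask)
{
  /* (op1 & mask) | (op2 & ~mask)  */
  return _mm256_or_si256 (_mm256_and_si256 (op1, mask),
                          _mm256_andnot_si256 (mask, op2));
}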
The patch adds a new tune which is enabled for all processors except
SRF/CWF. It will use vpan
According to Intel SOM[1], for Crestmont, most 256-bit Intel AVX2
instructions can be decomposed into two independent 128-bit
micro-operations, except for a subset of Intel AVX2 instructions,
known as cross-lane operations, which can only compute the result for an
element by utilizing one or more source
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
The patch generally improves SPEC2017 allrate geomean by 1% with
-march=sierraforest -Ofast on SRF.
Ready to push to trunk.
liuhongt (2):
[x86] Add new microarchitecture tune for SRF/GRR/CWF.
[x86] Add a new tune avx256_avoid_vec_perm for SRF.
gcc/testsuite/ChangeLog:
* gcc.dg/fstack-protector-strong.c: Adjust
scan-assembler-times.
* gcc.dg/graphite/scop-6.c: Add
-Wno-aggressive-loop-optimizations.
* gcc.dg/graphite/scop-9.c: Ditto.
* gcc.dg/tree-ssa/ivopts-lt-2.c: Add -fno-tree-vectorize.
>So should we adjust very-cheap to allow niter peeling as proposed or
>should we switch the default at -O2 to cheap?
I prefer the former.
Update in V2:
Adjust testcases after relaxing O2 vectorization.
Ok for trunk?
gcc/ChangeLog:
* tree-vect-loop.cc (vect_analyze_loop_costing): Enable
Return constm1_rtx when GET_MODE_CLASS (MODE) == MODE_VECTOR_INT.
Otherwise NULL_RTX.
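A minimal sketch of such a definition (the actual macro in the patch may differ):

/* Integer vector comparisons on x86 produce all-ones elements for "true".  */
#define VECTOR_STORE_FLAG_VALUE(MODE) \
  (GET_MODE_CLASS (MODE) == MODE_VECTOR_INT ? constm1_rtx : NULL_RTX)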
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/i386.h (VECTOR_STORE_FLAG_VALUE): New macro.
gcc/testsuite/ChangeLog:
* gcc.dg/rtl/x8
GCC 12 enables vectorization at O2 with the very-cheap cost model, which is
restricted to constant trip counts. The vectorization capacity is very
limited in consideration of the code-size impact.
The patch extends the very-cheap cost model a little bit to support variable
trip counts.
But still disable peel
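For illustration, a hypothetical loop (not from the patch) whose trip count
is only known at run time, the kind the extended very-cheap model can now
consider at plain -O2:

void
add_arrays (float *restrict a, const float *restrict b, int n)
{
  for (int i = 0; i < n; i++)   /* n is not a compile-time constant */
    a[i] += b[i];
}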
According to the Intel Software Optimization Manual[1], the Redwood Cove
microarchitecture supports LD+OP and MOV+OP macro fusion.
The patch enables the MOV+OP tune for GNR.
[1]
https://www.intel.com/content/www/us/en/content-details/814198/intel-64-and-ia-32-architectures-optimization-reference-manual
It fixes the regression by
a51f2fc0d80869ab079a93cc3858f24a1fd28237 is the first bad commit
commit a51f2fc0d80869ab079a93cc3858f24a1fd28237
Author: liuhongt
Date: Wed Sep 4 15:39:17 2024 +0800
Handle const0_operand for *avx2_pcmp3_1.
caused
FAIL: gcc.target/i386/pr59539-1.c scan-assembler
*_eq3_1 supports
nonimm_or_0_operand for op1 and op2, but pass_combine would fail to lower
the avx512 comparison back to the avx2 one when op1/op2 is const0_rtx,
because the splitter only supports nonimmediate_operand.
Failed to match this instruction:
(set (reg/i:V16QI 20 xmm0)
(vec_merge:V16QI (con
> Can the above loop be a part of ix86_check_avx_upper_register, so this
> function would scan the full RTX for avx upper register?
Changed; also adjusted ix86_check_avx_upper_stores and ix86_avx_u128_mode_needed
to either inline the old ix86_check_avx_upper_register or replace
FOR_EACH_SUBRTX
with
For function arguments/returns, when it's BLK mode, it's put in a
parallel with an expr_list, and the expr_list contains the real mode
and registers.
The current ix86_check_avx_upper_register only checked for SSE_REG_P, and
failed to handle that. The patch extends the check to each subrtx.
Bootstrapped
> You are possibly overwriting src_related_elt - I'd suggest to either break
> here or do the loop below for each found elt?
Changed.
> Do we know that will always succeed?
1) validate_subreg allows subreg for 2 vector modes with same component modes.
2) gen_lowpart in cse.cc is defined as gen_low
For mode2 bigger than 16 bytes, when it can be allocated to FIRST_SSE_REGS,
then it can only be allocated to ALL_SSE_REGS, and it can be tieable
to any mode1 with smaller size which is available to FIRST_SSE_REGS.
When the mode size is equal to 16 bytes, exclude non-vector modes (TI/TFmode).
This is need fo
Also try to handle redundant broadcasts when there's already a
broadcast to a bigger mode with exactly the same component value.
For broadcast, component mode needs to be the same.
For all-zeros/ones, only need to check the bigger mode.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} and
For mode2 bigger than 16 bytes, when it can be allocated to FIRST_SSE_REGS,
then it can only be allocated to ALL_SSE_REGS, and it can be tieable
to any mode1 with smaller size which is available to FIRST_SSE_REGS.
When the mode size is equal to 16 bytes, exclude non-vector modes (TI/TFmode).
This is need fo
Looks like -mprefer-vector-width=128 doesn't impact store_max/mov_max
for the GCC 13/GCC 12 branches, so explicitly use -mmove-max=128 and
-mstore-max=128 for those testcases.
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pieces-memcpy-10.c: Use -mmove-max=256 and
-mst
When none of -mprefer-vector-width, avx256_optimal/avx128_optimal, or
avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC will
set ix86_{move_max,store_max} to the maximum available vector length,
except for the AVX part.
if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
&&
>From [1]
> > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > source, especially since ix86_expand_vector_mode does have code to
> > cope with constant operand[1]s. emit_move_insn_1 doesn't check the
> > predicates anyway, so the predicate will have little effect.
> >
> > A
It results in 2 failures for x86_64-pc-linux-gnu{\
-march=cascadelake};
gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o
gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1
For pr113560.c, now GCC generates mulx instead of mulq with
-march=cascadelake, which should be optimal,
> Are there any assumptions that BB_HEAD must be a note or label?
> Maybe we should move ix86_align_loops into a separate pass and insert
> the pass just before pass_final.
The patch inserts .p2align after the endbr pass; it can also fix the issue.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m3
Ok for trunk?
---
htdocs/gcc-14/changes.html| 7 +++
htdocs/gcc-14/porting_to.html | 9 +
2 files changed, 16 insertions(+)
diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index ca4cae0f..b023a4b9 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/ch
(insn 98 94 387 2 (parallel [
(set (reg:TI 337 [ _32 ])
(ashift:TI (reg:TI 329)
(reg:QI 521)))
(clobber (reg:CC 17 flags))
]) "test.c":11:13 953 {ashlti3_doubleword}
is reloaded into
(insn 98 452 387 2 (parallel [
(se
For the pattern below, RA may still allocate r162 as a v/k register and try
to reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi,
which results in a linker error.
(set (reg:DI 162)
(mem/u/c:DI
(const:DI (unspec:DI
[(symbol_ref:DI ("a") [flags 0x60] )]
ix86_hardreg_mov_ok was added by r11-5066-gbe39636d9f68c4
>The solution proposed here is to have the x86 backend/recog prevent
>early RTL passes composing instructions (that set likely_spilled hard
>registers) that they (combine) can't simplify, until after reload.
>We allow sets fr
> Also, in case the insn is deleted, do:
>
> emit_note (NOTE_INSN_DELETED);
>
> DONE;
>
> instead of leaving (const_int 0) in the stream.
>
> So, the above insn preparation statements should read:
>
> --cut here--
> if (constm1_operand (operands[2], mode))
> emit_move_insn (operands[0], operands[
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/115843
* config/i386/predicates.md (const0_or_m1_operand): New
predicate.
* config/i386/sse.md (*_store_mask_1): New
pre_reload define_insn_and_split.
>- _5 = __atomic_fetch_or_8 (&set_work_pending_p, 1, 0);
>- # DEBUG old => (long int) _5
>+ _6 = .ATOMIC_BIT_TEST_AND_SET (&set_work_pending_p, 0, 1, 0,
>__atomic_fetch_or_8);
>+ # DEBUG old => NULL
> # DEBUG BEGIN_STMT
>- # DEBUG D#2 => _5 & 1
>+ # DEBUG D#2 => NULL
>...
>- _10 = ~_5;
>-
I have a build failure on NetBSD as the namespace pollution avoidance causes
a direct hit with the system /usr/include/math.h
===
In file included from /usr/src/local/gcc/obj/gcc/include/emmintrin.h:31,
from
/usr
>> Hmm, now all avx512 tests SIGILL when testing with -m32:
>>
>> Dump of assembler code for function __get_cpuid_count:
>> => 0x08049500 <+0>: kmovd %eax,%k2
>> 0x08049504 <+4>: kmovd %edx,%k1
>> 0x08049508 <+8>: pushf
>> 0x08049509 <+9>: pushf
>> 0x0804950a <+10>:
From: "H.J. Lu"
>The above reads like it would be worth splitting branch_prediction_hints
>into branch_prediction_hints_taken and branch_prediction_hints_not_taken
>given not-taken is the default and thus will just increase code size?
>According to Intel® 64 and IA-32 Architectures Optimization Ref
The patch avoids a SIGILL on non-AVX512 machines due to kmovd being
generated in the dynamic check.
Committed as an obvious fix.
gcc/testsuite/ChangeLog:
PR target/115748
* gcc.target/i386/avx512-check.h: Move runtime check into a
separate function and guard it with target ("no-
From: "H.J. Lu"
According to Intel® 64 and IA-32 Architectures Optimization Reference
Manual[1], Branch Hint is updated for Redwood Cove.
cut from [1]-
Starting with the Redwood Cove microarchitecture, if the predictor has
no stored information about a branch, the
late_combine will combine lshift + zero into *lshifrtsi3_1_zext, which
causes an extra mov between gpr and kmask; add ?k to the pattern.
gcc/ChangeLog:
PR target/115610
* config/i386/i386.md (<*insnsi3_zext): Add alternative ?k,
enable it only for lshiftrt and under avx512bw.
Move pass_stv2 and pass_rpad after the pre_reload pass_late_combine, and also
define target_insn_cost to prevent the post_reload pass_late_combine from
reverting the optimization done in pass_rpad.
Adjust testcases since pass_late_combine generates better code but
breaks the scan-assembly patterns,
i.e.
Under 32-bit target, gcc
hen do the real operation.
After enabling late_combine, they're combined into embedded broadcast
operations.
Tested with SPEC2017, late_combine reduces codesize by ~0.6%, which means
there are lots of small improvements.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok
The testcases are supposed to scan for vpopcnt{b,w,d,q} operations
with a k mask, but the mask is defined as an uninitialized local variable
which will be set to 0 at the RTL expand phase.
It is then further simplified away by late_combine, which caused a
scan-assembly failure.
Move the definition of mask outside to
For the testcase in PR115406, here is part of the dump.
char D.4882;
vector(1) _1;
vector(1) signed char _2;
char _5;
:
_1 = { -1 };
When assigning { -1 } to vector(1) <signed-boolean:8>,
since TYPE_PRECISION (itype) <= BITS_PER_UNIT, it sets each bit of the dest
with each vector el
gcc/ChangeLog:
PR target/115517
* config/i386/mmx.md (vcondv2sf): Removed.
(vcond): Ditto.
(vcond): Ditto.
(vcondu): Ditto.
(vcondu): Ditto.
* config/i386/sse.md (vcond): Ditto.
(vcond): Ditto.
(vcond): Ditto.
(vcond):
> Richard suggests that we implement the "obvious" transforms like
> inversion in the middle-end but if for example unsigned compares
> are not supported the us_minus + eq + negative trick isn't on
> that list.
>
> The main reason to restrict vec_cmp would be to avoid
> a <= b ? c : d going with an
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*_cvtmask2_not): New pre_reload
splitter.
(*_cvtmask2_not): Ditto.
(*avx2_pcmp3_6): Ditto.
(*avx2_pcmp3_7): Ditto.
---
gcc/config/i386/sse.md | 97 ++
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*_movmsk_lt_avx512): New
define_insn_and_split.
(*_movmsk_ext_lt_avx512):
Ditto.
(*_pmovmskb_lt_avx512): Ditto.
(*_pmovmskb_zext_lt_avx512): Ditto.
(*sse2_pmovmskb_ext_lt_a
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Add define_insn_and_split for the optimization done in
ix86_expand_int_vcond.
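A scalar C illustration (the patterns handle the vector forms; relies on
GCC's arithmetic right shift of negative values):

int      all_ones_if_neg (int x) { return x >> 31; }            /* x < 0 ? -1 : 0 */
unsigned one_if_neg      (int x) { return (unsigned) x >> 31; }  /* x < 0 ?  1 : 0 */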
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md ("*ashr3_1"): New
define_insn_and_split.
These define_insn_and_split are needed after vcond{,u,eq} is obsolete.
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*_blendv_gt): New
define_insn_and_split.
(*_blendv_gtint):
Ditto.
(*_blendv_not_gtint):
Ditto.
(*_pb
O2
-march=x86-64 -O2
-march=sapphirerapids -O2
Didn't observe obvious performance change, mostly same binaries.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Any comments?
liuhongt (7):
[x86] Add more splitters to match (unspec [op1 op2 (gt op3
constm1_operand)] UNSPEC_BLE
These versions of the min/max patterns implement exactly the operations
min = (op1 < op2 ? op1 : op2)
max = (!(op1 < op2) ? op1 : op2)
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md (*minmax3_1): New pre_reload
define_insn_and_split.
(*minmax3_2): Ditto.
> But rtx_cost invokes targetm.rtx_cost which allows to avoid that
> recursive processing at any level. You're dealing with MEM [addr]
> here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
> the best way to deal with this? Since this is the MEM [addr] case
> we know it's not LEA, no?
416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
The commit adjusts the rtx_cost of mem to reduce the cost of (add op0 disp).
But the cost of ADDR could be cheaper than that of XEXP (addr, 0) when it's a lea.
That is the case in the PR; the patch uses the lower cost to enable more
simplification and fix th
Here's the patch committed.
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Move the optimization did in ix86_expand_int_vcond to match.pd
gcc/ChangeLog:
PR target/114189
* match.pd: Simplify a < 0 ? -1 : 0 to (signed) >> 31 and a
> I think the check for TYPE_UNSIGNED should be of TREE_TYPE (@0) rather
> than type here.
Changed
> Or maybe you need `types_match (type, TREE_TYPE (@0))` too.
And use tree_nop_conversion_p (type, TREE_TYPE (@0)) and add view_convert to
rshift.
Bootstrapped and regtested on x86_64-pc-linux-gnu
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Move the optimization did in ix86_expand_int_vcond to match.pd
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}, aarch64-linux-gnu.
Ok for trunk?
gcc/ChangeLog:
PR target/114189
The tune was added by PR79390 for SciMark2 on Broadwell.
For the latest GCC, with and without -mtune-ctrl=^one_if_conv_insn,
GCC will generate the same binary for SciMark2. And for SPEC2017,
there's no big impact on SKX/CLX/ICX, and small improvements on SPR
and later.
gcc/ChangeLog:
* co
Use reg_or_subregno instead.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Committed as an obvious patch.
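For reference, a sketch of what reg_or_subregno does (illustrative, written
here as a local helper, not the patch): REGNO requires a plain REG, so calling
it on (subreg (reg)) ICEs, while looking through the SUBREG first works for both.

static unsigned int
regno_of (rtx x)
{
  if (SUBREG_P (x))
    x = SUBREG_REG (x);
  gcc_assert (REG_P (x));
  return REGNO (x);
}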
gcc/ChangeLog:
PR target/115452
* config/i386/i386-features.cc (scalar_chain::convert_op): Use
reg_or_subregno instead of REGNO to avoid ICE.
gcc/testsui
r15-1100-gec985bc97a0157 improves the handling of ternlog instructions;
now GCC can recognize lots of pternlog_operands with different
variants.
The patch adjusts rtx_costs for that, so pass_combine can
reasonably generate more optimal vpternlog instructions,
i.e.
for avx512f-vpternlog-3.c, with the pa
>
> I think if you only handle CONST_INT_P, you should check just for that, and
> in both places where you check for CONST_VECTOR_DUPLICATE_P (there is one
> spot 2 lines above this).
> So add
> && CONST_INT_P (XVECEXP (XEXP (op0, 1), 0, 0))
> and
> && CONST_INT_P (XVECEXP (op1, 0, 0))
> tests righ
In theory, const_wide_int can also be handled with an extra check for each
component of the HOST_WIDE_INT array, and the check is needed for both the
shift and bit_and operands.
I assume the optimization opportunity is rare, so the patch just adds an
extra check to make sure GET_MODE_INNER (mode) can fit into
gcc/testsuite/ChangeLog:
* gcc.dg/vect/pr112325.c: Add additional option --param
max-completely-peeled-insns=200 for power64*-*-*.
---
gcc/testsuite/gcc.dg/vect/pr112325.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/gcc/testsuite/gcc.dg/vect/pr112325.c
b/gcc/testsuite/gcc.
For power10, there are 3 extra REG_EQUIV notes with (fix:SI. To avoid
the failure, check that (fix:SI is from the pattern, not the NOTE.
gcc/testsuite/ChangeLog:
PR target/115365
* gcc.dg/pr100927.c: Don't scan fix:SI from the note.
---
gcc/testsuite/gcc.dg/pr100927.c | 2 +-
1 file changed
> Can you add a testcase for this? I don't mind if it's x86 specific and
> does a bit of asm scanning.
>
> Also note that the context for this patch has changed, so it won't
> automatically apply. So be extra careful when updating so that it goes
> into the right place (all the more reason to hav
Committed as an obvious patch.
gcc/testsuite/ChangeLog:
PR target/115299
* gcc.target/i386/pr86722.c: Also scan for blendvpd.
---
gcc/testsuite/gcc.target/i386/pr86722.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gcc/testsuite/gcc.target/i386/pr86722.c
b/gc
W/o TARGET_SSE4_1, it takes 3 instructions (pand, pandn and por) for
movdfcc/movsfcc, and it could possibly fail the cost comparison. Increasing
the branch cost could hurt performance for other modes, so specially add
some preference for floating-point ifcvt.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-
Committed as an obvious patch.
gcc/ChangeLog:
* config/i386/emmintrin.h (__double_u): Rename from double_u.
(_mm_load_sd): Replace double_u with __double_u.
(_mm_store_sd): Ditto.
(_mm_loadh_pd): Ditto.
(_mm_loadl_pd): Ditto.
* config/i386/xmmintrin
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (vcond_mask_): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr114125.c: New test.
---
gcc/config/i386/sse.md | 20
> IMO, there is no need for CONST_INT_P condition, we should also allow
> symbol_ref, label_ref and const (all allowed by
> x86_64_immediate_operand predicate), these all decay to an immediate
> value.
Changed.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk.
For MEM, rtx_
When I applied Roger's patch [1], there was an ICE due to it.
The patch fixes the latent bug.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651365.html
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Pushed to trunk.
gcc/ChangeLog:
* config/i386/sse.md
(___mask): Ali
For MEM, rtx_cost iterates over each subrtx and adds up the costs,
so for MEM (reg) and MEM (reg + 4), the former costs 5 and
the latter costs 9, which is not accurate for x86. Ideally
address_cost should be used, but it reduces the cost too much.
So the current solution is to make a constant disp as cheap as possible.
B
Update in V2:
Guard constant folding of overflowing values in
fold_convert_const_int_from_real with flag_trapping_math.
Add -fno-trapping-math to related testcases which warn about overflow
in the conversion from floating point to integer.
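An illustration of the kind of conversion involved (a hypothetical example,
not one of the testcases):

int
overflow_conv (void)
{
  /* 1e30 is not representable in int; with -ftrapping-math the invalid
     operation exception may be raised at run time, so the front end must
     not fold this to a constant; -fno-trapping-math permits the fold.  */
  return (int) 1.0e30;
}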
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for tr
Committed as an obvious patch.
gcc/testsuite/ChangeLog:
PR target/114148
* gcc.target/i386/pr106010-7b.c: Refine testcase.
---
gcc/testsuite/gcc.target/i386/pr106010-7b.c | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/gcc/testsuite/gcc.target/i386/
Update in V3:
> Since this was about vectorization can you instead add a testcase to
> gcc.dg/vect/ and check for
> vectorization to happen?
Move to vect/pr112325.c.
>
> I believe the if (unr_insn <= 0) check can go as well.
Removed.
> as said, you want to do
>
> curolli = false;
>
> aft
>> Hard to find a default value satisfying all testcases.
>> some require loop unroll with 7 insns increment, some don't want loop
>> unroll w/ 5 insn increment.
>> The original 2/3 reduction happened to meet all those testcases(or the
>> testcases are constructed based on the old 2/3).
>> Can we d
According to the IEEE standard, for conversions from floating point to
integer: when a NaN or infinite operand cannot be represented in the
destination format and this cannot otherwise be indicated, the invalid
operation exception shall be signaled. When a numeric operand would
convert to an integer ou
For CONST_VECTOR_DUPLICATE_P in the constant pool, it is just a broadcast or
one of the variants in ix86_vector_duplicate_simode_const.
Adjust the cost to COSTS_N_INSNS (2) + speed, which should be a little
bit larger than a broadcast.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/Chang
When the mask is (1 << (prec - imm)) - 1, which is used to clear the upper
bits of A, then it can be simplified to LSHIFTRT.
i.e. simplify
(and:v8hi
  (ashiftrt:v8hi A 8)
  (const_vector 0xff x8))
to
(lshiftrt:v8hi A 8)
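A C-level illustration of the equivalence (GNU vector extensions; a sketch
only, with assumed types):

typedef short v8hi __attribute__ ((vector_size (16)));

v8hi
sra_and (v8hi a)
{
  /* Masking away the sign-extended bits of an arithmetic shift leaves
     exactly what a logical shift by the same amount produces.  */
  return (a >> 8) & 0xff;
}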
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
For vec_pack_truncv8si/v4si w/o AVX512,
(const_vector:v4si (const_int 0x) x4) is used as a mask to clear the
upper 16 bits, but vpblendw with a zero vector can also be used, and the
zero vector is cheaper than (const_vector:v4si (const_int 0x) x4).
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m3
pshufb is available under TARGET_SSSE3, so
ix86_expand_vec_perm_const_1 must return true when TARGET_SSSE3.
w/o TARGET_SSSE3, if we set one_operand_p to true, ix86_expand_vec_perm_const_1
could return false.
With the patch under -march=x86-64-v2
v8qi
foo (v8qi a)
{
return a >> 5;
}
< pm
Since there is no corresponding instruction, the shift operation for
vector int8 is implemented using the instructions for vector int16,
but for some special shift counts, it can be transformed into vpcmpgtb.
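For illustration (GNU vector extensions): an arithmetic right shift by 7 of
a signed-char vector yields -1 for negative lanes and 0 otherwise, which is
exactly what vpcmpgtb of a zero vector against the operand computes, so no
16-bit emulation sequence is needed for this shift count.

typedef signed char v16qi __attribute__ ((vector_size (16)));

v16qi
sra7 (v16qi a)
{
  return a >> 7;   /* can be emitted as vpcmpgtb with a zero operand */
}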
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/Chang
As the testcase in the PR shows, at O3 cunrolli may prevent vectorization of
the innermost loop and increase register pressure.
The patch removes the 1/3 reduction of unr_insn for the innermost loop for UL_ALL.
ul != UL_ALL is needed since some small-loop complete unrolling at O2 relies
on the reduction.
Bootstrappe
The Fortran standard does not specify what the result of the MAX
and MIN intrinsics is if one of the arguments is a NaN. So it
should be OK to transform the reduction for IFN_COND_MIN with vectorized
COND_MIN and REDUC_MIN.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk and bac
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
Ready to push to trunk.
gcc/ChangeLog:
PR target/113090
* config/i386/i386-expand.cc
(expand_vec_perm_punpckldq_pshuf): New function.
(ix86_expand_vec_perm_const_1): Try
expand_vec_perm_punpckldq_pshuf f
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
PR target/113079
* config/i386/mmx.md (usdot_prodv8qi): New expander.
(sdot_prodv8qi): Ditto.
(udot_prodv8qi): Ditto.
(usdot_prodv4hi): Ditto.
(udot_prodv4
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ready to push to trunk.
gcc/ChangeLog:
* config/i386/sse.md (usdot_prodv*qi): Extend to VI1_AVX512
with vpmaddwd when avxvnni/avx512vnni is not available.
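For context, the kind of loop this expander targets (illustrative only): a
mixed-signedness byte dot product accumulated in int.

int
usdot (const unsigned char *a, const signed char *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}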
---
gcc/config/i386/sse.md | 55 +++---
The Intel Decimal Floating-Point Math Library is available as open-source on
Netlib[1].
[1] https://www.netlib.org/misc/intel/.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
libgcc/config/libbid/ChangeLog:
* bid128_fma.c (add_and_round): Fix bug: the result