https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069
--- Comment #5 from Hongtao Liu ---
(In reply to Krzysztof Kanas from comment #4)
> I bisected the issue and it seems that commit
> 0368fc54bc11f15bfa0ed9913fd0017815dfaa5d introduces regression.
I guess the real guilty commit is
commit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115116
Bug ID: 115116
Summary: [x86] rtx_cost is overestimated for big size memory.
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
Hongtao Liu changed:
What|Removed |Added
Resolution|--- |FIXED
Status|NEW
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115115
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115101
Bug ID: 115101
Summary: [wrong code] with -O1 -floop-nest-optimize for
gcc.dg/graphite/interchange-8.c
Product: gcc
Version: 15.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101017
Hongtao Liu changed:
What|Removed |Added
CC||haochen.jiang at intel dot com
---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
--- Comment #6 from Hongtao Liu ---
> I tried to move "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)"
> and rebuilt the binary and it will save half the regression.
57.93 │200: vaddps 0xc0(%rsp),%ymm3,%ymm5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021
Bug ID: 115021
Summary: [14/15 regression] unnecessary spill for vpternlog
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113090
Hongtao Liu changed:
What|Removed |Added
Resolution|--- |FIXED
Status|NEW
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113079
Hongtao Liu changed:
What|Removed |Added
Status|NEW |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114943
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114907
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883
--- Comment #10 from Hongtao Liu ---
(In reply to Jakub Jelinek from comment #9)
> Created attachment 58073 [details]
> gcc14-pr114883.patch
>
> Full untested patch.
This fixes the 521.wrf_r ICE and passes runtime validation.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883
--- Comment #5 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #4)
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index a6cf0a5546c..ae6abe00f3e 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883
--- Comment #4 from Hongtao Liu ---
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a6cf0a5546c..ae6abe00f3e 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8505,7 +8505,8 @@ vect_transform_reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883
--- Comment #3 from Hongtao Liu ---
Created attachment 58066
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58066&action=edit
reproduced testcase
gfortran -O2 -march=x86-64-v4 -fvect-cost-model=cheap.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883
--- Comment #2 from Hongtao Liu ---
(In reply to Andrew Pinski from comment #1)
> Can you reduce the fortran code down for the ICE? It should not be hard, you
> can use delta even.
Let me try.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883
Bug ID: 114883
Summary: 521.wrf_r ICE with -O2 -march=sapphirerapids
-fvect-cost-model=cheap
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110621
Hongtao Liu changed:
What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048
--- Comment #16 from Hongtao Liu ---
(In reply to Matthias Kretz (Vir) from comment #15)
> So it seems that if at least one of the vector builtins involved in the
> expression is 512 bits GCC needs to locally increase prefer-vector-width to
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731
--- Comment #7 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #4)
> (In reply to Hongtao Liu from comment #3)
> > Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.
>
> Oh, ix86_vect_estimate_reg_pressure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731
--- Comment #4 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #3)
> Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.
Oh, ix86_vect_estimate_reg_pressure is only for loop, BB vectorizer only use
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591
--- Comment #16 from Hongtao Liu ---
>
> 4952 /* See if a MEM has already been loaded with a widening operation;
> 4953 if it has, we can use a subreg of that. Many CISC machines
> 4954 also have such operations, but
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591
--- Comment #15 from Hongtao Liu ---
> I don't see this as problematic. IIRC, there was a discussion in the past
> that a couple (two?) memory accesses from the same location close to each
> other can be faster (so, -O2, not -Os) than
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027
--- Comment #19 from Hongtao Liu ---
(In reply to Jakub Jelinek from comment #17)
> Both of the posted patches are incorrect, this needs to be fixed in
> asan_emit_stack_protection, account for the different offsets[0] which
> happens when a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591
--- Comment #12 from Hongtao Liu ---
short a;
short c;
short d;
void
foo (short b, short f)
{
  c = b + a;
  d = f + a;
}
foo(short, short):
        addw    a(%rip), %di
        addw    a(%rip), %si
        movw    %di, c(%rip)
        movw
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591
--- Comment #11 from Hongtao Liu ---
unsigned v;
long long v2;
char foo ()
{
  v2 = v;
  return v;
}
This is related to *movqi_internal, and codegen has been worse since gcc 8.1
foo:
        movl    v(%rip), %eax
        movq    %rax,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591
--- Comment #9 from Hongtao Liu ---
>
> It looks that different modes of memory read confuse LRA to not CSE the read.
>
> IMO, if the preloaded value is later accessed in different modes, LRA should
> leave it. Alternatively, LRA should CSE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591
--- Comment #5 from Hongtao Liu ---
> My experience is memory cost for the operand with rm or separate r, m is
> different which impacts RA decision.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html
Change operands[1]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66862
Hongtao Liu changed:
What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113288
Hongtao Liu changed:
What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544
--- Comment #3 from Hongtao Liu ---
<__umodti3>:
...
37 58: 66 48 0f 6e c7 movq %rdi,%xmm0
38 5d: 66 48 0f 6e d6 movq %rsi,%xmm2
39 62: 66 0f 6c c2 punpcklqdq %xmm2,%xmm0
40 66:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114570
Bug ID: 114570
Summary: GCC doesn't perform good loop invariant code motion
for very long vector operations.
Product: gcc
Version: 14.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114556
Bug ID: 114556
Summary: weird loop unrolling when there's attribute aligned
inside the loop
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544
--- Comment #2 from Hongtao Liu ---
Also for
void
foo2 (v128_t* a, v128_t* b)
{
  c = (*a & *b) + *b;
}
(insn 9 8 10 2 (set (reg:V1TI 108 [ _3 ])
(and:V1TI (reg:V1TI 99 [ _2 ])
(mem:V1TI (reg:DI 113) [1 *a_6(D)+0 S16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544
--- Comment #1 from Hongtao Liu ---
20590;; Turn SImode or DImode extraction from arbitrary SSE/AVX/AVX512F
20591;; vector modes into vec_extract*.
20592(define_split
20593 [(set (match_operand:SWI48x 0 "nonimmediate_operand")
20594
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544
Bug ID: 114544
Summary: [x86] stv should transform (subreg DI (V1TI) 8) as
(vec_select:DI (V2DI) (const_int 1))
Product: gcc
Version: 14.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
--- Comment #3 from Hongtao Liu ---
(In reply to Andrew Pinski from comment #1)
> Confirmed.
>
> Note non sign bit can be improved too:
> ```
I assume you're talking about broadcast from imm or directly from constant
pool. GCC chooses the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514
Bug ID: 114514
Summary: v16qi >> 7 can be optimized with vpcmpgtb
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114471
--- Comment #6 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #5)
> Maybe we should always use kmask under AVX512; currently only >= 128-bit
> vectors of _Float16 use kmask, while < 128-bit vectors still use a vector
> mask.
>
and we
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114471
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429
Hongtao Liu changed:
What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429
--- Comment #2 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #1)
> when x is INT_MIN, I assume -x is UD, so compiler can do anything.
> otherwise, (-x) >> 31 is just x > 0.
> From rtl view. neg of INT_MIN is assumed to 0 after it's
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429
Hongtao Liu changed:
What|Removed |Added
Target||x86_64-*-* i?86-*-*
--- Comment #1 from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429
Bug ID: 114429
Summary: [x86] (neg a) ashiftrt 31 can be optimized to a > 0.
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114428
Bug ID: 114428
Summary: [x86] psrad xmm, xmm, 16 and pand xmm, const_vector
(0xffff x4) can be optimized to psrld
Product: gcc
Version: 14.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114427
Bug ID: 114427
Summary: [x86] vec_pack_truncv8si/v4si can be optimized with
pblendw instead of pand for AVX2 target
Product: gcc
Version: 14.0
Status: UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396
Hongtao Liu changed:
What|Removed |Added
Status|ASSIGNED|RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396
--- Comment #20 from Hongtao Liu ---
(In reply to JuzheZhong from comment #19)
> I think it's better to add pr114396.c into vect testsuite instead of x86
> target test since it's the bug not only happens on x86.
Sure, there's no target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
--- Comment #9 from Hongtao Liu ---
> If we were to expose that vpxor before postreload we'd likely CSE but
> we have
>
> 5: xmm0:V4SI=const_vector
> REG_EQUIV const_vector
> 6: [`b']=xmm0:V4SI
> 7: xmm0:V8HI=const_vector
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396
--- Comment #17 from Hongtao Liu ---
> >
> > The to_mpz args look like they could be mixing signs as well:
> >
I tried below; it looks like mixing signs works well.
Debug shows step_expr is -5 and signed.
short a = 0xF;
short b[16];
unsigned
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396
Hongtao Liu changed:
What|Removed |Added
Status|NEW |ASSIGNED
--- Comment #16 from Hongtao
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396
--- Comment #15 from Hongtao Liu ---
(In reply to Richard Biener from comment #9)
> (In reply to Robin Dapp from comment #8)
> > No fallout on x86 or aarch64.
> >
> > Of course using false instead of TYPE_SIGN (utype) is also possible and
> >
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114347
--- Comment #9 from Hongtao Liu ---
(In reply to Richard Biener from comment #7)
> (In reply to Jakub Jelinek from comment #6)
> > You can use -fexcess-precision=16 if you don't want treating _Float16 and
> > __bf16 as having excess precision.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114334
Hongtao Liu changed:
What|Removed |Added
Status|ASSIGNED|RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66862
--- Comment #5 from Hongtao Liu ---
> Now, it seems AVX512BW (and AVX512VL in some cases) has the needed
> instructions,
> in particular VMOVDQU{8,16}, but it is not reflected in maskload and
> maskstore expanders. CCing Kyrill and Uros on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114334
Hongtao Liu changed:
What|Removed |Added
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed|
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027
--- Comment #15 from Hongtao Liu ---
A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647604.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822
Hongtao Liu changed:
What|Removed |Added
Status|NEW |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027
--- Comment #14 from Hongtao Liu ---
diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 0de299c62e3..92062378d8e 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -1214,7 +1214,7 @@ expand_stack_vars (bool (*pred) (size_t), class
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027
--- Comment #13 from Hongtao Liu ---
So the stack is like
--- stack top
-32
- (offset -32)
-64 (32-byte redzone)
- (offset -64)
-128 (64-byte __m512)
(offset -128)
(32-byte redzone)
---(offset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027
--- Comment #12 from Hongtao Liu ---
(In reply to Sam James from comment #11)
> Calling it a 11..14 regression as we know 14 is bad and 7.5 is OK, but I
> can't test 11/12 on an avx512 machine right now.
I can't reproduce that with 11/12, but
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822
--- Comment #16 from Hongtao Liu ---
(In reply to Uroš Bizjak from comment #11)
> (In reply to Richard Biener from comment #10)
> > The easiest fix would be to refuse applying STV to a insn that
> > can_throw_internal () (that's an insn that
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114171
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
Last
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114164
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #16 from Hongtao Liu ---
> I'm all for removing the 1/3 for innermost loop handling (in cunroll
> the unrolled loop is then innermost). I'm more concerned about
> unrolling more than one level which is exactly what's required for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #14 from Hongtao Liu ---
(In reply to rguent...@suse.de from comment #13)
> On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
> >
> > --- Comment #11 from Hongtao Liu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114125
Hongtao Liu changed:
What|Removed |Added
Ever confirmed|0 |1
Status|UNCONFIRMED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114125
Bug ID: 114125
Summary: Support vcond_mask_qiqi and friends.
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #11 from Hongtao Liu ---
>Loop body is likely going to simplify further, this is difficult
>to guess, we just decrease the result by 1/3. */
>
This is introduced by r0-68074-g91a01f21abfe19
/* Estimate number of insns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #10 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #9)
> The original case is a little different from the one in PR.
But the issue is similar, after cunrolli, GCC failed to vectorize the outer
loop.
The interesting
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325
--- Comment #9 from Hongtao Liu ---
The original case is a little different from the one in PR.
It comes from ggml
#include
#include
typedef uint16_t ggml_fp16_t;
static float table_f32_f16[1 << 16];
inline static float
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #11 from Hongtao Liu ---
(In reply to N Schaeffer from comment #9)
> In addition, optimizing for size with -Os leads to a non-vectorized
> double-loop (51 bytes) while the vectorized loop with vbroadcastsd (produced
> by clang -Os)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
--- Comment #8 from Hongtao Liu ---
(In reply to Hongtao Liu from comment #7)
> perm_cost is very low in the x86 backend, and it may be ok for 128-bit
> vectors, pshufb/shufps are available for most cases.
> But for 256/512-bit vectors, when the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109885
--- Comment #4 from Hongtao Liu ---
int sum() {
  int ret = 0;
  for (int i = 0; i < 8; ++i) ret += (0 == v[i]);
  return ret;
}
int sum2() {
  int ret = 0;
  auto m = v == 0;
  for (int i = 0; i < 8; ++i) ret += m[i];
  return ret;
}
For sum, gcc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #57 from Hongtao Liu ---
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...
Make sense.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #45 from Hongtao Liu ---
> > There's do_store_flag to fixup for uses not in branches and
> > do_compare_and_jump for conditional jumps.
>
> reasonable enough for me.
I mean we only handle it at consumers where upper bits matters.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #44 from Hongtao Liu ---
>
> Note the AND is removed by combine if I add it:
>
> Successfully matched this instruction:
> (set (reg:CCZ 17 flags)
> (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #43 from Hongtao Liu ---
> Well, yes, the discussion in this bug was whether to do this at consumers
> (that's sth new) or with all mask operations (that's how we handle
> bit-precision integer operations, so it might be relatively
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #39 from Hongtao Liu ---
> > the question is whether that matches the semantics of GIMPLE (the padding
> > is inverted, too), whether it invokes undefined behavior (don't do it - it
> > seems for people using intrinsics that's what
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #38 from Hongtao Liu ---
> I think we should also mask off the upper bits of variable mask?
>
notl    %esi
orl     %esi, %edi
notl    %edi
andl    $15, %edi
je      .L3
with
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #37 from Hongtao Liu ---
(In reply to Richard Biener from comment #36)
> For example with AVX512VL and the following, using -O -fgimple -mavx512vl
> we get simply
>
> notl    %esi
> orl     %esi, %edi
> cmpb
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113729
--- Comment #2 from Hongtao Liu ---
extern unsigned char b;
int
foo (void)
{
  return (unsigned char)(200 + b);
}
gcc -O2 -mapxf
foo():
        subb    $56, b(%rip), %al
        movzbl  %al, %eax
        ret
And this can be optimized to
foo():
        subb    $56,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113744
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113729
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org
--- Comment
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113656
Hongtao Liu changed:
What|Removed |Added
CC||liuhongt at gcc dot gnu.org,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113600
--- Comment #6 from Hongtao Liu ---
I guess the explicit .REDUC_PLUS instead of the original VEC_PERM_EXPR somehow
impacts the store-split decision.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113600
--- Comment #5 from Hongtao Liu ---
It looks like x264_pixel_satd_16x16 consumes more time after my commit; an
extracted case is below. Note there's no attribute((always_inline)) in the
original x264_pixel_satd_8x4; it's added to force
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #28 from Hongtao Liu ---
I see we already mask off integral modes for vector masks in store_constructor
/* Use sign-extension for uniform boolean vectors with
integer modes and single-bit mask entries.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #25 from Hongtao Liu ---
(In reply to Tamar Christina from comment #24)
> Just to avoid confusion, are you still working on this one Richi?
I'm working on a patch to add a target hook as #c18 mentioned.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
--- Comment #22 from Hongtao Liu ---
typedef unsigned long mp_limb_t;
typedef long mp_size_t;
typedef unsigned long mp_bitcnt_t;
typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;
#define GMP_LIMB_BITS (sizeof(mp_limb_t) * 8)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113609
--- Comment #1 from Hongtao Liu ---
Since they're different modes, CCZ for cmp but CCS for kortest, it could be
difficult to optimize it in the RA stage by adding alternatives (like we did
for compare to 0). So the easy way could be adding peephole