[Bug target/115069] [14/15 regression] 8 bit integer vector performance regression, x86, between gcc-14 and gcc-13 using avx2 target clones on skylake platform

2024-05-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115069 --- Comment #5 from Hongtao Liu --- (In reply to Krzysztof Kanas from comment #4) > I bisected the issue and it seems that commit > 0368fc54bc11f15bfa0ed9913fd0017815dfaa5d introduces regression. I guess the real guilty commit is commit

[Bug target/115116] New: [x86] rtx_cost is overestimated for big size memory.

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115116 Bug ID: 115116 Summary: [x86] rtx_cost is overestimated for big size memory. Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: missed-optimization Severity:

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514 Hongtao Liu changed: What|Removed |Added Resolution|--- |FIXED Status|NEW

[Bug target/115115] [12/13/14/15 Regression] highway-1.0.7 wrong _mm_cvttps_epi32() constant fold

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115115 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug middle-end/115101] New: [wrong code] with -O1 -floop-nest-optimize for gcc.dg/graphite/interchange-8.c

2024-05-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115101 Bug ID: 115101 Summary: [wrong code] with -O1 -floop-nest-optimize for gcc.dg/graphite/interchange-8.c Product: gcc Version: 15.0 Status: UNCONFIRMED

[Bug target/101017] ICE: Segmentation fault, convert_memory_address_addr_space_1 with vector_size(32) and target_clone arch=core-avx2/default

2024-05-13 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101017 Hongtao Liu changed: What|Removed |Added CC||haochen.jiang at intel dot com ---

[Bug target/114987] [14/15 Regression] floating point vector regression, x86, between gcc 14 and gcc-13 using -O3 and target clones on skylake platforms

2024-05-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987 --- Comment #6 from Hongtao Liu --- > I tried to move "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)" > and rebuilt the binary and it will save half the regression. 57.93 │200: vaddps 0xc0(%rsp),%ymm3,%ymm5

[Bug rtl-optimization/115021] New: [14/15 regression] unnecessary spill for vpternlog

2024-05-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021 Bug ID: 115021 Summary: [14/15 regression] unnecessary spill for vpternlog Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2024-05-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org

[Bug target/113090] Suboptimal vector permuation for 64-bit vector.

2024-05-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113090 Hongtao Liu changed: What|Removed |Added Resolution|--- |FIXED Status|NEW

[Bug target/113079] [x86] Fails to generate dot_prod instructions for 64-bit vector.

2024-05-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113079 Hongtao Liu changed: What|Removed |Added Status|NEW |RESOLVED Resolution|---

[Bug target/114943] X86 AVX2: inefficient code generated to convert SIMD Vectors

2024-05-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114943 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug libgcc/114907] __trunchfbf2 should be renamed to __extendhfbf2

2024-05-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114907 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883 --- Comment #10 from Hongtao Liu --- (In reply to Jakub Jelinek from comment #9) > Created attachment 58073 [details] > gcc14-pr114883.patch > > Full untested patch. This will fix 521.wrf_r ICE, and pass runtime validation.

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883 --- Comment #5 from Hongtao Liu --- (In reply to Hongtao Liu from comment #4) > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index a6cf0a5546c..ae6abe00f3e 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883 --- Comment #4 from Hongtao Liu --- diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index a6cf0a5546c..ae6abe00f3e 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8505,7 +8505,8 @@ vect_transform_reduction

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883 --- Comment #3 from Hongtao Liu --- Created attachment 58066 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58066&action=edit reproduced testcase gfortran -O2 -march=x86-64-v4 -fvect-cost-model=cheap.

[Bug tree-optimization/114883] [14/15 Regression] 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883 --- Comment #2 from Hongtao Liu --- (In reply to Andrew Pinski from comment #1) > Can you reduce the fortran code down for the ICE? It should not be hard, you > can use delta even. Let me try.

[Bug tree-optimization/114883] New: 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap

2024-04-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114883 Bug ID: 114883 Summary: 521.wrf_r ICE with -O2 -march=sapphirerapids -fvect-cost-model=cheap Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal

[Bug target/110621] x86_64: Test gcc.target/i386/pr105354-2.c fails with -fstack-protector

2024-04-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110621 Hongtao Liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|---

[Bug target/85048] [missed optimization] vector conversions

2024-04-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048 --- Comment #16 from Hongtao Liu --- (In reply to Matthias Kretz (Vir) from comment #15) > So it seems that if at least one of the vector builtins involved in the > expression is 512 bits GCC needs to locally increase prefer-vector-width to >

[Bug target/85048] [missed optimization] vector conversions

2024-04-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731 --- Comment #7 from Hongtao Liu --- (In reply to Hongtao Liu from comment #4) > (In reply to Hongtao Liu from comment #3) > > Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look. > > Oh, ix86_vect_estimate_reg_pressure

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731 --- Comment #4 from Hongtao Liu --- (In reply to Hongtao Liu from comment #3) > Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look. Oh, ix86_vect_estimate_reg_pressure is only for loop, BB vectorizer only use

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591 --- Comment #16 from Hongtao Liu --- > > /* See if a MEM has already been loaded with a widening operation; > if it has, we can use a subreg of that. Many CISC machines > also have such operations, but

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591 --- Comment #15 from Hongtao Liu --- > I don't see this as problematic. IIRC, there was a discussion in the past > that a couple (two?) memory accesses from the same location close to each > other can be faster (so, -O2, not -Os) than

[Bug middle-end/110027] [11/12/13/14 regression] Stack objects with extended alignments (vectors etc) misaligned on detect_stack_use_after_return

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027 --- Comment #19 from Hongtao Liu --- (In reply to Jakub Jelinek from comment #17) > Both of the posted patches are incorrect, this needs to be fixed in > asan_emit_stack_protection, account for the different offsets[0] which > happens when a

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591 --- Comment #12 from Hongtao Liu --- short a; short c; short d; void foo (short b, short f) { c = b + a; d = f + a; } foo(short, short): addw a(%rip), %di addw a(%rip), %si movw %di, c(%rip) movw

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591 --- Comment #11 from Hongtao Liu --- unsigned v; long long v2; char foo () { v2 = v; return v; } This is related to *movqi_internal, and codegen has been worse since gcc8.1 foo: movl v(%rip), %eax movq %rax,

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591 --- Comment #9 from Hongtao Liu --- > > It looks that different modes of memory read confuse LRA to not CSE the read. > > IMO, if the preloaded value is later accessed in different modes, LRA should > leave it. Alternatively, LRA should CSE

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591 --- Comment #5 from Hongtao Liu --- > My experience is memory cost for the operand with rm or separate r, m is > different which impacts RA decision. > > https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html Change operands[1]

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug tree-optimization/66862] OpenMP SIMD does not work (use SIMD instructions) on conditional code

2024-04-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66862 Hongtao Liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|---

[Bug target/113288] [i386] Missing #define for -mavx10.1-256 and -mavx10.1-512

2024-04-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113288 Hongtao Liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|---

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544 --- Comment #3 from Hongtao Liu --- <__umodti3>: ... 37 58: 66 48 0f 6e c7 movq %rdi,%xmm0 38 5d: 66 48 0f 6e d6 movq %rsi,%xmm2 39 62: 66 0f 6c c2 punpcklqdq %xmm2,%xmm0 40 66:

[Bug target/114570] New: GCC doesn't perform good loop invariant code motion for very long vector operations.

2024-04-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114570 Bug ID: 114570 Summary: GCC doesn't perform good loop invariant code motion for very long vector operations. Product: gcc Version: 14.0 Status: UNCONFIRMED

[Bug rtl-optimization/114556] New: weird loop unrolling when there's attribute aligned in side the loop

2024-04-02 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114556 Bug ID: 114556 Summary: weird loop unrolling when there's attribute aligned in side the loop Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544 --- Comment #2 from Hongtao Liu --- Also for void foo2 (v128_t* a, v128_t* b) { c = (*a & *b)+ *b; } (insn 9 8 10 2 (set (reg:V1TI 108 [ _3 ]) (and:V1TI (reg:V1TI 99 [ _2 ]) (mem:V1TI (reg:DI 113) [1 *a_6(D)+0 S16

[Bug target/114544] [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544 --- Comment #1 from Hongtao Liu --- ;; Turn SImode or DImode extraction from arbitrary SSE/AVX/AVX512F ;; vector modes into vec_extract*. (define_split [(set (match_operand:SWI48x 0 "nonimmediate_operand")

[Bug target/114544] New: [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1))

2024-04-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114544 Bug ID: 114544 Summary: [x86] stv should transform (subreg DI (V1TI) 8) as (vec_select:DI (V2DI) (const_int 1)) Product: gcc Version: 14.0 Status: UNCONFIRMED

[Bug target/114514] v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514 --- Comment #3 from Hongtao Liu --- (In reply to Andrew Pinski from comment #1) > Confirmed. > > Note non sign bit can be improved too: > ``` I assume you're talking about broadcast from imm or directly from constant pool. GCC chooses the

[Bug target/114514] New: v16qi >> 7 can be optimized with vpcmpgtb

2024-03-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114514 Bug ID: 114514 Summary: v16qi >> 7 can be optimized with vpcmpgtb Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component:
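The identity behind this report can be sketched with GCC's generic vector extensions. This is only an illustration under stated assumptions: the type name `v16qi` is taken from the bug title, while the function names `shift_sign` and `compare_sign` are invented here, and nothing below claims to show what instructions GCC actually emits. An arithmetic shift of a signed byte by 7 yields -1 for negative lanes and 0 otherwise, which is the same mask a signed "0 > x" compare (the operation vpcmpgtb performs) produces.

```c
/* 16 signed bytes in one 128-bit vector (GCC/Clang vector extension). */
typedef signed char v16qi __attribute__((vector_size(16)));

/* Form from the bug title: arithmetic shift right of every lane by 7. */
v16qi shift_sign(v16qi x)
{
    return x >> 7;
}

/* Equivalent form: lane-wise signed compare 0 > x.
   A true lane is all ones (-1), a false lane is 0. */
v16qi compare_sign(v16qi x)
{
    v16qi zero = {0};
    return zero > x;
}
```

Both functions should return identical vectors for any input, which is what makes the compare a candidate replacement for the shift.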

[Bug tree-optimization/114471] [14 regression] ICE when building liblc3-1.0.4 with -fno-vect-cost-model -march=x86-64-v4

2024-03-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114471 --- Comment #6 from Hongtao Liu --- (In reply to Hongtao Liu from comment #5) > Maybe we should always use kmask under AVX512, currently only >= 128-bits > vector of vector _Float16 use kmask, < 128 bits vector still use vector mask. > and we

[Bug tree-optimization/114471] [14 regression] ICE when building liblc3-1.0.4 with -fno-vect-cost-model -march=x86-64-v4

2024-03-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114471 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429 Hongtao Liu changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|---

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429 --- Comment #2 from Hongtao Liu --- (In reply to Hongtao Liu from comment #1) > when x is INT_MIN, I assume -x is UD, so compiler can do anything. > otherwise, (-x) >> 31 is just x > 0. > From rtl view. neg of INT_MIN is assumed to 0 after it's
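A per-lane scalar sketch of the equivalence discussed in this thread follows. It is hedged in two ways: it assumes `x != INT32_MIN` (the case the comment sets aside), and it relies on GCC's documented behavior that `>>` on a negative signed value is an arithmetic shift. The function names are made up for this example. In mask form, "x > 0" means -1 when true and 0 when false, which is what a vector compare would produce.

```c
#include <stdint.h>

/* Original form: negate, then arithmetic shift right by 31.
   Assumes x != INT32_MIN and an arithmetic shift for negative values. */
int32_t neg_then_shift(int32_t x)
{
    return (-x) >> 31;
}

/* Equivalent mask form: -1 when x > 0, 0 otherwise. */
int32_t greater_than_zero_mask(int32_t x)
{
    return -(int32_t)(x > 0);
}
```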

[Bug target/114429] [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429 Hongtao Liu changed: What|Removed |Added Target||x86_64-*-* i?86-*-* --- Comment #1 from

[Bug target/114429] New: [x86] (neg a) ashifrt>> 31 can be optimized to a > 0.

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114429 Bug ID: 114429 Summary: [x86] (neg a) ashifrt>> 31 can be optimized to a > 0. Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3

[Bug target/114428] New: [x86] psrad xmm, xmm, 16 and pand xmm, const_vector (0xffff x4) can be optimized to psrld

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114428 Bug ID: 114428 Summary: [x86] psrad xmm, xmm, 16 and pand xmm, const_vector (0xffff x4) can be optimized to psrld Product: gcc Version: 14.0 Status: UNCONFIRMED
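The summary's claim can be checked one 32-bit lane at a time. The sketch below is a scalar model only, with invented function names, and it assumes GCC's arithmetic right shift for negative signed values; it does not reproduce the testcase from the bug. An arithmetic shift by 16 followed by a mask with 0xffff discards exactly the sign-extended upper half, so the result matches a logical shift by 16.

```c
#include <stdint.h>

/* Models psrad by 16 followed by pand with 0xffff on one lane. */
uint32_t sra16_then_mask(int32_t x)
{
    return (uint32_t)(x >> 16) & 0xffffu;
}

/* Models psrld by 16 on the same lane. */
uint32_t srl16(int32_t x)
{
    return (uint32_t)x >> 16;
}
```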

[Bug target/114427] New: [x86] ec_pack_truncv8si/v4si can be optimized with pblendw instead of pand for AVX2 target

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114427 Bug ID: 114427 Summary: [x86] ec_pack_truncv8si/v4si can be optimized with pblendw instead of pand for AVX2 target Product: gcc Version: 14.0 Status: UNCONFIRMED

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 Hongtao Liu changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|---

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #20 from Hongtao Liu --- (In reply to JuzheZhong from comment #19) > I think it's better to add pr114396.c into vect testsuite instead of x86 > target test since it's the bug not only happens on x86. Sure, there's no target

[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080 --- Comment #9 from Hongtao Liu --- > If we were to expose that vpxor before postreload we'd likely CSE but > we have > > 5: xmm0:V4SI=const_vector > REG_EQUIV const_vector > 6: [`b']=xmm0:V4SI > 7: xmm0:V8HI=const_vector >

[Bug rtl-optimization/92080] Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2024-03-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #17 from Hongtao Liu --- > > > > The to_mpz args look like they could be mixing signs as well: > > I tried the below, and mixing signs looks to work well. Debugging shows step_expr is -5 and signed. short a = 0xF; short b[16]; unsigned

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 Hongtao Liu changed: What|Removed |Added Status|NEW |ASSIGNED --- Comment #16 from Hongtao

[Bug tree-optimization/114396] [13/14 Regression] Vector: Runtime mismatch at -O2 with -fwrapv since r13-7988-g82919cf4cb2321

2024-03-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114396 --- Comment #15 from Hongtao Liu --- (In reply to Richard Biener from comment #9) > (In reply to Robin Dapp from comment #8) > > No fallout on x86 or aarch64. > > > > Of course using false instead of TYPE_SIGN (utype) is also possible and > >

[Bug tree-optimization/67683] Missed vectorization: shifts of an induction variable

2024-03-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67683 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug middle-end/114347] wrong constant folding when casting __bf16 to int

2024-03-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114347 --- Comment #9 from Hongtao Liu --- (In reply to Richard Biener from comment #7) > (In reply to Jakub Jelinek from comment #6) > > You can use -fexcess-precision=16 if you don't want treating _Float16 and > > __bf16 as having excess precision.

[Bug target/114334] [14 Regression] ICE: in extract_insn, at recog.cc:2812 (unrecognizable insn and:HF?) with lroundf16() and -ffast-math -mavx512fp16

2024-03-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114334 Hongtao Liu changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|---

[Bug tree-optimization/66862] OpenMP SIMD does not work (use SIMD instructions) on conditional code

2024-03-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66862 --- Comment #5 from Hongtao Liu --- > Now, it seems AVX512BW (and AVX512VL in some cases) has the needed > instructions, > in particular VMOVDQU{8,16}, but it is not reflected in maskload and > maskstore expanders. CCing Kyrill and Uros on

[Bug target/114334] [14 Regression] ICE: in extract_insn, at recog.cc:2812 (unrecognizable insn and:HF?) with lroundf16() and -ffast-math -mavx512fp16

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114334 Hongtao Liu changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed|

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027 --- Comment #15 from Hongtao Liu --- A patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647604.html

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822 Hongtao Liu changed: What|Removed |Added Status|NEW |RESOLVED Resolution|---

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027 --- Comment #14 from Hongtao Liu --- diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc index 0de299c62e3..92062378d8e 100644 --- a/gcc/cfgexpand.cc +++ b/gcc/cfgexpand.cc @@ -1214,7 +1214,7 @@ expand_stack_vars (bool (*pred) (size_t), class

[Bug libgcc/111731] [13/14 regression] gcc_assert is hit at libgcc/unwind-dw2-fde.c#L291

2024-03-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111731 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027 --- Comment #13 from Hongtao Liu --- So the stack is like --- stack top -32 - (offset -32) -64 (32 bytes redzone) - (offset -64) -128 (64 bytes __m512) (offset -128) (32-bytes redzone) ---(offset

[Bug target/110027] [11/12/13/14 regression] Misaligned vector store on detect_stack_use_after_return

2024-03-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027 --- Comment #12 from Hongtao Liu --- (In reply to Sam James from comment #11) > Calling it a 11..14 regression as we know 14 is bad and 7.5 is OK, but I > can't test 11/12 on an avx512 machine right now. I can't reproduce that with 11/12, but

[Bug target/111822] [12/13/14 Regression] during RTL pass: lr_shrinkage ICE: in operator[], at vec.h:910 with -O2 -m32 -flive-range-shrinkage -fno-dce -fnon-call-exceptions since r12-5301-g04520645038

2024-03-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111822 --- Comment #16 from Hongtao Liu --- (In reply to Uroš Bizjak from comment #11) > (In reply to Richard Biener from comment #10) > > The easiest fix would be to refuse applying STV to a insn that > > can_throw_internal () (that's an insn that

[Bug d/114171] [13/14 Regression] gdc -O2 -mavx generates misaligned vmovdqa instruction

2024-02-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114171 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org Last

[Bug tree-optimization/114164] simdclone vectorization creates unsupported IL

2024-02-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114164 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325 --- Comment #16 from Hongtao Liu --- > I'm all for removing the 1/3 for innermost loop handling (in cunroll > the unrolled loop is then innermost). I'm more concerned about > unrolling more than one level which is exactly what's required for

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-27 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325 --- Comment #14 from Hongtao Liu --- (In reply to rguent...@suse.de from comment #13) > On Tue, 27 Feb 2024, liuhongt at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325 > > > > --- Comment #11 from Hongtao Liu

[Bug target/114125] Support vcond_mask_qiqi and friends.

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114125 Hongtao Liu changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED

[Bug target/114125] New: Support vcond_mask_qiqi and friends.

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114125 Bug ID: 114125 Summary: Support vcond_mask_qiqi and friends. Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325 --- Comment #11 from Hongtao Liu --- >Loop body is likely going to simplify further, this is difficult >to guess, we just decrease the result by 1/3. */ > This is introduced by r0-68074-g91a01f21abfe19 /* Estimate number of insns

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325 --- Comment #10 from Hongtao Liu --- (In reply to Hongtao Liu from comment #9) > The original case is a little different from the one in PR. But the issue is similar, after cunrolli, GCC failed to vectorize the outer loop. The interesting

[Bug tree-optimization/112325] Missed vectorization of reduction after unrolling

2024-02-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112325 --- Comment #9 from Hongtao Liu --- The original case is a little different from the one in PR. It comes from ggml #include #include typedef uint16_t ggml_fp16_t; static float table_f32_f16[1 << 16]; inline static float

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #11 from Hongtao Liu --- (In reply to N Schaeffer from comment #9) > In addition, optimizing for size with -Os leads to a non-vectorized > double-loop (51 bytes) while the vectorized loop with vbroadcastsd (produced > by clang -Os)

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #8 from Hongtao Liu --- (In reply to Hongtao Liu from comment #7) > perm_cost is very low in x86 backend, and it may be ok for 128-bit vectors, > pshufb/shufps are available for most cases. > But for 256/512-bit vectors, when the

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug tree-optimization/109885] gcc does not generate movmskps and testps instructions (clang does)

2024-02-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109885 --- Comment #4 from Hongtao Liu --- int sum() { int ret = 0; for (int i=0; i<8; ++i) ret +=(0==v[i]); return ret; } int sum2() { int ret = 0; auto m = v==0; for (int i=0; i<8; ++i) ret += m[i]; return ret; } For sum, gcc

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #57 from Hongtao Liu --- > For dg-do run testcases I really think we should avoid those -march= > options, because it means a lot of other stuff, BMI, LZCNT, ... Makes sense.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #45 from Hongtao Liu --- > > There's do_store_flag to fixup for uses not in branches and > > do_compare_and_jump for conditional jumps. > > reasonable enough for me. I mean we only handle it at consumers where the upper bits matter.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #44 from Hongtao Liu --- > > Note the AND is removed by combine if I add it: > > Successfully matched this instruction: > (set (reg:CCZ 17 flags) > (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0)) >

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #43 from Hongtao Liu --- > Well, yes, the discussion in this bug was whether to do this at consumers > (that's sth new) or with all mask operations (that's how we handle > bit-precision integer operations, so it might be relatively

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #39 from Hongtao Liu --- > > the question is whether that matches the semantics of GIMPLE (the padding > > is inverted, too), whether it invokes undefined behavior (don't do it - it > > seems for people using intrinsics that's what

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #38 from Hongtao Liu --- > I think we should also mask off the upper bits of variable mask? > > notl %esi > orl %esi, %edi > notl %edi > andl $15, %edi > je .L3 with

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-02-07 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #37 from Hongtao Liu --- (In reply to Richard Biener from comment #36) > For example with AVX512VL and the following, using -O -fgimple -mavx512vl > we get simply > > notl %esi > orl %esi, %edi > cmpb

[Bug target/113729] Missing APX NDD optimization

2024-02-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113729 --- Comment #2 from Hongtao Liu --- extern unsigned char b; int foo (void) { return (unsigned char)(200 + b); } gcc -O2 -mapxf foo(): subb $56, b(%rip), %al movzbl %al, %eax ret And this can be optimized to foo(): subb $56,
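
The snippet above shows the compiler turning `(unsigned char)(200 + b)` into `subb $56`. A minimal sketch of why the two forms agree, assuming 8-bit `unsigned char` (so 200 + 56 = 256 wraps to 0). The helper names `add_200`, `sub_56` and `forms_agree` are hypothetical and not part of the bug report, and `b` is passed as a parameter here instead of being read from the `extern` global:

```c
#include <limits.h>

/* Hypothetical helper: the expression as written in the report. */
unsigned char add_200(unsigned char b)
{
    return (unsigned char)(200 + b);
}

/* Hypothetical helper: the form the generated `subb $56` corresponds to. */
unsigned char sub_56(unsigned char b)
{
    return (unsigned char)(b - 56);
}

/* Returns 1 if both forms give the same result for every 8-bit input,
   0 otherwise. Assumes CHAR_BIT == 8. */
int forms_agree(void)
{
    for (int b = 0; b <= UCHAR_MAX; ++b) {
        if (add_200((unsigned char)b) != sub_56((unsigned char)b))
            return 0;
    }
    return 1;
}
```

Calling `forms_agree()` checks all 256 inputs exhaustively. This only illustrates the modulo-256 arithmetic; it says nothing about which instruction encoding the APX NDD form should use.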

[Bug target/113744] Unnecessary "m" constraint in *adddi_4

2024-02-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113744 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug target/113729] Missing APX NDD optimization

2024-02-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113729 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org --- Comment

[Bug target/113656] [x86] ICE in simplify_const_unary_operation, at simplify-rtx.cc:1954 with new -mavx10.1

2024-01-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113656 Hongtao Liu changed: What|Removed |Added CC||liuhongt at gcc dot gnu.org,

[Bug target/113600] [14 regression] 525.x264_r run-time regresses by 8% with PGO -Ofast -march=znver4

2024-01-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113600 --- Comment #6 from Hongtao Liu --- Guess explicit .REDUC_PLUS instead of original VEC_PERM_EXPR somehow impacts the store split decision.

[Bug target/113600] [14 regression] 525.x264_r run-time regresses by 8% with PGO -Ofast -march=znver4

2024-01-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113600 --- Comment #5 from Hongtao Liu --- It looks like x264_pixel_satd_16x16 consumes more time after my commit, an extracted case is as below, note there's no attribute((always_inline)) in the original x264_pixel_satd_8x4, it's added to force

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #28 from Hongtao Liu --- I saw we already mask off integral modes for vector mask in store_constructor /* Use sign-extension for uniform boolean vectors with integer modes and single-bit mask entries.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-29 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #25 from Hongtao Liu --- (In reply to Tamar Christina from comment #24) > Just to avoid confusion, are you still working on this one Richi? I'm working on a patch to add a target hook as #c18 mentioned.

[Bug tree-optimization/113576] [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c

2024-01-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576 --- Comment #22 from Hongtao Liu --- typedef unsigned long mp_limb_t; typedef long mp_size_t; typedef unsigned long mp_bitcnt_t; typedef mp_limb_t *mp_ptr; typedef const mp_limb_t *mp_srcptr; #define GMP_LIMB_BITS (sizeof(mp_limb_t) * 8)

[Bug target/113609] EQ/NE comparison between avx512 kmask and -1 can be optimized with kxortest with checking CF.

2024-01-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113609 --- Comment #1 from Hongtao Liu --- Since they're different modes, CCZ for cmp, but CCS for kortest, it could be difficult to optimize it in RA stage by adding alternatives (like we did for comparison with 0). So the easy way could be adding peephole
