from:"liuhongt at gcc dot gnu.org"

[Bug rtl-optimization/115021] [14 regression] unnecessary spill for vpternlog

2024-08-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Hongtao Liu  ---
Fixed in GCC15.

[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst

2024-08-01 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Hongtao Liu  ---
Fixed in GCC15.

[Bug tree-optimization/89749] Very odd vector constructor

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89749

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
  Known to work||12.1.0
 CC||liuhongt at gcc dot gnu.org
 Status|NEW |RESOLVED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC12 and above.

[Bug target/113744] Unnecessary "m" constraint in *adddi_4

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113744

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/115981] [14/15 Regression] Redundant vmovaps to itself after vmovups since r14-537

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115981

--- Comment #4 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #3)
> Created attachment 58786 [details]
> gcc15-pr115981.patch
> 
> Untested fix.  As since that commit it checks swap_commutative_operands_p:
> 1) CONST_VECTOR I think has commutative_operand_precedence -4
> 2) REG has commutative_operand_precedence -1 or -2
> 3) SUBREG of object has commutative_operand_precedence -3
> 4) VEC_DUPLICATE has commutative_operand_precedence 0
> Which means the VEC_DUPLICATE operand will always come first and whatever
> matches reg_or_0_operand will always come second, i.e. exactly not the order
> in the pattern, so we don't need to add another one, can just change order
> of this one.

Patch LGTM.
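
As a rough illustration of the ordering argument in the quoted comment, here is a toy C model. The enum and function names are made up for this sketch and are not GCC's API; only the precedence values come from comment #3 (REG is listed as -1 or -2, and -1 is picked here).

```c
#include <stdbool.h>

/* Toy stand-ins for the operand kinds named in the quoted comment. */
enum operand_kind { OP_CONST_VECTOR, OP_SUBREG, OP_REG, OP_VEC_DUPLICATE };

/* Precedence values as quoted in comment #3. */
int toy_precedence(enum operand_kind k)
{
    switch (k) {
    case OP_CONST_VECTOR:  return -4;
    case OP_SUBREG:        return -3;
    case OP_REG:           return -1;
    case OP_VEC_DUPLICATE: return 0;
    }
    return -5; /* not reached for the kinds above */
}

/* The higher-precedence operand goes first, so a swap is wanted when the
   first operand has lower precedence than the second. */
bool toy_should_swap(enum operand_kind first, enum operand_kind second)
{
    return toy_precedence(first) < toy_precedence(second);
}
```

Under this model a VEC_DUPLICATE operand never gets swapped behind a REG, SUBREG or CONST_VECTOR, which matches the claim that it always comes first.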

[Bug target/116122] [14/15 regression] __FLT16_MAX__ is defined even with -mno-sse2 on 32-bit x86

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116122

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Hongtao Liu  ---
Mentioned in GCC14 "Changes" and "Porting to" documentation.

[Bug target/85236] missing _mm256_atan2_ps

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85236

Hongtao Liu  changed:

   What|Removed |Added

 CC||binklings at 163 dot com

--- Comment #8 from Hongtao Liu  ---
*** Bug 116157 has been marked as a duplicate of this bug. ***

[Bug target/116157] AVX2 _mm256_exp_ps function is missing in the compiler

2024-07-31 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116157

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from Hongtao Liu  ---
There is no plan to support it in GCC.

*** This bug has been marked as a duplicate of bug 85236 ***

[Bug target/113744] Unnecessary "m" constraint in *adddi_4

2024-07-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113744

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |lingling.kong7 at gmail dot com
 Ever confirmed|0   |1
   Last reconfirmed|2024-02-04 00:00:00 |2024-07-31

--- Comment #4 from Hongtao Liu  ---
Then please remove the constraint from the pattern.

[Bug target/116122] [14/15 regression] __FLT16_MAX__ is defined even with -mno-sse2 on 32-bit x86

2024-07-28 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116122

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-07-29
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
> If this gcc change will not be reverted, it should be documented as a change
> in the gcc 14 "Changes" and "Porting to" documentation.

I'll add some documentation for that.

[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst

2024-07-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096

--- Comment #3 from Hongtao Liu  ---

> 
>  (define_insn "ashl3_doubleword"
>[(set (match_operand:DWI 0 "register_operand" "=,")
> -   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
> +   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0BC,r")
> (match_operand:QI 2 "nonmemory_operand" "c,c")))
> (clobber (reg:CC FLAGS_REG))]
>""
The patch is incomplete: it should also support integer 1, since pm1_operand
means 1 or -1.
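
To make the gap concrete, here is a minimal sketch with plain integers. The function names are hypothetical stand-ins, not GCC functions; the first models a "1 or -1" predicate and the second models a constraint that only accepts the all-ones constant.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a predicate that accepts the constants 1 and -1. */
bool toy_pm1_predicate(int64_t v)
{
    return v == 1 || v == -1;
}

/* Toy model of a constraint that accepts only the all-ones constant (-1). */
bool toy_all_ones_constraint(int64_t v)
{
    return v == -1;
}

/* True when a constant passes the predicate but the constant alternative
   of the constraint does not cover it. */
bool toy_uncovered_constant(int64_t v)
{
    return toy_pm1_predicate(v) && !toy_all_ones_constraint(v);
}
```

In this model the only uncovered constant is 1, which is the case the comment says the patch misses.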

[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst

2024-07-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096

Hongtao Liu  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |liuhongt at gcc dot gnu.org
 Status|NEW |ASSIGNED

[Bug rtl-optimization/116096] [15 Regression] during RTL pass: cprop_hardreg ICE: in extract_insn, at recog.cc:2848 (unrecognizable insn ashift:TI?) with -O2 -flive-range-shrinkage -fno-peephole2 -mst

2024-07-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116096

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #1)
> This is interesting.
> 
> After reload we have:
> ```
> (insn 450 93 97 2 (set (reg:QI 2 cx [521])
> (reg:QI 38 r10 [521])) "/app/example.cpp":11:13 91 {*movqi_internal}
>  (nil))
> (insn 97 450 385 2 (parallel [
> (set (reg:TI 4 si [orig:337 _32 ] [337])
> (ashift:TI (const_int 1671291085 [0x639de0cd])
> (reg:QI 2 cx [521])))
> (clobber (reg:CC 17 flags))
> ]) "/app/example.cpp":11:13 953 {ashlti3_doubleword}
>  (expr_list:REG_EQUIV (mem:TI (plus:DI (reg/f:DI 19 frame)
> (const_int -80 [0xffb0])) [2  S16 A128])
> (expr_list:REG_EQUAL (ashift:TI (const_int 1671291085 [0x639de0cd])
> (reg:QI 38 r10 [521]))
> (nil
> ```
It should already be an invalid insn after reload, since 1671291085 is not a
reg_or_pm1_operand. I guess reload doesn't check the predicate, but only checks
the constraint?

(define_insn "ashl3_doubleword"
  [(set (match_operand:DWI 0 "register_operand" "=,")
        (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
                    (match_operand:QI 2 "nonmemory_operand" "c,c")))
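
The mismatch between the predicate and the "n" constraint can be sketched with plain integers. This is only a toy model with made-up function names; it assumes "n" accepts any integer constant with a known value, while the predicate accepts only 1 and -1 as constants.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the "n" constraint: any known integer constant is accepted. */
bool toy_n_constraint(int64_t v)
{
    (void)v;
    return true;
}

/* Toy model of the constant part of reg_or_pm1_operand: only 1 and -1. */
bool toy_pm1_predicate(int64_t v)
{
    return v == 1 || v == -1;
}

/* True when a constant satisfies the constraint but not the predicate,
   i.e. the situation the comment suspects reload lets through. */
bool toy_constraint_only(int64_t v)
{
    return toy_n_constraint(v) && !toy_pm1_predicate(v);
}
```

With these definitions the constant from the dump, 1671291085, passes the constraint check alone but fails the predicate.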

Before reload it's OK:

(insn 98 94 387 2 (parallel [
            (set (reg:TI 337 [ _32 ])
                (ashift:TI (reg:TI 329)
                    (reg:QI 521)))
            (clobber (reg:CC 17 flags))
        ]) "test.c":11:13 953 {ashlti3_doubleword}

I'm testing the patch below, which can fix the issue.

diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md
index 7508d7a58bd..e1d70162b88 100644
--- a/gcc/config/i386/constraints.md
+++ b/gcc/config/i386/constraints.md
@@ -225,9 +225,8 @@ (define_constraint "Bz"

 (define_constraint "BC"
   "@internal integer SSE constant with all bits set operand."
-  (and (match_test "TARGET_SSE")
-   (ior (match_test "op == constm1_rtx")
-   (match_operand 0 "vector_all_ones_operand"
+  (ior (match_test "op == constm1_rtx")
+   (match_operand 0 "vector_all_ones_operand")))

 (define_constraint "BF"
   "@internal floating-point SSE constant with all bits set operand."
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 6207036a2a0..9c4e847fba1 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -14774,7 +14774,7 @@ (define_insn_and_split "*ashl3_doubleword_mask_1"

 (define_insn "ashl3_doubleword"
   [(set (match_operand:DWI 0 "register_operand" "=,")
-   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0n,r")
+   (ashift:DWI (match_operand:DWI 1 "reg_or_pm1_operand" "0BC,r")
(match_operand:QI 2 "nonmemory_operand" "c,c")))
(clobber (reg:CC FLAGS_REG))]
   ""

[Bug target/96846] [x86] Prefer xor/test/setcc over test/setcc/movzx sequence

2024-07-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96846

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #5 from Hongtao Liu  ---
Just a note: with -mapxf, GCC now generates

cmp edx, 5
setzune dl

[Bug target/115978] [x86] GCC issues an error when using -m32 -march=native on APX available machine

2024-07-24 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115978

--- Comment #10 from Hongtao Liu  ---
(In reply to H.J. Lu from comment #9)
> (In reply to Hongtao Liu from comment #8)
> > Fixed in GCC15, thanks H.J.
> 
> Does GCC 14 have the same issue with -m32 -march=native?

Yes, will backport the patch.

[Bug target/115978] [x86] GCC issues an error when using -m32 -march=native on APX available machine

2024-07-24 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115978

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|WAITING |RESOLVED

--- Comment #8 from Hongtao Liu  ---
Fixed in GCC15, thanks H.J.

[Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2024-07-23 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #48 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #47)
> Created attachment 58746 [details]
> Accoate v2di with GPR
> 
> The attached patch can allocate V2DI in GPRs to avoid the spill.
> 

@Uros Is it a good idea to make GPRs available for all 128-bit vector modes by:

1) extending *movti_internal to all 128-bit vectors, extending the related
splitter to handle moves between GPR and SSE_REG, and extending
split_double_mode to handle moves between GPR and GPR;
2) adjusting ix86_hard_regno_mode_ok to make GPRs available for all 128-bit
vectors;
3) adjusting inline_secondary_memory_needed, since we would now support
moves between GPR and SSE for 16-byte vectors?

[Bug tree-optimization/98856] [12/13/14/15 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2024-07-23 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #47 from Hongtao Liu  ---
Created attachment 58746
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58746&action=edit
Accoate v2di with GPR

The attached patch can allocate V2DI in GPRs to avoid the spill.

poly_double_le2:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rdx
        movq    8(%rsi), %rdi
        movq    (%rsi), %rsi
        movq    %rdi, %rax
        movq    %rsi, %rcx
        vmovq   %rsi, %xmm4
        sarq    $63, %rax
        shrq    $63, %rcx
        vpinsrq $1, %rdi, %xmm4, %xmm3
        andl    $135, %eax
        vpsllq  $1, %xmm3, %xmm1
        vmovq   %rax, %xmm2
        vpinsrq $1, %rcx, %xmm2, %xmm0
        vpxor   %xmm1, %xmm0, %xmm0
        vmovdqu %xmm0, (%rdx)
        ret
        .cfi_endproc

But when there's (subreg:V (reg:TI 0)) for other vector modes, the issue could
still be there.
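
For readers who prefer C, here is a scalar reading of what the assembly above computes. It is reconstructed from the instructions only, not taken from the botan source; the function and parameter names are assumptions.

```c
#include <stdint.h>

/* Scalar sketch of the assembly: shift a 128-bit value (two 64-bit
   little-endian words) left by one and fold the shifted-out bit back in. */
void poly_double_le2_sketch(uint64_t out[2], const uint64_t in[2])
{
    uint64_t lo = in[0];
    uint64_t hi = in[1];

    /* sarq $63 followed by andl $135: 135 if the top bit of hi is set. */
    uint64_t carry = (hi >> 63) ? 135u : 0u;

    /* vpsllq shifts each 64-bit lane; vpxor merges in the carry bits. */
    out[0] = (lo << 1) ^ carry;
    out[1] = (hi << 1) ^ (lo >> 63);
}
```

Both inputs are read before either output is written, mirroring the assembly, so calling it with out and in pointing at the same array also works.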

[Bug c++/116064] [15 Regression] SPEC 2017 523.xalancbmk_r failed to build

2024-07-23 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116064

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
But does GCC have a workaround similar to -fdelayed-template-parsing in Clang?

[Bug target/116043] [15 regression] TLS relocation issue when building glibc with -O3 -mavx512bf16 by r15-1619

2024-07-23 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116043

Hongtao Liu  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |liuhongt at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #15 from Hongtao Liu  ---

> I think we can exclude case when base and index are both NULL_RTX, let's
> always use *mov{si,di}_internal pattern to move const to register.

No, I misunderstood the issue. It's not a problem with the lea pattern; it's
that the address of gottpoff shouldn't be reloaded.

In PR103275, r12-5445-gb5844cb0bc8c7d9be2ff1ecded249cad82b9b71c added a new
constraint "Bk" to avoid kmovq foo@gottpoff(%rip), %k0, but RA may still
allocate a k/v register and try to reload the address, perhaps because it
thinks reloading the address is cheap?

Changing "Bk" to define_special_memory_constraint avoids the address reload
and solves the issue.

modified   gcc/config/i386/constraints.md   
@@ -187,7 +187,7 @@ (define_special_memory_constraint "Bm"  
   "@internal Vector memory operand."   
   (match_operand 0 "vector_memory_operand"))   

-(define_memory_constraint "Bk" 
+(define_special_memory_constraint "Bk" 
   "@internal TLS address that allows insn using non-integer registers."
   (and (match_operand 0 "memory_operand")  
(not (match_test "ix86_gpr_tls_address_pattern_p (op)" 

I'm testing the patch.

[Bug target/116043] [15 regression] TLS relocation issue when building glibc with -O3 -mavx512bf16

2024-07-23 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116043

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #11 from Hongtao Liu  ---
The buggy insn looks like this:

(insn:TI 348 525 527 5 (set (reg:DI 4 si [156])
        (const:DI (unspec:DI [
                    (symbol_ref:DI ("__libc_tsd_CTYPE_B") [flags 0x60])
                ] UNSPEC_GOTNTPOFF))) "/app/example.c":12:37 discrim 1 258 {*leadi}
     (nil))

The define_insn looks like this:

(define_insn "*lea"
  [(set (match_operand:SWI48 0 "register_operand" "=r")
        (match_operand:SWI48 1 "address_no_seg_operand" "Ts"))]
  "ix86_hardreg_mov_ok (operands[0], operands[1])"

;; Return true if op is a valid address for LEA, and does not contain
;; a segment override.  Defined as a special predicate to allow
;; mode-less const_int operands pass to address_operand.
(define_special_predicate "address_no_seg_operand"
  (match_test "address_operand (op, VOIDmode)")
{
  struct ix86_address parts;
  int ok;

  if (!CONST_INT_P (op)
      && mode != VOIDmode
      && GET_MODE (op) != mode)
    return false;

  ok = ix86_decompose_address (op, &parts);
  gcc_assert (ok);
  return parts.seg == ADDR_SPACE_GENERIC;
})

I think we can exclude the case where base and index are both NULL_RTX, and
always use the *mov{si,di}_internal pattern to move the constant to a register.

[Bug target/115982] [15 Regression] ICE: unrecognizable insn in ira_remove_insn_scratches with -mavx512vl since r15-1742

2024-07-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115982

--- Comment #5 from Hongtao Liu  ---
Fixed by r15-2217-ga3f03891065cb9. The bug could be latent on release branches
since GCC12.

[Bug target/115982] [15 Regression] ICE: unrecognizable insn in ira_remove_insn_scratches with -mavx512vl since r15-1742

2024-07-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115982

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
I'll take a look

[Bug tree-optimization/115994] Vectorizer failed to do vectorizaton for .sat_trunc when nunits_in / nunits_out > 2

2024-07-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115994

--- Comment #1 from Hongtao Liu  ---
Also in vect_recog_sat_trunc_pattern:

  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
  internal_fn fn = IFN_SAT_TRUNC;

  if (v_itype != NULL_TREE && v_otype != NULL_TREE
      && direct_internal_fn_supported_p (fn, tree_pair (v_otype, v_itype),
                                         OPTIMIZE_FOR_BOTH))
    {
      gcall *call = gimple_build_call_internal (fn, 1, ops[0]);
      tree out_ssa = vect_recog_temp_ssa_var (otype, NULL);


It's supposed to check for something like sstruncv8siv8hi2, but it actually
checks for sstruncv8siv16hi2, since get_vectype_for_scalar_type returns a
same-size vector type, not a same-nunits vector type.
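
The lane arithmetic behind that mismatch is simple enough to sketch. The helper name below is hypothetical, and the 256-bit vector size is an assumption chosen to match the v8si/v16hi modes mentioned above.

```c
/* Number of lanes when the total vector size is held fixed, which is how
   the comment describes get_vectype_for_scalar_type behaving. */
unsigned lanes_same_size(unsigned vector_bits, unsigned elem_bits)
{
    return vector_bits / elem_bits;
}
```

With 256-bit vectors, a 32-bit element type gives 8 lanes (v8si) while a 16-bit element type gives 16 lanes (v16hi), so the pair queried is v8si/v16hi rather than the v8si/v8hi pair the pattern is meant to test.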

[Bug tree-optimization/115994] New: Vectorizer failed to do vectorizaton for .sat_trunc when nunits_in / nunits_out > 2

2024-07-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115994

Bug ID: 115994
   Summary: Vectorizer failed to do vectorizaton for .sat_trunc
when nunits_in / nunits_out > 2
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

In vectorizable_call:

  nunits_in = TYPE_VECTOR_SUBPARTS (vectype_in);
  nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
  if (known_eq (nunits_in * 2, nunits_out))
    modifier = NARROW;
  else if (known_eq (nunits_out, nunits_in))
    modifier = NONE;
  else if (known_eq (nunits_out * 2, nunits_in))
    modifier = WIDEN;
  else
    return false;


x86 AVX512 supports vpmovusqb/vpmovusqw/vpmovusdb. Since the current vectorizer
keeps the same vector length, nunits_in / nunits_out will be greater than
2 and vectorization fails for .sat_trunc.
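
The check quoted above can be restated with plain integers to see which lane-count pairs fall through. This is a simplified sketch: the enum and function names are made up, and ordinary unsigned values stand in for the poly_int comparisons done by known_eq.

```c
/* Toy restatement of the modifier selection in the quoted snippet. */
enum toy_modifier { TOY_NARROW, TOY_NONE, TOY_WIDEN, TOY_UNSUPPORTED };

enum toy_modifier toy_classify(unsigned nunits_in, unsigned nunits_out)
{
    if (nunits_in * 2 == nunits_out)
        return TOY_NARROW;
    if (nunits_out == nunits_in)
        return TOY_NONE;
    if (nunits_out * 2 == nunits_in)
        return TOY_WIDEN;
    return TOY_UNSUPPORTED; /* corresponds to "return false" above */
}
```

For example, with 512-bit vectors a 64-bit to 8-bit truncation pairs 8 lanes with 64 lanes, and toy_classify(8, 64) lands in the unsupported branch because the lane counts differ by more than a factor of two.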

[Bug target/115978] [x86] GCC issues an error when using -m32 -march=native on APX available machine

2024-07-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115978

--- Comment #6 from Hongtao Liu  ---
(In reply to H.J. Lu from comment #5)
> (In reply to Hongtao Liu from comment #4)
> > To clarify, the question originally came from whether or not to report error
> > for -m32,-march=native, and then LLVM folks said it's difficult for LLVM not
> > issuing error for -march=native -m32, but issuing error for explicit -mapxf
> > -m32. So they want to just not issue error at all, and then compiler
> > silently disables the 64-bit only features(plus adding documents to mention
> > -m32 will disable those features).
> 
> This is no different from PR 101395.  I don't believe LLVM can't work like
> GCC.

I prefer your fix. I'll bring this back to the LLVM folks to discuss again.

[Bug target/115978] [x86] GCC issues an error when using -m32 -march=native on APX available machine

2024-07-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115978

--- Comment #4 from Hongtao Liu  ---
To clarify, the question originally came from whether or not to report an error
for -m32 -march=native. The LLVM folks said it's difficult for LLVM to not
issue an error for -march=native -m32 while issuing an error for explicit
-mapxf -m32. So they want to not issue an error at all, and have the compiler
silently disable the 64-bit-only features (plus adding documentation to mention
that -m32 will disable those features).

[Bug tree-optimization/114966] fails to optimize avx2 in-register permute written with std::experimental::simd

2024-07-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114966

--- Comment #5 from Hongtao Liu  ---
I saw the esra pass optimize a BIT_FIELD_REF of big memory into loads from
small memory:


Created a replacement for D.161366 offset: 0, size: 64: SR.20D.170101
Created a replacement for D.161366 offset: 64, size: 64: SR.21D.170102
Created a replacement for D.161366 offset: 128, size: 64: SR.22D.170103
Created a replacement for D.161547 offset: 0, size: 256: SR.23D.170104


  _8 = BIT_FIELD_REF ;
_9 = BIT_FIELD_REF ;
_10 = BIT_FIELD_REF ;
  _11 = {0, _8, _9, _10};

to 

  SR.20_3 = MEM  [(struct simd *)];
  SR.21_13 = MEM  [(struct simd *) + 8B];
  SR.22_14 = MEM  [(struct simd *) + 16B];
  _7 = SR.20_3;
  _8 = SR.21_13;
  _9 = SR.22_14;
  _10 = {0, _7, _8, _9};


So I guess that for the latter, GCC somehow can't be sure the whole 256-bit
memory is valid and fails to optimize it into a vec_perm_expr?

[Bug middle-end/115863] [15 Regression] zlib-1.3.1 miscompilation since r15-1936-g80e446e829d818

2024-07-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115863

Hongtao Liu  changed:

   What|Removed |Added

 CC||lin1.hu at intel dot com

--- Comment #16 from Hongtao Liu  ---

> Unfortunately, x86 has no vector mode .SAT_TRUNC instruction.
No, AVX512 supports both signed and unsigned saturation:
vpmovsdb:vpmovusdb
vpmovsdw:vpmovusdw
vpmovsqb:vpmovusqb
vpmovsqd:vpmovusqd
vpmovsqw:vpmovusqw
vpmovswb:vpmovuswb

and we're working on a patch to support that.
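
For reference, a scalar sketch of what signed and unsigned saturating truncation mean for one element, here 32-bit to 8-bit. The function names are illustrative; the vector instructions listed above apply this kind of clamping per lane.

```c
#include <stdint.h>

/* Unsigned saturating truncation: values above 255 clamp to 255. */
uint8_t sat_trunc_u32_to_u8(uint32_t x)
{
    return x > UINT8_MAX ? UINT8_MAX : (uint8_t)x;
}

/* Signed saturating truncation: values clamp to the range [-128, 127]. */
int8_t sat_trunc_s32_to_s8(int32_t x)
{
    if (x > INT8_MAX)
        return INT8_MAX;
    if (x < INT8_MIN)
        return INT8_MIN;
    return (int8_t)x;
}
```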

[Bug target/113711] APX instruction set and instructions longer than 15 bytes (assembly warning)

2024-07-16 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113711

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED
 CC||liuhongt at gcc dot gnu.org

--- Comment #12 from Hongtao Liu  ---
Fixed in GCC14.

[Bug target/113733] Invalid APX TLS code squence

2024-07-16 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113733
Bug 113733 depends on bug 113711, which changed state.

Bug 113711 Summary: APX instruction set and instructions longer than 15 bytes (assembly warning)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113711

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/115843] [14/15 Regression] 531.deepsjeng_r fails to verify with -O3 -march=znver4 --param vect-partial-vector-usage=2

2024-07-16 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115843

--- Comment #10 from Hongtao Liu  ---

> But using kmovw for QImode mask is not correct as we don't know the value in
> gpr. Perhaps we'd consider restrict the kmovb under avx512dq only.

Why? As long as we only care about the lower 8 bits, kmovw should be fine.
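
The arithmetic behind that argument can be sketched as follows. This models only the bit truncation, with made-up function names; whether every consumer of the mask register really ignores the upper bits is exactly what the quoted objection questions.

```c
#include <stdint.h>

/* Move 16 bits of a GPR into a mask, then let the consumer read 8 lanes. */
uint8_t low8_after_16bit_move(uint32_t gpr)
{
    uint16_t k = (uint16_t)gpr; /* 16-bit mask move */
    return (uint8_t)k;          /* consumer looks at the low 8 bits only */
}

/* Move only 8 bits of the GPR into the mask. */
uint8_t low8_after_8bit_move(uint32_t gpr)
{
    return (uint8_t)gpr;
}
```

Both functions return the same value for any input, since the low 8 bits are unaffected by how many upper bits were copied along.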

[Bug tree-optimization/115843] [14/15 Regression] 531.deepsjeng_r fails to verify with -O3 -march=znver4 --param vect-partial-vector-usage=2

2024-07-16 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115843

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #9 from Hongtao Liu  ---
Observed one missed optimization:

kxorw   %k4, %k4, %k4 # 262 [c=4 l=4]  *movqi_internal/14
vmovdqu64   %zmm0, KingPressureMask1-120(%rip){%k4}   # 44 [c=65 l=10]  avx512f_storev8di_mask
vmovdqu64   %zmm0, KingPressureMask1-56(%rip){%k4}    # 47 [c=65 l=10]  avx512f_storev8di_mask

When the mask is 0, the masked store can be optimized away.
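
A scalar model of a masked store shows why an all-zero mask makes the store dead. The function name and signature are assumptions for this sketch, with up to 8 lanes of 64 bits to mirror the v8di stores above.

```c
#include <stddef.h>
#include <stdint.h>

/* Store src[i] to dst[i] only for lanes whose mask bit is set. */
void masked_store_u64(uint64_t *dst, const uint64_t *src,
                      size_t lanes, uint8_t mask)
{
    for (size_t i = 0; i < lanes && i < 8; i++) {
        if (mask & (1u << i))
            dst[i] = src[i];
    }
}
```

With mask equal to 0 no lane is written, so the call has no effect on memory and can be dropped.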

[Bug tree-optimization/115872] [12/13/14/15 regression] ICE in fab pass (error: missing definition with -g & -O3)

2024-07-15 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115872

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC12.5, GCC13.4, GCC14.2 and main trunk.

[Bug target/115889] [15 Regression] FAIL: gcc.dg/vect/vect-vfa-03.c execution test with -march=znver4 --param vect-partial-vector-usage=1 since r15-1368-g6d0b7b69d14302

2024-07-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115889

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED
 CC||liuhongt at gcc dot gnu.org

--- Comment #9 from Hongtao Liu  ---
Fixed in GCC15.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-07-14 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 115889, which changed state.

Bug 115889 Summary: [15 Regression] FAIL: gcc.dg/vect/vect-vfa-03.c execution test with -march=znver4 --param vect-partial-vector-usage=1 since r15-1368-g6d0b7b69d14302
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115889

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug target/115842] [15 Regression] 6.5% slowdown of 548.exchange2_r on Intel Ice Lake

2024-07-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115842

--- Comment #3 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #2)
> Bisected to r15-1673-gb8153b5417bed0, the commit fixed wrong rtx_cost of
> r15-882-g1d6199e5f8c1c0 which happened to improve 548.exchange2_r.

It looks like the wrong rtx_cost of mem somehow led to better RA, with fewer
spills in the hot loop.

[Bug target/115842] [15 Regression] 6.5% slowdown of 548.exchange2_r on Intel Ice Lake

2024-07-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115842

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|UNCONFIRMED
 Ever confirmed|1   |0
   Last reconfirmed|2024-07-11 00:00:00 |
   Assignee|liuhongt at gcc dot gnu.org|unassigned at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
Bisected to r15-1673-gb8153b5417bed0. The commit fixed the wrong rtx_cost from
r15-882-g1d6199e5f8c1c0, which happened to improve 548.exchange2_r.

[Bug tree-optimization/115872] [12/13/14/15 regression] ICE in fab pass (error: missing definition with -g & -O3)

2024-07-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115872

Hongtao Liu  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |liuhongt at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #2 from Hongtao Liu  ---
Mine.

[Bug target/115842] [15 Regression] 6.5% slowdown of 548.exchange2_r on Intel Ice Lake

2024-07-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115842

Hongtao Liu  changed:

   What|Removed |Added

   Last reconfirmed||2024-07-11
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |liuhongt at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Hongtao Liu  ---
I'll take a look.

[Bug tree-optimization/115833] SLP of signed short multiply goes wrong

2024-07-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833

Hongtao Liu  changed:

   What|Removed |Added

 CC||lin1.hu at intel dot com

--- Comment #4 from Hongtao Liu  ---
> is a bit odd for the packing.  Possibly the target lacks a truncv4siv4hi
> operation (thus the explicit zero vector).  Possibly x86 lacks a
> pack-lowpart/pack-highpart insn.

We support truncv4siv4hi2 under AVX2; without AVX512, it generates pshufb.

(define_expand "trunc2"
  [(set (match_operand: 0 "register_operand")
        (truncate:
          (match_operand:PMOV_SRC_MODE_4 1 "register_operand")))]
  "TARGET_AVX2"
{


bar(unsigned int __vector(4)):
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
ret

Without AVX2, it's lowered to

  _12 = VEC_PACK_TRUNC_EXPR <_9, { 0, 0, 0, 0 }>;
  _13 = BIT_FIELD_REF <_12, 64, 0>;

vec_pack_trunc_expr uses packusdw with the upper 16 bits cleared.

The optab can be extended to TARGET_SSSE3 which supports pshufb.
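
To show how a byte shuffle can implement this truncation, here is a scalar model of a 16-byte pshufb together with a control vector that keeps the low two bytes of each 32-bit lane. The control vector is a guess at what such a constant would look like; the actual .LC0 contents are not shown in this thread, and the names are made up.

```c
#include <stdint.h>

/* Scalar model of a 128-bit byte shuffle: a control byte with bit 7 set
   zeroes the result byte, otherwise its low 4 bits select a source byte. */
void toy_pshufb(uint8_t dst[16], const uint8_t src[16], const uint8_t ctrl[16])
{
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0f];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Hypothetical control vector: bytes 0-1 of each little-endian 32-bit lane
   go to the low 8 result bytes, and the upper 8 result bytes are zeroed. */
const uint8_t toy_trunc_v4si_v4hi_ctrl[16] = {
    0, 1, 4, 5, 8, 9, 12, 13,
    0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
};
```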

[Bug tree-optimization/115833] SLP of signed short multiply goes wrong

2024-07-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu  ---
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs.  Possibly
> constructing a V4SImode vector and then packing to V4HImode would be
> better?

void v4hi_contruct(signed short *t, signed short tt, short tt1)
{
  t[0] = tt;
  t[1] = tt1;
  t[2] = tt1;
  t[3] = tt1;
}


void v4si_contruct(int *t, int tt, int tt2)
{
  t[0] = tt;
  t[1] = tt2;
  t[2] = tt2;
  t[3] = tt2;
}

v4hi_contruct(short*, short, short):
movzx   eax, dx
movzx   esi, si
mov rdx, rax
sal rdx, 16
or  rdx, rax
sal rdx, 16
or  rdx, rax
sal rdx, 16
or  rdx, rsi
mov QWORD PTR [rdi], rdx
ret
v4si_contruct(int*, int, int):
vmovd   xmm2, edx
vmovd   xmm3, esi
vpinsrd xmm1, xmm2, edx, 1
vpinsrd xmm0, xmm3, edx, 1
vpunpcklqdq xmm0, xmm0, xmm1
vmovdqu XMMWORD PTR [rdi], xmm0
ret

Both vmovd and vpinsrd are expensive, and v4hi_contruct is not necessarily worse
than v4si_contruct, but v4hi_contruct can be optimized to be a little more
parallel via GPRs.
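
One way the GPR sequence could be made more parallel is to build the two 32-bit halves independently and combine them once, instead of a single chain of shift/or steps. This is only a sketch of the idea with a made-up function name, not what GCC emits; it assumes a little-endian layout where lane 0 sits in the low bits.

```c
#include <stdint.h>

/* Pack {tt, tt1, tt1, tt1} into one 64-bit value. The two halves have no
   data dependency on each other, so they can be computed in parallel. */
uint64_t v4hi_pack_parallel(uint16_t tt, uint16_t tt1)
{
    uint32_t lo = (uint32_t)tt  | ((uint32_t)tt1 << 16); /* lanes 0 and 1 */
    uint32_t hi = (uint32_t)tt1 | ((uint32_t)tt1 << 16); /* lanes 2 and 3 */
    return (uint64_t)lo | ((uint64_t)hi << 32);
}
```

The dependency chain here is one shift/or per half plus a final shift/or, compared with the three serial shift/or pairs in the assembly shown above.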

[Bug target/113312] Add attribute((no_callee_saved_registers)) for Intel FRED

2024-07-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113312

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #29 from Hongtao Liu  ---
.

[Bug target/113312] Add attribute((no_callee_saved_registers)) for Intel FRED

2024-07-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113312

--- Comment #28 from Hongtao Liu  ---
__attribute__((no_callee_saved_registers)) is added in GCC14.

[Bug target/113733] Invalid APX TLS code squence

2024-07-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113733

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED
 CC||liuhongt at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
Fixed in GCC14.

[Bug target/115115] [12/13/14/15 Regression] highway-1.0.7 wrong _mm_cvttps_epi32() constant fold

2024-07-09 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115115

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #16 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/115796] [15 Regression] build failure since double_u -> __double_u change

2024-07-08 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115796

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs

2024-07-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Hongtao Liu  changed:

   What|Removed |Added

 CC||haochen.jiang at intel dot com,
   ||liuhongt at gcc dot gnu.org

--- Comment #10 from Hongtao Liu  ---
> One of the comments in PR 115756 was "I'd lean towards shift+add because for
> example Intel E-cores have a slow imul.". However, my benchmarks suggest
> that even on Intel Efficiency CPU cores the algorithm with 2 multiplication
> instructions is faster. (I used the Process Lasso tool on Windows 11 to
> force the benchmark to be run on an Efficiency CPU core).

@haochen, could you try the benchmark on our Sierra Forest machine?
I'm OK with adjusting the rtx_cost of imulq from COSTS_N_INSNS (4) to
COSTS_N_INSNS (3) if the performance test looks OK.

[Bug target/115755] mulx (with -mbmi2) does not show up with constant multiply

2024-07-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115755

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #1 from Hongtao Liu  ---
mulx doesn't support an immediate operand, so a register is still needed to hold 123.
mulq is used, so func/func1 should be OK.

[Bug target/115756] default tuning for x86_64 produces shifts for `*240`

2024-07-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115756

--- Comment #3 from Hongtao Liu  ---
The current rtx_cost for imulq in generic_cost is COSTS_N_INSNS (4); changing it to
COSTS_N_INSNS (3) would make GCC generate imulq.


  {COSTS_N_INSNS (3),   /* cost of starting multiply for QI */
   COSTS_N_INSNS (4),   /*   HI */
   COSTS_N_INSNS (3),   /*   SI */
   COSTS_N_INSNS (4),   /*   DI */
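
To illustrate what the cost value decides, here is a minimal sketch of the two equivalent forms of a multiply by 240. The function names are hypothetical, and the decomposition 240 = 256 - 16 is only one possible shift sequence, not necessarily the one GCC emits; which form the compiler picks for the first function depends on the multiply cost in the active cost table.

```c
#include <stdint.h>

/* Plain multiply.  Whether this ends up as imul or as shifts is
   decided by the compiler's cost model.  */
uint64_t
mul240 (uint64_t x)
{
  return x * 240;
}

/* Hand-written shift form (an assumed decomposition):
   240 = 256 - 16, so x * 240 == (x << 8) - (x << 4) modulo 2^64.  */
uint64_t
mul240_shift (uint64_t x)
{
  return (x << 8) - (x << 4);
}
```

Comparing the assembly of both functions at -O2 before and after a cost change is a quick way to see the effect.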

[Bug target/115748] [15 Regression] gcc.target/i386/avx512bw-pr70509.c SIGILL with -m32

2024-07-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115748

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Hongtao Liu  ---
Fixed in GCC15

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2024-07-02 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #23 from Hongtao Liu  ---
(In reply to edison from comment #22)
> for 607.cactuBSSN_s,if use preENV_GOMP_CPU_AFFINITY = 0-23 in CPU2017 .cfg,
> all  p-core(i9-13900k) usage will down to 15%(the e-core almost 100%), if
> comment out it all p-core usage will up to 60%.
> 
> 607.cactuBSSN_s on i9-13900K
> gcc 14.1
> 
> preENV_GOMP_CPU_AFFINITY = 0-23：   60.1 (-41.7 % slower)
> # preENV_GOMP_CPU_AFFINITY = 0-23： 103
> 
> but for AMD Zen4(+) that maybe another story so far(AMD Zen4 need
> preENV_GOMP_CPU_AFFINITY to make the threads run on high performance core
> first).

Because E-cores run slower than P-cores, binding each thread to a core
prevents threads from migrating from an E-core to a P-core.

[Bug target/115748] [15 Regression] gcc.target/i386/avx512bw-pr70509.c SIGILL with -m32

2024-07-02 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115748

Hongtao Liu  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-07-02

--- Comment #2 from Hongtao Liu  ---
We can move that part into a separate function and add a target attribute to
it.
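
A minimal sketch of that pattern, assuming AVX2 as the ISA and hypothetical function names (the actual testcase uses different code): only the helper is compiled with the extra ISA enabled, and it is called only after a runtime CPU check, so the rest of the file cannot pick up instructions that would raise SIGILL on older hardware.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
/* Only this function is compiled with AVX2 enabled.  noinline keeps
   its body from being merged into callers.  */
__attribute__((target ("avx2"), noinline))
static int
sum_avx2 (const int *a, size_t n)
{
  int s = 0;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}
#endif

/* Baseline version, compiled with the default ISA.  */
static int
sum_generic (const int *a, size_t n)
{
  int s = 0;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Dispatch: take the AVX2 path only when the CPU reports support.  */
int
sum_checked (const int *a, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
  if (__builtin_cpu_supports ("avx2"))
    return sum_avx2 (a, n);
#endif
  return sum_generic (a, n);
}
```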

[Bug target/107432] __builtin_convertvector generates inefficient code

2024-07-02 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107432

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 CC||liuhongt at gcc dot gnu.org
 Status|NEW |RESOLVED

--- Comment #13 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/114189] Target implements obsolete vcond{,u,eq} expanders

2024-06-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114189
Bug 114189 depends on bug 115517, which changed state.

Bug 115517 Summary: Fix x86 regressions after dropping uses of 
vcond{,u,eq}_optab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab

2024-06-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #15 from Hongtao Liu  ---
Fixed.

[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab

2024-06-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

--- Comment #14 from Hongtao Liu  ---
Regressions with SSE4.1 and above are fixed in GCC15; SSE2 regressions are tracked in
PR115683.

[Bug target/115610] -flate-combine disabled by default for x86 port

2024-06-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115610

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Hongtao Liu  ---
Fixed by r15-1735-ge62ea4fb8ffcab

[Bug tree-optimization/115693] 8 std::byte std::array comparison potential missed optimization

2024-06-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115693

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #6 from Hongtao Liu  ---

> > 
> > So it makes more sense to fix this in the optimization passes, instead of
> > ad-hoc hack in libstdc++.
> > 
> > But I'm not sure if there already exists a dup.
> 
> Let's keep this bug for the above testcase(s).  For test() the issue is
> that even with SSE4.1 we don't seem to support ptest for V8QImode?
With SSE4.1 and above, we can support cbranchv8qi (and other 32/64-bit vectors)
with pmovzxv8qiv8hi + cbranchv8hi.
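
As a scalar model of why this works (a sketch with hypothetical helper names, not the backend code): zero-extending every byte lane to 16 bits preserves lane equality, so a compare-and-branch on the widened vectors gives the same answer as one on the original 8-byte vectors.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of pmovzxbw: zero-extend eight bytes to eight
   16-bit lanes.  */
static void
zext_v8qi_to_v8hi (const uint8_t in[8], uint16_t out[8])
{
  for (int i = 0; i < 8; i++)
    out[i] = in[i];
}

/* Scalar model of a branch on "a != b" performed on the widened
   vectors.  Returns nonzero when any lane differs.  */
int
v8qi_differs (const uint8_t a[8], const uint8_t b[8])
{
  uint16_t wa[8], wb[8];

  zext_v8qi_to_v8hi (a, wa);
  zext_v8qi_to_v8hi (b, wb);
  return memcmp (wa, wb, sizeof wa) != 0;
}
```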

[Bug middle-end/115675] [15 Regression] truncv4hiv4qi affect r14-1402-gd8545fb2c71683's optimization.

2024-06-27 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115675

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
(In reply to Richard Biener from comment #1)
> so it's now SLP vectorized?

Yes, the vectorization does not look reasonable. It used to be vectorized as a v4qi
vector CTOR + v4qi vector store. Now it's vectorized as a v4hi vector CTOR +
truncv4hiv4qi + v4qi vector store.

[Bug target/115683] New: SSE2 regressions after obselete of vcond{,u,eq}.

2024-06-27 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115683

Bug ID: 115683
   Summary: SSE2 regressions after obselete of vcond{,u,eq}.
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

Whole failure list.
g++: g++.target/i386/pr100637-1b.C  -std=gnu++14  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++17  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++20  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++98  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++14  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++17  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++20  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++98  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++14  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++17  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++20  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++98  scan-assembler-times pcmpeqb 2
gcc: gcc.target/i386/pr88540.c scan-assembler minpd



There is 1 extra pcmpeq instruction generated in the 3 testcases below for the
GTU comparison. x86 doesn't support a native GTU comparison, so it uses psubusw +
pcmpeq + pcmpeq instead. The second pcmpeq is used to negate the mask, and the
negation could be eliminated in the vcond{,u,eq} expander by just swapping
if_true and if_else.

g++: g++.target/i386/pr100637-1b.C
g++: g++.target/i386/pr100637-1w.C
g++: g++.target/i386/pr103861-1.C
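
The mask negation and the arm swap can be sketched per 16-bit lane as follows. This is a scalar model with hypothetical helper names, assuming the psubusw + pcmpeq sequence described above; it is not the expander code.

```c
#include <stdint.h>

/* One 16-bit lane of psubusw: unsigned saturating subtract.  */
static uint16_t
sat_sub_u16 (uint16_t a, uint16_t b)
{
  return a > b ? (uint16_t) (a - b) : 0;
}

/* One lane of pcmpeqw: all-ones if equal, else zero.  */
static uint16_t
cmpeq_u16 (uint16_t a, uint16_t b)
{
  return a == b ? 0xFFFF : 0;
}

/* Bitwise select: mask ? t : f, with mask all-ones or all-zeros.  */
static uint16_t
blend_u16 (uint16_t mask, uint16_t t, uint16_t f)
{
  return (uint16_t) ((mask & t) | (~mask & f));
}

/* a >u b ? t : f with an explicit mask negation (the second pcmpeq,
   comparing the mask against zero).  */
uint16_t
select_gtu_negated (uint16_t a, uint16_t b, uint16_t t, uint16_t f)
{
  uint16_t leu = cmpeq_u16 (sat_sub_u16 (a, b), 0);	/* a <=u b */
  uint16_t gtu = cmpeq_u16 (leu, 0);			/* negate the mask */
  return blend_u16 (gtu, t, f);
}

/* Same result without the negation: keep the LEU mask and swap the
   two arms of the select.  */
uint16_t
select_gtu_swapped (uint16_t a, uint16_t b, uint16_t t, uint16_t f)
{
  uint16_t leu = cmpeq_u16 (sat_sub_u16 (a, b), 0);
  return blend_u16 (leu, f, t);
}
```

Both functions return the same value for every input, which is why the second pcmpeq is redundant once the select arms can be swapped.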


This one may be a little more difficult: the x86-specific floating-point
min/max{ps,pd} instructions are an exact match of a < b ? a : b (min) and
a > b ? a : b (max), and are not IEEE-conformant.

gcc: gcc.target/i386/pr88540.c scan-assembler minpd
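
A scalar sketch of the per-lane minpd behaviour, to show where it departs from an IEEE minimum (x86_min is a hypothetical name used only for illustration): the second operand is returned whenever a < b is false, which covers NaN inputs and the 0.0 / -0.0 pair.

```c
#include <math.h>

/* Scalar model of one minpd lane: the first operand is returned only
   when a < b holds; in every other case (including a NaN in either
   operand, or two zeros of any sign) the second operand is returned.  */
double
x86_min (double a, double b)
{
  return a < b ? a : b;
}
```

So x86_min (1.0, NAN) is NaN while x86_min (NAN, 1.0) is 1.0, and the result for two zeros depends on operand order. That asymmetry is why the instruction can only be matched against the exact conditional form.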

[Bug target/115462] [15 regression] 416.gamess regressed 4-6% on x86_64 since r15-882-g1d6199e5f8c1c0

2024-06-27 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115462

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Hongtao Liu  ---
Fixed in GCC15.

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2024-06-27 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 115462, which changed state.

Bug 115462 Summary: [15 regression] 416.gamess regressed 4-6% on x86_64 since 
r15-882-g1d6199e5f8c1c0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115462

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/115450] [15 Regression] cpu2017 502.gcc runtime miscompute on aarch64 with SVE since r15-1006-gd93353e6423eca

2024-06-26 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115450

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #5 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #1)
> >[r15-1006-gd93353e6423eca] Do single-lane SLP discovery for reductions
> 
> 
> Interesting because PR 115256 bisect it to an earlier patch.

For PR 115256, the issue is fixed after adding -fno-strict-aliasing.

[Bug target/115610] -flate-combine disabled by default for x86 port

2024-06-24 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115610

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org
   Last reconfirmed||2024-06-24
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

--- Comment #1 from Hongtao Liu  ---
Thanks, I'll take a look.

[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c

2024-06-23 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406

--- Comment #7 from Hongtao Liu  ---

> 
> BTW, when assign -1 to vector(1) , should the upper bit be
> cleared? Look like only 1 element boolean vector is cleared, but not
> vector(2) .
> If the upper bits are not cleared, both 2 cases are equal.

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 710d697c021..0f045f851d1 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8077,7 +8077,7 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,
 {
   tree itype = TREE_TYPE (TREE_TYPE (expr));
   if (VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (expr))
-  && TYPE_PRECISION (itype) <= BITS_PER_UNIT)
+  && TYPE_PRECISION (itype) < BITS_PER_UNIT)
 {
   /* This is the only case in which elements can be smaller than a byte.
 Element 0 is always in the lsb of the containing byte.  */


This change can fix it.

It looks like this path is supposed to handle an itype precision *less than*, not
*less than or equal to*, BITS_PER_UNIT?
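
For context, a minimal sketch of what "elements smaller than a byte" means for the encoding. This is an illustration with a hypothetical function, not the GCC implementation: several boolean elements share one byte, and element 0 sits in the least significant bits. When the element precision equals 8, each element already fills a whole byte and no packing is needed.

```c
#include <stdint.h>
#include <string.h>

/* Pack NELTS boolean elements of ELT_BITS bits each into OUT.
   Assumes ELT_BITS is 1, 2 or 4, i.e. strictly less than 8.
   Element 0 goes into the least significant bits of byte 0; a true
   element is stored as all-ones in its ELT_BITS (an assumption of
   this sketch), a false one as zeros.  */
void
pack_bool_elts (const int *elts, unsigned nelts, unsigned elt_bits,
		uint8_t *out, unsigned out_len)
{
  unsigned elts_per_byte = 8 / elt_bits;
  unsigned elt_mask = (1u << elt_bits) - 1;

  memset (out, 0, out_len);
  for (unsigned i = 0; i < nelts; i++)
    {
      unsigned byte = i / elts_per_byte;
      unsigned shift = (i % elts_per_byte) * elt_bits;

      if (byte >= out_len)
	break;
      if (elts[i])
	out[byte] |= (uint8_t) (elt_mask << shift);
    }
}
```

With a single 1-bit element set to true the packed byte is 0x01, so the upper seven bits stay clear.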

[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab

2024-06-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

--- Comment #6 from Hongtao Liu  ---
(In reply to rguent...@suse.de from comment #5)
> On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
> > 
> > --- Comment #4 from Hongtao Liu  ---
> > (In reply to rguent...@suse.de from comment #3)
> > > On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
> > > > 
> > > > --- Comment #2 from Hongtao Liu  ---
> > > > (In reply to Richard Biener from comment #1)
> > > > > Btw, I had opened PR115490 with my results for this already.  Some 
> > > > > mitigation
> > > > > should be from optimizing ISEL expansion to vcond_mask and I'd start 
> > > > > with
> > > > > looking at some of the fallout from that side (note that might require
> > > > > the backend reject not natively implemented vec_cmp via its operand 1
> > > > > predicate)
> > > > 
> > > > w/o AVX512, vector integer comparison only supports EQ/GT, others 
> > > > comparison
> > > > rtx_cost is transformed to that. (.i.e GTU is emulated with us_minus + 
> > > > eq +
> > > > negative the vector mask)
> > > > If we restrict the predicate of operand 1, would middle-end reject
> > > > vectorization (or lower it to scalar version)?
> > > 
> > > Richard suggests that we implement the "obvious" transforms like
> > > inversion in the middle-end but if for example unsigned compares
> > > are not supported the us_minus + eq + negative trick isn't on
> > > that list.
> > > 
> > > The main reason to restrict vec_cmp would be to avoid
> > > a <= b ? c : d going with an unsupported vec_cmp but instead
> > > do a > b ? d : c - the alternative is trying to fix this
> > > on the RTL side via combine.  I understand the non-native
> > 
> > Yes, I have a patch which can fix most regressions via pattern match in
> > combine.
> > Still there is a situation that is difficult to deal with, mainly the
> > optimization w/o sse4.1 . Because pblendvb/blendvps/blendvpd only exists 
> > under
> > sse4.1, w/o sse4.1, it takes 3 instructions (pand,pandn,por) to simulate the
> > vcond_mask, and the combine matches up to 4 instructions, which makes it
> > currently impossible to use the combine to recover those optimizations in 
> > the
> > vcond{,u,eq}.i.e min/max.
> > In the case of sse 4.1 and above, there is basically no regression anymore.
> 
> Maybe it's possible to use a define_insn_and_split for blends w/o SSE 4.1?
> That would allow combine matching the high-level blend operation and
> we'd only lower it afterwards?  The question is what we lose in
> combinations of/into the loweredn pand/pandn/por of course.
I'd rather live with those regressions since they only exist below SSE4.1.
> 
> Maybe it's possible to catch the higher-level optimization (min/max)
> on the GIMPLE level instead?
For the integral part, I believe the optimization is already there at the GIMPLE level.
For the floating-point part, x86 {max,min}{ps,pd} is not IEEE-conformant; it's an
exact match of cond_expr a < b ? a : b (w/ consideration of -0.0 and NaN).

[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab

2024-06-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

--- Comment #4 from Hongtao Liu  ---
(In reply to rguent...@suse.de from comment #3)
> On Tue, 18 Jun 2024, liuhongt at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517
> > 
> > --- Comment #2 from Hongtao Liu  ---
> > (In reply to Richard Biener from comment #1)
> > > Btw, I had opened PR115490 with my results for this already.  Some 
> > > mitigation
> > > should be from optimizing ISEL expansion to vcond_mask and I'd start with
> > > looking at some of the fallout from that side (note that might require
> > > the backend reject not natively implemented vec_cmp via its operand 1
> > > predicate)
> > 
> > w/o AVX512, vector integer comparison only supports EQ/GT, others comparison
> > rtx_cost is transformed to that. (.i.e GTU is emulated with us_minus + eq +
> > negative the vector mask)
> > If we restrict the predicate of operand 1, would middle-end reject
> > vectorization (or lower it to scalar version)?
> 
> Richard suggests that we implement the "obvious" transforms like
> inversion in the middle-end but if for example unsigned compares
> are not supported the us_minus + eq + negative trick isn't on
> that list.
> 
> The main reason to restrict vec_cmp would be to avoid
> a <= b ? c : d going with an unsupported vec_cmp but instead
> do a > b ? d : c - the alternative is trying to fix this
> on the RTL side via combine.  I understand the non-native

Yes, I have a patch which can fix most regressions via pattern matching in
combine.
One situation is still difficult to deal with, mainly the optimization
w/o SSE4.1. Because pblendvb/blendvps/blendvpd only exist under SSE4.1,
w/o SSE4.1 it takes 3 instructions (pand, pandn, por) to simulate
vcond_mask, and combine matches at most 4 instructions, which currently makes
it impossible to use combine to recover the optimizations done in
vcond{,u,eq}, e.g. min/max.
With SSE4.1 and above, there is basically no regression anymore.
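
A per-lane sketch of the three-instruction blend, with hypothetical names. It shows why a min/max written as compare + blend is already four operations without SSE4.1, leaving combine no room to match anything around it.

```c
#include <stdint.h>

/* Per-lane model of vcond_mask without SSE4.1: three logical ops.
   MASK is expected to be all-ones or all-zeros.  */
uint32_t
blend_and_andn_or (uint32_t mask, uint32_t if_true, uint32_t if_false)
{
  uint32_t t = mask & if_true;		/* pand  */
  uint32_t f = ~mask & if_false;	/* pandn */
  return t | f;				/* por   */
}

/* Unsigned max expressed as compare + blend: one compare plus the
   three logical ops above.  */
uint32_t
umax_via_blend (uint32_t a, uint32_t b)
{
  /* Stand-in for the vector compare producing an all-ones/zero mask.  */
  uint32_t mask = a > b ? 0xFFFFFFFFu : 0;
  return blend_and_andn_or (mask, a, b);
}
```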


the regression testcases w/o sse4.1

FAIL: g++.target/i386/pr100637-1b.C  -std=gnu++14  scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C  -std=gnu++17  scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C  -std=gnu++20  scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C  -std=gnu++98  scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1w.C  -std=gnu++14  scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C  -std=gnu++17  scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C  -std=gnu++20  scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C  -std=gnu++98  scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr103861-1.C  -std=gnu++14  scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C  -std=gnu++17  scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C  -std=gnu++20  scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C  -std=gnu++98  scan-assembler-times pcmpeqb 2
FAIL: gcc.target/i386/pr88540.c scan-assembler minpd

[Bug target/115517] Fix x86 regressions after dropping uses of vcond{,u,eq}_optab

2024-06-18 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

--- Comment #2 from Hongtao Liu  ---
(In reply to Richard Biener from comment #1)
> Btw, I had opened PR115490 with my results for this already.  Some mitigation
> should be from optimizing ISEL expansion to vcond_mask and I'd start with
> looking at some of the fallout from that side (note that might require
> the backend reject not natively implemented vec_cmp via its operand 1
> predicate)

W/o AVX512, vector integer comparison only supports EQ/GT; other comparison
rtx codes are transformed into those (i.e. GTU is emulated with us_minus + eq +
negation of the vector mask).
If we restrict the predicate of operand 1, would the middle-end reject
vectorization (or lower it to a scalar version)?

[Bug target/115517] New: Fix regression after dropping uses of vcond{,u,eq}_optab

2024-06-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115517

Bug ID: 115517
   Summary: Fix regression after dropping uses of
vcond{,u,eq}_optab
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
Depends on: 114189
  Target Milestone: ---
Target: x86_64-*-* i?86-*-*

> I'd appreciate testing, I do not expect fallout for x86 or arm/aarch64.
> > I know riscv doesn't implement any of the legacy optabs.  But less
> > maintained vector targets might need adjustments.
> >
> At GCC14, I tried to remove these expanders in the x86 backend, and it
> regressed some testcases, mainly because of the optimizations we did
> in ix86_expand_{int,fp}_vcond.
> I've started testing your patch, it's possible that we still need to
> move the ix86_expand_{int,fp}_vcond optimizations to the
> middle-end(isel or match.pd)or add extra patterns to handle it at the
> rtl pas_combine.
These are new failures I got

g++: g++.target/i386/avx-pr54700-1.C   scan-assembler-not vpcmpgt[bdq]
g++: g++.target/i386/avx-pr54700-1.C   scan-assembler-times vblendvpd 4
g++: g++.target/i386/avx-pr54700-1.C   scan-assembler-times vblendvps 4
g++: g++.target/i386/avx-pr54700-1.C   scan-assembler-times vpblendvb 2
g++: g++.target/i386/avx2-pr54700-1.C   scan-assembler-not vpcmpgt[bdq]
g++: g++.target/i386/avx2-pr54700-1.C   scan-assembler-times vblendvpd 4
g++: g++.target/i386/avx2-pr54700-1.C   scan-assembler-times vblendvps 4
g++: g++.target/i386/avx2-pr54700-1.C   scan-assembler-times vpblendvb 2
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++14  scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++14  scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++17  scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++17  scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++20  scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++20  scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++98  scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++98  scan-assembler-times vminph 3
g++: g++.target/i386/pr100637-1b.C  -std=gnu++14  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++17  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++20  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++98  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++14  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++17  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++20  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++98  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100738-1.C  -std=gnu++14  scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++14  scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++14  scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C  -std=gnu++17  scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++17  scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++17  scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C  -std=gnu++20  scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++20  scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++20  scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C  -std=gnu++98  scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++98  scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++98  scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++14  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++17  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++20  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++98  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr61747.C  -std=gnu++14  scan-assembler-times max 4
g++: g++.target/i386/pr61747.C  -std=gnu++14  scan-assembler-times min 4
g++: g++.target/i386/pr61747.C  -std=gnu++17  scan-assembler-times max 4
g++: g++.target/i386/pr61747.C  -std=gnu++17  scan-assembler-times min 4
g++: g++.target/i386/pr61747.C  -std=g

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-06-13 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

--- Comment #5 from Hongtao Liu  ---
It's fixed by r15-1100-gec985bc97a0157

[Bug target/115463] [15 regression] 526.blender_r regressed 5% on Zen2 with -Ofast -flto -march=native since r15-1058-gc989e59fc99d99

2024-06-13 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115463

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
should be fixed by r15-1293-g83a765768510d1f329887116757d6818d7846717.

[Bug target/115462] [15 regression] 416.gamess regressed 4-6% on x86_64 since r15-882-g1d6199e5f8c1c0

2024-06-13 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115462

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
(In reply to Richard Biener from comment #1)
> it might possibly affect IVOPTs

Probably, we're investigating.

[Bug target/115452] ICE when dump stv2 for gcc.target/i386/pr70322-2.c with -march=cascadelake

2024-06-12 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115452

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Hongtao Liu  ---
Fixed in GCC15.

[Bug target/115452] New: ICE when dump stv2 for gcc.target/i386/pr70322-2.c with -march=cascadelake

2024-06-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115452

Bug ID: 115452
   Summary: ICE when dump stv2 for gcc.target/i386/pr70322-2.c
with -march=cascadelake
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

gcc -m32 -march=cascadelake ./gcc/testsuite/gcc.target/i386/pr70322-2.c -mstv -mno-bmi -S -Os -fdump-rtl-stv2-details

./gcc/testsuite/gcc.target/i386/pr70322-2.c: In function ‘foo’:
./gcc/testsuite/gcc.target/i386/pr70322-2.c:12:1: internal compiler error: RTL check: expected code 'reg', have 'subreg' in rhs_regno, at rtl.h:1934
   12 | }
  | ^
0x88ef75 rtl_check_failed_code1(rtx_def const*, rtx_code, char const*, int, char const*)
./gcc/rtl.cc:770
0x96be78 rhs_regno(rtx_def const*)
./gcc/rtl.h:1934
0x96cd8d rhs_regno(rtx_def const*)
./genrtl.h:38
0x96cd8d convert_op
./gcc/config/i386/i386-features.cc:1056
0x1af7711 convert_insn
./gcc/config/i386/i386-features.cc:1468
0x1af9808 convert
./gcc/config/i386/i386-features.cc:1987
0x1af9808 convert_scalars_to_vector
./gcc/config/i386/i386-features.cc:2536
0x1af9808 execute
   ./gcc/config/i386/i386-features.cc:2750


cut from i386-features.cc:1056---

  if (dump_file)
    fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
             INSN_UID (insn), REGNO (tmp));
--cut end---

Looks like tmp is SUBREG.

[Bug rtl-optimization/115384] [15 Regression] ICE: RTL check: expected code 'const_int', have 'const_wide_int' in simplify_binary_operation_1, at simplify-rtx.cc:4088 since r15-1047-g7876cde25cbd2f

2024-06-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115384

Hongtao Liu  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Hongtao Liu  ---
Fixed.

[Bug testsuite/115365] New test case gcc.dg/pr100927.c from r15-1022-gb05288d1f1e4b6 fails

2024-06-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115365

--- Comment #7 from Hongtao Liu  ---
+/* { dg-final { scan-rtl-dump-times {(?n)^(?!.*REG_EQUIV)(?=.*\(fix:SI)} 3 "final" } }  */

Does this fix the testcase on solaris2?

[Bug target/115418] Extra movapd emitted for MAX implementation

2024-06-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115418

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #2)
> Note the issue is ix86_expand_sse_fp_minmax only handles LT/UNGE but it
> should handle GT/UNLT with both parts swapped (comparison and true/false).
> 
GT/UNLT is "canonicalized" to LT/UNGT in ix86_prepare_sse_fp_compare_args:
    case GE:
    case GT:
    case UNLE:
    case UNLT:
      /* These are not supported directly before AVX, and furthermore
         ix86_expand_sse_fp_minmax only optimizes LT/UNGE.  Swap the
         comparison operands to transform into something that is
         supported.  */
      std::swap (*pop0, *pop1);
      code = swap_condition (code);
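
The effect of swapping operands together with the condition can be checked with a small scalar sketch. The function names are hypothetical; UNLT and UNGT are modelled as "unordered or less than" and "unordered or greater than".

```c
#include <math.h>

/* GT with its operands swapped becomes LT: a > b is the same as b < a.  */
int gt (double a, double b) { return a > b; }
int lt_swapped (double a, double b) { return b < a; }

/* UNLT (unordered or a < b) with its operands swapped becomes
   UNGT (unordered or b > a).  Both are true when either input is NaN.  */
int unlt (double a, double b) { return !(a >= b); }
int ungt_swapped (double a, double b) { return !(b <= a); }
```

Each pair agrees on every input, including NaN, which is what makes the swap a valid canonicalization.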

[Bug testsuite/115365] New test case gcc.dg/pr100927.c from r15-1022-gb05288d1f1e4b6 fails

2024-06-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115365

Hongtao Liu  changed:

   What|Removed |Added

 Target|powerpc64le-linux-gnu,  |powerpc64le-linux-gnu,
   |sparc-sun-solaris2.11   |sparc-sun-solaris2.11,
   ||arm-eabi, cortex-m0

--- Comment #6 from Hongtao Liu  ---
Also fails on arm-eabi cortex-m0.

[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c

2024-06-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406

--- Comment #6 from Hongtao Liu  ---
For a 1-element vector, when the backend doesn't support its vector mode, the scalar
mode is used for the type, which makes expand_vec_cond_expr_p use QImode for the
icode check (vcond_mask_qiqi).

It could also be the case when both the data type and cmp_type are
vector_boolean_type.

It looks like vcond_mask_qiqi has two possible meanings.
For the former, it should be
  operands[3] == 1 ? operands[1] : operands[2]

since the mask is a 1-element boolean vector.

For the latter, it should be
  (operands[1] & operands[3]) | (operands[2] & ~operands[3])


BTW, when assigning -1 to vector(1) , should the upper bits be
cleared? It looks like only the 1-element boolean vector is cleared, but not
vector(2).
If the upper bits are not cleared, the two cases are equal.
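
The two readings can be compared with a small scalar sketch. The function names are hypothetical, and the first reading is modelled as testing only bit 0 of the mask, on the assumption that a 1-element boolean vector has a single significant bit.

```c
#include <stdint.h>

/* Reading 1: the mask is a 1-element boolean vector held in QImode,
   so only bit 0 decides which arm is taken.  */
int8_t
select_scalar_bool (int8_t if_true, int8_t if_false, int8_t mask)
{
  return (mask & 1) ? if_true : if_false;
}

/* Reading 2: data and mask are both boolean vectors, so the select
   is a bitwise blend.  */
int8_t
select_bitwise (int8_t if_true, int8_t if_false, int8_t mask)
{
  return (int8_t) ((if_true & mask) | (if_false & ~mask));
}
```

With mask = -1 (upper bits kept) both readings give if_true. With mask = 1 (upper bits cleared), if_true = -1 and if_false = 0, the first reading gives -1 while the bitwise blend gives 1, which is the value that trips the abort in the testcase.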

[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c

2024-06-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406

--- Comment #5 from Hongtao Liu  ---
>   _2 = VEC_COND_EXPR <_1, { -1 }, { 0 }>;

Hmm, it should check vcond_mask_qiv1qi instead of vcond_mask_qiqi. I guess that
since the backend doesn't support v1qi, TYPE_MODE of V is QImode, so it
wrongly checked vcond_mask_qiqi.

[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c

2024-06-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406

--- Comment #4 from Hongtao Liu  ---

> 
> and for _2 = VIEW_CONVERT_EXPR(_1); we explicitly
> clear the upper bits due to PR113576, and then we get 1 hit the abort.
It's not VIEW_CONVERT_EXPR that clears the upper bits, but _1 = { -1 };

[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c

2024-06-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406

--- Comment #3 from Hongtao Liu  ---
typedef __attribute__((__vector_size__ (1))) char V;

char
foo (V v)
{
  return ((V) v == v)[0];
}

int
main ()
{
  char x = foo ((V) { });
  if (x != -1)
    __builtin_abort ();
}

w/ vcond_mask_qiqi, it's not lowered by veclower, and we get

char foo (V v)
{
  vector(1) signed char D.5142;
  char D.5141;
  vector(1)  _1;
  vector(1) signed char _2;
  char _5;

   :
  _1 = { -1 };
  _2 = VEC_COND_EXPR <_1, { -1 }, { 0 }>;
  D.5142 = _2;
  _5 = VIEW_CONVERT_EXPR(D.5142);

   :
:
  return _5;
}

But it's further simplified to 

char foo (V v)
{
  vector(1) signed char D.3765;
  char D.3764;
  vector(1)  _1;
  vector(1) signed char _2;
  char _5;

   :
  _1 = { -1 };
  _2 = VIEW_CONVERT_EXPR(_1);
  D.3765 = _2;
  _5 = VIEW_CONVERT_EXPR(D.3765);

   :
:
  return _5;

}

by isel

and for _2 = VIEW_CONVERT_EXPR(_1); we explicitly clear
the upper bits due to PR113576, so we get 1 and hit the abort.

It sounds to me that
  _1 = { -1 };
  _2 = VEC_COND_EXPR <_1, { -1 }, { 0 }>;
shouldn't be simplified to 
_2 = VIEW_CONVERT_EXPR(_1);

when nunits is less than mode precision since the upper bit will be cleared.
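
A minimal sketch of the arithmetic only, not of GCC's implementation: assuming
the mask lives in an 8-bit mode and that "clearing the upper bits" means keeping
only the low nunits bits, an all-ones mask with nunits == 1 reads back as 1
rather than -1. The helper names clear_upper_bits and as_signed_char are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep only the low NUNITS bits of an 8-bit mask,
   modelling "clear the upper bits" for a mask that has fewer lanes
   than its mode has bits.  */
uint8_t
clear_upper_bits (uint8_t mask, unsigned nunits)
{
  if (nunits >= 8)
    return mask;
  return (uint8_t) (mask & ((1u << nunits) - 1u));
}

/* Reinterpret the same 8 bits as a signed char, as a view-convert
   to char would.  */
signed char
as_signed_char (uint8_t bits)
{
  signed char result;

  assert (sizeof result == sizeof bits);
  memcpy (&result, &bits, sizeof result);
  return result;
}
```

With these helpers, as_signed_char (clear_upper_bits (0xFF, 1)) is 1, while
as_signed_char (0xFF) is -1, which is the mismatch the testcase trips over.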

[Bug target/115406] [15 Regression] wrong code with vector compare at -O0 with -mavx512f since r15-920-gb6c6d5abf0d31c

2024-06-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115406

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #2 from Hongtao Liu  ---
I'll take a look.

[Bug rtl-optimization/115384] [15 Regression] ICE: RTL check: expected code 'const_int', have 'const_wide_int' in simplify_binary_operation_1, at simplify-rtx.cc:4088 since r15-1047-g7876cde25cbd2f

2024-06-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115384

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #3 from Hongtao Liu  ---
Mine.

[Bug testsuite/115334] new test case gcc.dg/vect/pr112325.c from r15-919-gef27b91b62c3aa fails

2024-06-06 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115334

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Hongtao Liu  ---
Should be fixed by r15-1088-gb24f2954dbc13d

[Bug testsuite/115365] New test case gcc.dg/pr100927.c from r15-1022-gb05288d1f1e4b6 fails

2024-06-06 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115365

--- Comment #5 from Hongtao Liu  ---
(In reply to Rainer Orth from comment #4)
> Unfortunately, the fix broke 32-bit Solaris/SPARC in exchange:
> 
> FAIL: gcc.dg/pr100927.c scan-rtl-dump-times final "(?n)(fix:SI" 3
> 
/* { dg-final { scan-rtl-dump-times {(?n)^[ \t]*\(fix:SI} 3 "final" } }  */
The new fix checks that there are only spaces or tabs before (fix:SI by using
"^[ \t]*". Does Solaris treat ^ as the start of a line?

I tried grep "^[ \t]*(fix:SI" your.dump and got:

(fix:SI (fix:SF (reg:SF 40 %f8 [111]
"/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":13:1 213
{fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [111]
"/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":22:1 213
{fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [111]
"/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":31:1 213
{fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [112]
"/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":12:10 213
{fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [112]
"/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":21:10 213
{fix_truncsfsi2}
(fix:SI (fix:SF (reg:SF 40 %f8 [112]
"/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/pr100927.c":30:10 213
{fix_truncsfsi2}

And it works on my x86-pc-linux-gnu machine.
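
The anchoring idea can be tried outside DejaGnu. The sketch below swaps in POSIX
regex.h for the Tcl regexp the testsuite actually uses, so it only illustrates
the counting logic and says nothing about the Solaris behaviour. The function
count_matching_lines and the sample_dump text are made up for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Count the lines of TEXT that match PATTERN (POSIX extended regex).
   Each line is matched on its own, so '^' anchors at the start of that
   line.  Returns -1 if the pattern does not compile.  Lines longer than
   the local buffer are truncated, which is good enough for a sketch.  */
int
count_matching_lines (const char *text, const char *pattern)
{
  regex_t re;
  int count = 0;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  while (*text != '\0')
    {
      char line[512];
      size_t len = strcspn (text, "\n");
      size_t copy = len < sizeof line - 1 ? len : sizeof line - 1;

      memcpy (line, text, copy);
      line[copy] = '\0';
      if (regexec (&re, line, 0, NULL, 0) == 0)
        count++;

      text += len;
      if (*text == '\n')
        text++;
    }

  regfree (&re);
  return count;
}

/* Made-up dump fragment in the style of the lines quoted above.  */
const char sample_dump[] =
  "        (fix:SI (reg:SF 32 0 [120])))\n"
  "     (expr_list:REG_EQUIV (fix:SI (const_double:SF -Inf [-Inf]))\n"
  "\t(fix:SI (reg:SF 32 0 [120])))\n";
```

On this sample, the anchored pattern "^[ \t]*\\(fix:SI" counts only the two
lines that start with (fix:SI after leading whitespace, while the unanchored
"\\(fix:SI" also counts the REG_EQUIV note.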

[Bug target/115370] [15 regression] gcc.target/i386/pr77881.c FAIL

2024-06-06 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115370

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
We can add a target hook, targetm.support_ccmp_p, whose default implementation
can be targetm.gen_ccmp_first != NULL.
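
A toy sketch of that idea, assuming the intent is that ccmp counts as supported
by default exactly when the target provides gen_ccmp_first. The struct
toy_target is a stand-in with only the two members discussed here, not GCC's
real hook vector, and the real gen_ccmp_first takes arguments that are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the target hook vector (hypothetical layout).  */
struct toy_target
{
  void *(*gen_ccmp_first) (void);
  bool (*support_ccmp_p) (void);
};

struct toy_target targetm;

/* Default for the proposed hook: ccmp is supported when the target
   defines gen_ccmp_first.  A target could override this to add its
   own condition.  */
bool
default_support_ccmp_p (void)
{
  return targetm.gen_ccmp_first != NULL;
}

/* Placeholder generator so the hook can be non-NULL in a demo.  */
void *
toy_gen_ccmp_first (void)
{
  return NULL;
}
```

A target that has the generator but only wants ccmp under some ISA flag would
install its own support_ccmp_p instead of the default.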

[Bug rtl-optimization/115369] New: ifcvt failed to condition elimination for__builtin_mul_overflow

2024-06-06 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115369

Bug ID: 115369
   Summary: ifcvt failed to condition elimination
for__builtin_mul_overflow
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: liuhongt at gcc dot gnu.org
  Target Milestone: ---

int
foo (unsigned a, unsigned b, unsigned d, unsigned e, int* p)
{
  unsigned int r;
  int c = __builtin_mul_overflow (a, b, &r);
  d += c;
  return c ? d : e;
}


(jump_insn 14 13 47 2 (set (pc)
(if_then_else (eq (reg:CCO 17 flags)
(const_int 0 [0]))
(label_ref 17)
(pc))) "/app/example.c":5:13 1212 {*jcc}
 (expr_list:REG_DEAD (reg:CCO 17 flags)
(int_list:REG_BR_PROB 536868 (nil)))
 -> 17)
(note 47 14 17 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
  ; pc falls through to BB 5
(code_label 17 47 40 4 3 (nil) [1 uses])
(note 40 17 29 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 29 40 30 4 (parallel [
(set (reg/v:SI 105 [ e ])
(plus:SI (reg/v:SI 104 [ d ])
(const_int 1 [0x1])))
(clobber (reg:CC 17 flags))
]) "/app/example.c":6:7 272 {*addsi_1}
 (expr_list:REG_DEAD (reg/v:SI 104 [ d ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil
(code_label 30 29 31 5 4 (nil) [0 uses])
(note 31 30 36 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
(insn 36 31 37 5 (set (reg/i:SI 0 ax)
(reg/v:SI 105 [ e ])) "/app/example.c":8:1 85 {*movsi_internal}
 (expr_list:REG_DEAD (reg/v:SI 105 [ e ])
(nil)))
(insn 37 36 0 5 (use (reg/i:SI 0 ax)) "/app/example.c":8:1 -1
 (nil))

The ce2 dump looks quite simple, not sure why it failed.

[Bug target/43618] Incorrect sse2_cvtX2Y pattern

2024-06-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43618

Hongtao Liu  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Hongtao Liu  ---
The pattern issue is fixed in GCC13.1 and later.

[Bug target/43618] Incorrect sse2_cvtX2Y pattern

2024-06-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43618

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
The pattern issue is fixed in GCC13.1 and later.

[Bug other/115365] New test case gcc.dg/pr100927.c from r15-1022-gb05288d1f1e4b6 fails

2024-06-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115365

--- Comment #1 from Hongtao Liu  ---
pr100927.c.349r.final:(fix:SI (reg:SF 32 0 [120])))
"../../gcc/intel-innersource/pr115365/gcc/testsuite/gcc.dg/pr100927.c":12:10
428 {*fix_truncsfsi2_p8}
pr100927.c.349r.final: (expr_list:REG_EQUIV (fix:SI (const_double:SF
2.147483648e+9 [0x0.8p+32]))
pr100927.c.349r.final:(fix:SI (reg:SF 32 0 [120])))
"../../gcc/intel-innersource/pr115365/gcc/testsuite/gcc.dg/pr100927.c":21:10
428 {*fix_truncsfsi2_p8}
pr100927.c.349r.final: (expr_list:REG_EQUIV (fix:SI (const_double:SF -Inf
[-Inf]))
pr100927.c.349r.final:(fix:SI (reg:SF 32 0 [120])))
"../../gcc/intel-innersource/pr115365/gcc/testsuite/gcc.dg/pr100927.c":30:10
428 {*fix_truncsfsi2_p8}

There are 5 fix:SI in the final dump.

[Bug target/114428] [x86] psrad xmm, xmm, 16 and pand xmm, const_vector (0xffff x4) can be optimized to psrld

2024-06-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114428

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Hongtao Liu  ---
Fixed in GCC15.

[Bug rtl-optimization/115351] [14/15 regression] pointless movs when passing by value on x86-64

2024-06-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115351

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #2 from Hongtao Liu  ---
There are

(insn 5 4 6 2 (set (reg:TI 110)
(ior:TI (and:TI (reg:TI 110)
(const_wide_int 0x))
(zero_extend:TI (subreg:DI (reg:DF 111) 0
"/app/example.cpp":8:1 136 {*insvti_lowpart_1}
 (nil))
(insn 6 5 7 2 (set (reg:TI 110)
(ior:TI (and:TI (reg:TI 110)
(const_wide_int 0x0))
(ashift:TI (zero_extend:TI (subreg:DI (reg:DF 112) 0))
(const_int 64 [0x40] "/app/example.cpp":8:1 133
{*insvti_highpart_1}
 (nil))
(insn 7 6 8 2 (set (reg/v:TI 109 [ z ])

in GCC14's RTL dump; I guess it's related to r14-589-g1e3054d27c83ee?

[Bug target/115341] [15 regression] gcc.target/i386/apx-ndd-2.c etc. FAIL

2024-06-04 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115341

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #1 from Hongtao Liu  ---
I think it's because binutils 2.42 only has initial support for Intel APX: 32
GPRs, NDD, PUSH2/POP2 and PUSHP/POPP. APX NF is only on the latest binutils trunk.

And target apxf only checks the initial support, so I guess we need to add a
separate target to check for the remaining APXF features (NF, CCMP/CTEST/CFCMOV).

[Bug other/115334] new test case gcc.dg/vect/pr112325.c from r15-919-gef27b91b62c3aa fails

2024-06-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115334

--- Comment #2 from Hongtao Liu  ---
diff --git a/gcc/testsuite/gcc.dg/vect/pr112325.c b/gcc/testsuite/gcc.dg/vect/pr112325.c
index dea6cca3b86..143903beab2 100644
--- a/gcc/testsuite/gcc.dg/vect/pr112325.c
+++ b/gcc/testsuite/gcc.dg/vect/pr112325.c
@@ -3,6 +3,7 @@
 /* { dg-require-effective-target vect_int } */
 /* { dg-require-effective-target vect_shift } */
 /* { dg-additional-options "-mavx2" { target x86_64-*-* i?86-*-* } } */
+/* { dg-additional-options "--param max-completely-peeled-insns=200" { target powerpc64*-*-* } } */

 typedef unsigned short ggml_fp16_t;
 static float table_f32_f16[1 << 16];

Does this patch work for you?

[Bug other/115334] new test case gcc.dg/vect/pr112325.c from r15-919-gef27b91b62c3aa fails

2024-06-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115334

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #1 from Hongtao Liu  ---
The Power backend sets param_max_completely_peeled_insns to 400, so the inner
loop is still completely unrolled.
So the testcase needs an extra option for the Power backend: --param
max-completely-peeled-insns=200.

[Bug target/115299] [14/15 regression] pr86722.c failed to eliminate branch.

2024-06-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115299

Hongtao Liu  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #4 from Hongtao Liu  ---
Fixed in GCC15

[Bug target/113609] EQ/NE comparison between avx512 kmask and -1 can be optimized with kxortest with checking CF.

2024-06-03 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113609

Hongtao Liu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Hongtao Liu  ---
Fixed in GCC15

[Bug target/115299] [14/15 regression] pr86722.c failed to eliminate branch.

2024-05-30 Thread liuhongt at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115299

--- Comment #2 from Hongtao Liu  ---
> Maybe r14-53-g675b1a7f113adb .

Probably; the current cost model may need adjustment.
