[Bug target/112962] [14 Regression] ICE: SIGSEGV in operator() (recog.h:431) with -fexceptions -mssse3 and __builtin_ia32_pabsd128()

2023-12-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112962

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|ubizjak at gmail dot com   |unassigned at gcc dot gnu.org
 Status|ASSIGNED|NEW

--- Comment #10 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #7)

> but with -fexceptions (and probably because we incorrectly don't mark the
> builtins nothrow?) this doesn't happen.

Maybe we should finally fix the above nothrow issue?

[Bug target/112962] [14 Regression] ICE: SIGSEGV in operator() (recog.h:431) with -fexceptions -mssse3 and __builtin_ia32_pabsd128()

2023-12-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112962

--- Comment #9 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #8)
> Of course, yet another option is:

This goes out of my (limited) area of expertise, so if my proposed (trivial)
patch is papering over some other issue, I'll happily leave the solution to
you.

[Bug target/112962] [14 Regression] ICE: SIGSEGV in operator() (recog.h:431) with -fexceptions -mssse3 and __builtin_ia32_pabsd128()

2023-12-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112962

--- Comment #6 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #3)
> I was thinking whether it wouldn't be better to expand x86 const or pure
> builtins when lhs is ignored to nothing in the expanders.

Something like this?

--cut here--
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index a53d69d5400..0f3d6108d77 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -13032,6 +13032,9 @@ ix86_expand_builtin (tree exp, rtx target, rtx subtarget,
   unsigned int fcode = DECL_MD_FUNCTION_CODE (fndecl);
   HOST_WIDE_INT bisa, bisa2;

+  if (ignore && (TREE_READONLY (fndecl) || DECL_PURE_P (fndecl)))
+    return const0_rtx;
+
   /* For CPU builtins that can be folded, fold first and expand the fold.  */
   switch (fcode)
 {
@@ -14401,9 +14404,6 @@ rdseed_step:
   return target;

 case IX86_BUILTIN_READ_FLAGS:
-  if (ignore)
-   return const0_rtx;
-
   emit_insn (gen_pushfl ());

   if (optimize
--cut here--
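
The check added by the patch can be modeled outside the compiler. This is a minimal sketch, assuming a hypothetical builtin_info record that stands in for the TREE_READONLY / DECL_PURE_P flags on fndecl; neither the struct nor the function exists in GCC:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the flags GCC keeps on a builtin's fndecl.  */
struct builtin_info
{
  bool is_const;   /* models TREE_READONLY (fndecl) */
  bool is_pure;    /* models DECL_PURE_P (fndecl) */
};

/* Return true when a call can be expanded to nothing: the result is
   ignored and the builtin is const or pure, so it has no side effects.  */
static bool
expands_to_nothing (const struct builtin_info *info, bool ignore)
{
  return ignore && (info->is_const || info->is_pure);
}
```

With this predicate, a const builtin whose lhs is unused never reaches the insn expander, so no dead vector insn is emitted in the first place.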

[Bug target/112962] [14 Regression] ICE: SIGSEGV in operator() (recog.h:431) with -fexceptions -mssse3 and __builtin_ia32_pabsd128()

2023-12-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112962

--- Comment #4 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #3)
> I was thinking whether it wouldn't be better to expand x86 const or pure
> builtins when lhs is ignored to nothing in the expanders.

Yes, this could be a better solution.

[Bug target/112962] [14 Regression] ICE: SIGSEGV in operator() (recog.h:431) with -fexceptions -mssse3 and __builtin_ia32_pabsd128()

2023-12-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112962

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
   Last reconfirmed||2023-12-12
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

--- Comment #1 from Uroš Bizjak  ---
Created attachment 56862
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56862&action=edit
Proposed patch

Patch in testing.

[Bug rtl-optimization/112760] [14 Regression] wrong code with -O2 -fno-dce -fno-guess-branch-probability -m8bit-idiv -mavx --param=max-cse-insns=0 and __builtin_add_overflow_p()

2023-11-29 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112760

Uroš Bizjak  changed:

   What|Removed |Added

  Component|target  |rtl-optimization
   Last reconfirmed||2023-11-29
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Target Milestone|--- |14.0

--- Comment #2 from Uroš Bizjak  ---
With the original testcase, ce1 pass is if-converting:

   20: flags:CCZ=cmp(r110:SI,r111:SI)
  REG_DEAD r111:SI
  REG_DEAD r110:SI
   21: pc={(flags:CCZ==0)?L23:pc}
  REG_DEAD flags:CCZ
   39: NOTE_INSN_BASIC_BLOCK 5
   22: r103:HI=0x1
   23: L23:

with:

IF-THEN-JOIN block found, pass 2, test 2, then 5, join 6
scanning new insn with uid = 45.
scanning new insn with uid = 44.
scanning new insn with uid = 46.
if-conversion succeeded through noce_try_cmove
Removing jump 21.
deleting insn with uid = 21.
deleting insn with uid = 22.

to:

   20: flags:CCZ=cmp(r110:SI,r111:SI)
  REG_DEAD r111:SI
  REG_DEAD r110:SI
   45: r118:HI=0x1
   44: flags:CCZ=cmp(r110:SI,r111:SI)
   46: r103:HI={(flags:CCZ==0)?r103:HI:r118:HI}

And things go downhill from here. Before postreload we have:

   20: flags:CCZ=cmp(ax:SI,dx:SI)
  REG_UNUSED flags:CCZ
   44: flags:CCZ=cmp(ax:SI,dx:SI)
  REG_DEAD dx:SI
  REG_DEAD ax:SI
   62: ax:HI=0x1
  REG_EQUIV 0x1
   46: bx:HI={(flags:CCZ==0)?bx:HI:ax:HI}
  REG_DEAD flags:CCZ
  REG_DEAD ax:HI

and in the postreload pass, (insn 44) is removed:

   20: flags:CCZ=cmp(ax:SI,dx:SI)
  REG_UNUSED flags:CCZ
   62: ax:HI=0x1
  REG_EQUIV 0x1
   46: bx:HI={(flags:CCZ==0)?bx:HI:ax:HI}
  REG_DEAD flags:CCZ
  REG_DEAD ax:HI

Then the pro_and_epilogue pass detects the "unused" (insn 20) and removes
it:

df_analyze called
deleting insn with uid = 20.

Confirmed as RTL optimization problem.
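
In C terms, the if-conversion above looks roughly like this. It is a sketch only: the function names are made up, and the variables just mirror the pseudo register numbers from the dump.

```c
/* Branchy form (insns 20-22): skip the store when the compare is equal.  */
static short
with_branch (int r110, int r111, short r103)
{
  if (r110 != r111)
    r103 = 1;
  return r103;
}

/* If-converted form (insns 45, 44, 46): conditional move on the flags.  */
static short
with_cmove (int r110, int r111, short r103)
{
  short r118 = 1;                        /* insn 45 */
  return (r110 == r111) ? r103 : r118;   /* insns 44 + 46 */
}
```

Both forms compute the same value, but the conditional move in insn 46 still reads the flags. Once both compares (insns 44 and 20) are deleted, it consumes stale flags, which is the wrong code seen here.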

[Bug middle-end/112560] [14 Regression] ICE in try_combine on pr112494.c

2023-11-29 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112560

Uroš Bizjak  changed:

   What|Removed |Added

   Keywords||patch

--- Comment #4 from Uroš Bizjak  ---
Patch at [1].

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638589.html

[Bug middle-end/112560] [14 Regression] ICE in try_combine on pr112494.c

2023-11-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112560

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #3 from Uroš Bizjak  ---
Created attachment 56705
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56705&action=edit
Proposed patch

The code assumes that cc_use_loc represents a comparison operator. Skip the
modification of CC-using operation if this is not the case.

[Bug target/112494] ICE in ix86_cc_mode, at config/i386/i386.cc:16477

2023-11-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112494

Uroš Bizjak  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
   Target Milestone|--- |14.0
 Resolution|--- |FIXED

--- Comment #10 from Uroš Bizjak  ---
Fixed for 14.0.

[Bug middle-end/112560] [14 Regression] ICE in try_combine on pr112494.c

2023-11-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112560
Bug 112560 depends on bug 112494, which changed state.

Bug 112494 Summary: ICE in ix86_cc_mode, at config/i386/i386.cc:16477
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112494

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/112686] [14 Regression] ICE: in gen_reg_rtx, at emit-rtl.cc:1176 with -fsplit-stack -mcmodel=large

2023-11-24 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112686

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED
 CC|uros at gcc dot gnu.org|

--- Comment #5 from Uroš Bizjak  ---
Fixed.

[Bug target/112672] [14 Regression] wrong code with __builtin_parityl() at -O and above on x86_64

2023-11-24 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112672

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|14.0|11.5

--- Comment #9 from Uroš Bizjak  ---
Fixed everywhere.

[Bug target/112686] [14 Regression] ICE: in gen_reg_rtx, at emit-rtl.cc:1176 with -fsplit-stack -mcmodel=large

2023-11-24 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112686

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #3 from Uroš Bizjak  ---
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 7b922857d80..50e8826dbe5 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -10503,7 +10503,7 @@ ix86_expand_split_stack_prologue (void)
  fn = copy_to_suggested_reg (x, reg11, Pmode);
}
  else
-   fn = split_stack_fn_large;
+   fn = copy_to_suggested_reg (split_stack_fn_large, reg11, Pmode);

  /* When using the large model we need to load the address
 into a register, and we've run out of registers.  So we

[Bug target/89316] ICE with -mforce-indirect-call and -fsplit-stack

2023-11-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89316

Uroš Bizjak  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
   Target Milestone|--- |14.0
 Resolution|--- |FIXED

--- Comment #16 from Uroš Bizjak  ---
Fixed for gcc-14.

[Bug target/112672] [14 Regression] wrong code with __builtin_parityl() at -O and above on x86_64

2023-11-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112672

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #4 from Uroš Bizjak  ---
(In reply to Andrew Pinski from comment #3)
> parityhi2 should have:
> rtx extra = gen_reg_rtx (HImode);
> emit_move_insn (extra, operands[1]);
> emit_insn (gen_parityhi2_cmp (extra));
> 
> Or something similar because parityqi2_cmp clobbers its argument.

Exactly.

I have a patch in testing.
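
The problem is analogous to a C helper that destroys its argument. This is a rough sketch with hypothetical names; the real fix is in the expander, along the lines quoted above.

```c
#include <stdint.h>

/* Hypothetical model of a pattern that clobbers its operand: the value
   is folded in place, so the caller's copy is destroyed.  */
static int
parity_clobber (uint16_t *x)
{
  *x ^= *x >> 8;
  *x ^= *x >> 4;
  *x ^= *x >> 2;
  *x ^= *x >> 1;
  return *x & 1;
}

/* Safe wrapper: copy to a temporary first, like the gen_reg_rtx plus
   emit_move_insn pair suggested in the quoted comment.  */
static int
parity_safe (const uint16_t *x)
{
  uint16_t extra = *x;
  return parity_clobber (&extra);
}
```

Calling parity_clobber directly on a value that is still live afterwards corresponds to the wrong code in this PR.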

[Bug target/112445] [14 Regression] ICE: in lra_split_hard_reg_for, at lra-assigns.cc:1861 unable to find a register to spill: {*umulditi3_1} with -O -march=cascadelake -fwrapv since r14-4968-g89e5d90

2023-11-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112445

--- Comment #6 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #4)
> I think this goes wrong during combine.
Combine does not / should not combine moves from hard registers, because doing
so extends the register live range. It looks like this should also include
zero-extracts and other "pseudo-move" instructions.

The relevant patch and discussion is at [1].

[1] https://gcc.gnu.org/legacy-ml/gcc-patches/2018-10/msg01356.html

[Bug rtl-optimization/112657] [13/14 Regression] missed optimization: cmove not used with multiple returns

2023-11-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112657

--- Comment #6 from Uroš Bizjak  ---
This is by design, CMOV should not be used instead of well predicted jumps.

FYI, CMOV is quite problematic on x86, there are several PRs where conversion
to CMOV resulted in 2x slower execution. Please see e.g.:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309#c26

[Bug rtl-optimization/112657] [13/14 Regression] missed optimization: cmove not used with multiple returns

2023-11-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112657

--- Comment #5 from Uroš Bizjak  ---
Digging a bit further:

if_info.max_seq_cost is calculated via targetm.max_noce_ifcvt_seq_cost, where
without params set we return:

  return BRANCH_COST (true, predictable_p) * COSTS_N_INSNS (2);

with:

#define BRANCH_COST(speed_p, predictable_p) \
  (!(speed_p) ? 2 : (predictable_p) ? 0 : ix86_branch_cost)

So, the conversion is clearly not desirable for well predicted jumps.
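
The arithmetic can be checked in isolation. This is a minimal sketch: COSTS_N_INSNS is as defined in rtl.h, but the ix86_branch_cost value of 3 is only an assumption for illustration (the real value comes from the active tuning table), and the function name is a stand-in for the target hook.

```c
#include <stdbool.h>

/* COSTS_N_INSNS as defined in GCC's rtl.h.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Assumed value, for illustration only.  */
static const int ix86_branch_cost = 3;

#define BRANCH_COST(speed_p, predictable_p) \
  (!(speed_p) ? 2 : (predictable_p) ? 0 : ix86_branch_cost)

/* Models the default return value quoted above.  */
static int
max_noce_ifcvt_seq_cost (bool predictable_p)
{
  return BRANCH_COST (true, predictable_p) * COSTS_N_INSNS (2);
}
```

For a predictable branch the result is 0, so any conversion sequence with nonzero cost is rejected.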

[Bug rtl-optimization/112657] [13/14 Regression] missed optimization: cmove not used with multiple returns

2023-11-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112657

--- Comment #4 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #3)
> (In reply to Andrew Pinski from comment #2)
> 
> > Someone will have to debug ifcvt.cc to see why it fails on x86_64 but works
> > on aarch64.  Note there are some new changes to ifcvt.cc in review which
> > might improve this, though I am not sure.
> 
> x86_64 targetm.noce_conversion_profitable_p returns false for:

Actually, the cost function goes to default_noce_conversion_profitable_p,
where:

(gdb) p cost
$1 = 16
(gdb) p if_info->original_cost 
$2 = 8
(gdb) p if_info->max_seq_cost 
$3 = 0

For some reason, max_seq_cost remains zero, while on aarch64:

(gdb) p cost
$2 = 12
(gdb) p if_info->original_cost
$3 = 8
(gdb) p if_info->max_seq_cost
$4 = 12

So, x86_64 returns false from the default cost function:

  /* When compiling for size, we can make a reasonably accurately guess
 at the size growth.  When compiling for speed, use the maximum.  */
  return speed_p && cost <= if_info->max_seq_cost;
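
Plugging the gdb values into that return statement shows the difference. This sketch models the speed path only, with a made-up function name:

```c
#include <stdbool.h>

/* Mirrors the return statement quoted above; nothing else from
   default_noce_conversion_profitable_p is modeled.  */
static bool
conversion_profitable_p (bool speed_p, unsigned cost, unsigned max_seq_cost)
{
  return speed_p && cost <= max_seq_cost;
}
```

x86_64 has cost 16 against max_seq_cost 0 and is rejected; aarch64 has cost 12 against max_seq_cost 12 and is accepted.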

[Bug rtl-optimization/112657] [13/14 Regression] missed optimization: cmove not used with multiple returns

2023-11-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112657

--- Comment #3 from Uroš Bizjak  ---
(In reply to Andrew Pinski from comment #2)

> Someone will have to debug ifcvt.cc to see why it fails on x86_64 but works
> on aarch64.  Note there are some new changes to ifcvt.cc in review which
> might improve this, though I am not sure.

x86_64 targetm.noce_conversion_profitable_p returns false for:

(insn 20 0 19 (set (reg:SI 101)
(const_int -9 [0xfffffffffffffff7])) 85 {*movsi_internal}
 (nil))

(insn 19 20 21 (set (reg:CCZ 17 flags)
(compare:CCZ (reg/v:SI 99 [ c ])
(const_int 14 [0xe]))) 11 {*cmpsi_1}
 (nil))

(insn 21 19 0 (set (reg/v:SI 99 [ c ])
(if_then_else:SI (ne (reg:CCZ 17 flags)
(const_int 0 [0]))
(reg/v:SI 99 [ c ])
(reg:SI 101))) 1438 {*movsicc_noc}
 (nil))

[Bug target/89316] ICE with -mforce-indirect-call and -fsplit-stack

2023-11-20 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89316

Uroš Bizjak  changed:

   What|Removed |Added

   Keywords||patch

--- Comment #14 from Uroš Bizjak  ---
Patch at [1].

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637478.html

[Bug target/89316] ICE with -mforce-indirect-call and -fsplit-stack

2023-11-19 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89316

Uroš Bizjak  changed:

   What|Removed |Added

  Attachment #56637 is obsolete|0   |1

--- Comment #13 from Uroš Bizjak  ---
Created attachment 56647
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56647&action=edit
Proposed patch v2

New version, also fixes "-fsplit-stack -fpic -mforce-indirect-call" on 32-bit
targets.

[Bug target/89316] ICE with -mforce-indirect-call and -fsplit-stack

2023-11-18 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89316

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #12 from Uroš Bizjak  ---
Created attachment 56637
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56637&action=edit
Proposed patch

Patch that implements ideas from Comment 7 and Comment 8.

[Bug target/111657] Memory copy with structure assignment from named address space should be improved

2023-11-17 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

--- Comment #9 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #8)
> I'd say it is a user error to invoke memcpy/memset etc. with pointers to
> non-default address spaces, and for aggregate copies the middle-end should
> ensure that the copying is not done using library calls; is that the case
> and the problem was just that optab expansion was allowed for the structure
> copies and the backend decided to use libcall in that case?

Yes, the stringop selection mechanism chose the libcall strategy. However, a
call to memcpy is unavailable for a non-default address space, so the middle-end
expanded the copy into the most trivial byte-copy loop. The patch just teaches
stringop selection to use an optimized copy loop as a last resort with
non-default address spaces instead.
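
The difference between the two inline strategies can be sketched in plain C. This is an illustration in the default address space only (a named address space cannot be shown portably), with made-up function names; the fixed-size memcpy calls are used purely as a portable unaligned 8-byte load/store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Trivial fallback: one byte per iteration.  */
static void
copy_bytewise (unsigned char *dst, const unsigned char *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i];
}

/* Word-sized loop with a byte tail.  */
static void
copy_wordwise (unsigned char *dst, const unsigned char *src, size_t n)
{
  size_t i = 0;
  for (; i + 8 <= n; i += 8)
    {
      uint64_t w;
      memcpy (&w, src + i, sizeof w);   /* 8-byte load */
      memcpy (dst + i, &w, sizeof w);   /* 8-byte store */
    }
  for (; i < n; i++)                    /* remaining 0..7 bytes */
    dst[i] = src[i];
}
```

Both produce the same result; the second one needs roughly an eighth of the loop iterations for large structures.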

[Bug middle-end/112581] [14 Regression] wrong code at -O2 and -O3 on x86_64-linux-gnu (generated code hangs)

2023-11-17 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112581

--- Comment #3 from Uroš Bizjak  ---
(In reply to Andrew Pinski from comment #1)
> It might be one of the x86 specific target patches ...

I don't think so, these patches deal specifically with high registers, and:

$ grep %.h pr112581.s

finds none.

[Bug target/112567] [14 regression] ICE in RTL pass: split2: Segmentation fault

2023-11-16 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112567

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Uroš Bizjak  ---
Fixed.

[Bug target/112567] [14 regression] ICE in RTL pass: split2: Segmentation fault

2023-11-16 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112567

Uroš Bizjak  changed:

   What|Removed |Added

   Last reconfirmed||2023-11-16
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #1 from Uroš Bizjak  ---
Mine, due to [1], this time I managed to split to invalid RTX...

I have a patch.

[1] https://gcc.gnu.org/pipermail/gcc-cvs/2023-November/393104.html

[Bug target/112540] [14 regression] ICE in extract_insn, at recog.cc:2804 since r14-5456-gb42a09b258c3ed

2023-11-15 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112540

Uroš Bizjak  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Uroš Bizjak  ---
Fixed.

[Bug target/112540] [14 regression] ICE in extract_insn, at recog.cc:2804 since r14-5456-gb42a09b258c3ed

2023-11-15 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112540

Uroš Bizjak  changed:

   What|Removed |Added

   Last reconfirmed||2023-11-15
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
   Target Milestone|--- |14.0
 Ever confirmed|0   |1
   Host||x86
 Status|UNCONFIRMED |ASSIGNED

--- Comment #4 from Uroš Bizjak  ---
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636593.html

[Bug target/112494] ICE in ix86_cc_mode, at config/i386/i386.cc:16477

2023-11-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112494

--- Comment #7 from Uroš Bizjak  ---
It looks to me that gcc_unreachable is problematic in SELECT_CC_MODE. We should
simply return CCmode for all unrecognised RTX:

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 2c80fd8ebf3..5b87361e2e1 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -16469,12 +16504,9 @@ ix86_cc_mode (enum rtx_code code, rtx op0, rtx op1)
return CCNOmode;
   else
return CCGCmode;
-  /* strcmp pattern do (use flags) and combine may ask us for proper
-mode.  */
-case USE:
-  return CCmode;
 default:
-  gcc_unreachable ();
+  /* CCmode should be used in all other cases.  */
+  return CCmode;
 }
 }


Using the above patch, we can also define cmpstrnqi_1 to what it really does:

@@ -22954,9 +22958,8 @@ (define_expand "cmpstrnqi_1"
 (const_int 0))
  (compare:CC (match_operand 4 "memory_operand")
  (match_operand 5 "memory_operand"))
- (const_int 0)))
+ (reg:CC FLAGS_REG)))
  (use (match_operand:SI 3 "immediate_operand"))
- (use (reg:CC FLAGS_REG))
  (clobber (match_operand 0 "register_operand"))
  (clobber (match_operand 1 "register_operand"))
  (clobber (match_dup 2))])]

[Bug target/112494] ICE in ix86_cc_mode, at config/i386/i386.cc:16477

2023-11-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112494

--- Comment #6 from Uroš Bizjak  ---
Now we have:

#1  0x0286a3aa in try_combine (i3=0x7fffe3c18100, i2=0x7fffe3c18000,
i1=0x0, i0=0x0, new_direct_jump_p=0x7fffd8eb, 
last_combined_insn=0x7fffe3c18100) at ../../git/gcc/gcc/combine.cc:3207
3207          = SELECT_CC_MODE (compare_code, op0, op1);
(gdb) p compare_code
$1 = UNSPEC

compare_code = UNSPEC won't fly...

[Bug target/112494] ICE in ix86_cc_mode, at config/i386/i386.cc:16477

2023-11-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112494

--- Comment #5 from Uroš Bizjak  ---
Created attachment 56567
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56567&action=edit
Proposed patch

Nope, even with the above patch the compiler ICEs at the same place:

0x1956968 ix86_cc_mode(rtx_code, rtx_def*, rtx_def*)
../../git/gcc/gcc/config/i386/i386.cc:16508
0x286a3a9 try_combine
../../git/gcc/gcc/combine.cc:3207
0x2864cbf combine_instructions
../../git/gcc/gcc/combine.cc:1264


Trying 5 -> 8:
5: r98:DI=0xd7
8: flags:CCZ=cmp(r98:DI,0)
  REG_EQUAL cmp(0xd7,0)


(insn 5 2 6 2 (set (reg/v:DI 98 [ flags ])
(const_int 215 [0xd7])) "pr112494.c":10:15 84 {*movdi_internal}
 (nil))
(insn 6 5 7 2 (set (mem:DI (pre_dec:DI (reg/f:DI 7 sp)) [0  S8 A8])
(const_int 215 [0xd7]))
"/hdd/uros/gcc-build-fast/gcc/include/ia32intrin.h":270:3 58 {*pushdi2_rex64}
 (nil))
(insn 7 6 8 2 (set (reg:CC 17 flags)
(unspec:CC [
(mem:DI (post_inc:DI (reg/f:DI 7 sp)) [0  S8 A8])
] UNSPEC_SET_FLAGS))
"/hdd/uros/gcc-build-fast/gcc/include/ia32intrin.h":270:3 72 {*popfldi1}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
(insn 8 7 11 2 (set (reg:CCZ 17 flags)
(compare:CCZ (reg/v:DI 98 [ flags ])
(const_int 0 [0]))) "pr112494.c":12:9 8 {*cmpdi_ccno_1}
 (expr_list:REG_EQUAL (compare:CCZ (const_int 215 [0xd7])
(const_int 0 [0]))
(nil)))
(insn 11 8 12 2 (set (mem:DI (pre_dec:DI (reg/f:DI 7 sp)) [0  S8 A8])
(unspec:DI [
(reg:CC 17 flags)
] UNSPEC_GET_FLAGS))
"/hdd/uros/gcc-build-fast/gcc/include/ia32intrin.h":262:10 70 {*pushfldi2}
 (expr_list:REG_DEAD (reg:CC 17 flags)
(nil)))

There is nothing suspicious in target code anymore (IMO, the above patch should
be applied nevertheless, the register modes are now fully correct)

[Bug target/112494] ICE in ix86_cc_mode, at config/i386/i386.cc:16477

2023-11-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112494

Uroš Bizjak  changed:

   What|Removed |Added

  Component|rtl-optimization|target
 Status|NEW |ASSIGNED

[Bug rtl-optimization/112494] ICE in ix86_cc_mode, at config/i386/i386.cc:16477

2023-11-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112494

--- Comment #4 from Uroš Bizjak  ---
(In reply to Andrew Pinski from comment #3)
> I almost want to say this is a bug in the x86 back-end where it pushes the
> flags onto the stack.

Yes, could be - let me look into this a bit more.

[Bug rtl-optimization/112494] GCC: 14: internal compiler error: in ix86_cc_mode, at config/i386/i386.cc:16477

2023-11-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112494

Uroš Bizjak  changed:

   What|Removed |Added

   Last reconfirmed||2023-11-12
 Ever confirmed|0   |1
  Component|target  |rtl-optimization
 Status|UNCONFIRMED |NEW

--- Comment #1 from Uroš Bizjak  ---
Combine pass is trying to combine:

Trying 5 -> 8:
5: r98:DI=0xd7
8: flags:CCZ=cmp(r98:DI,0)
  REG_EQUAL cmp(0xd7,0)

where:

(insn 5 2 6 2 (set (reg/v:DI 98 [ flags ])
(const_int 215 [0xd7])) "pr112494.c":10:26 84 {*movdi_internal}
 (nil))

(insn 8 7 11 2 (set (reg:CCZ 17 flags)
(compare:CCZ (reg/v:DI 98 [ flags ])
(const_int 0 [0]))) "pr112494.c":12:9 8 {*cmpdi_ccno_1}
 (expr_list:REG_EQUAL (compare:CCZ (const_int 215 [0xd7])
(const_int 0 [0]))
(nil)))


and calls ix86_cc_mode with:

Breakpoint 1, ix86_cc_mode (code=code@entry=SET, op0=0x7fffe3e37680,
op1=0x7fffea209490)

code = SET will trigger gcc_unreachable() at the end of the ix86_cc_mode
function.

Confirmed as a generic RTL optimization problem.

[Bug target/110790] [14 Regression] gcc -m32 generates invalid bit test code on gmp-6.2.1

2023-11-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110790

--- Comment #9 from Uroš Bizjak  ---
(In reply to Andrew Pinski from comment #8)
> I need some code generation help for gcc.target/i386/pr110790-2.c, I have a
> patch where we now generate:
> ```
> movq    (%rdi,%rax,8), %rax
> shrq    %cl, %rax
> andl    $1, %eax
> ```
> 
> instead of previously:
> ```
> movq    (%rdi,%rax,8), %rax
> btq     %rsi, %rax
> setc    %al
> movzbl  %al, %eax
> ```
> 
> I suspect the sequence that contains shrq/and is better but I am 100% sure.
> We still get btq when used with a conditional too.

The new sequence is better. It does not create a partial register write (setc
needs a clearing XOR in front of the CC-setting instruction).
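
For reference, the kind of source that leads to such sequences is a bit test on an array of words. This is a hypothetical helper in the spirit of the test, not the actual contents of pr110790-2.c:

```c
#include <stdint.h>

/* Return bit number BIT of the bit array WORDS (64 bits per word).  */
static int
test_bit (const uint64_t *words, uint64_t bit)
{
  return (words[bit / 64] >> (bit % 64)) & 1;
}
```

The shift-and-mask form leaves the result in the full register directly, while bt/setc writes only the low byte and then needs the extra movzbl.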

[Bug target/97503] Suboptimal use of cntlzw and cntlzd

2023-11-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97503

--- Comment #7 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #6)
> (In reply to LIU Hao from comment #4)
> > Are there any reasons why this was not done for 64?
> > (https://gcc.godbolt.org/z/7vddPdxaP)
> 
> There is zero-extension from the result of __builtin_clzll that confuses
> optimizers.

Actually, sign-extension, but the result is never sign-extended.
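
To illustrate: __builtin_clzll returns int, so widening it is formally a sign extension, but for nonzero input the value lies in 0..63, so sign and zero extension give the same bits. This is a minimal sketch with a made-up function name (GCC builtin; the result for input 0 is undefined):

```c
/* Widen the int result of __builtin_clzll to long.  */
static long
clz_widened (unsigned long long x)
{
  return __builtin_clzll (x);   /* int -> long conversion */
}
```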

[Bug target/97503] Suboptimal use of cntlzw and cntlzd

2023-11-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97503

--- Comment #6 from Uroš Bizjak  ---
(In reply to LIU Hao from comment #4)
> Are there any reasons why this was not done for 64?
> (https://gcc.godbolt.org/z/7vddPdxaP)

There is zero-extension from the result of __builtin_clzll that confuses
optimizers.

[Bug target/112332] [14 regression] ICE: internal compiler error: in extract_constrain_insn, at recog.cc:2705

2023-11-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112332

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Target||x86
 Status|UNCONFIRMED |RESOLVED
   Target Milestone|--- |14.0

--- Comment #5 from Uroš Bizjak  ---
Fixed.

[Bug target/112332] [14 regression] ICE: internal compiler error: in extract_constrain_insn, at recog.cc:2705

2023-11-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112332

--- Comment #3 from Uroš Bizjak  ---
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 35d073c9a21..75c75f610c2 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -25748,7 +25748,7 @@ (define_peephole2
  (set (match_operand:W 2 "general_reg_operand") (const_int 0))
  (clobber (reg:CC FLAGS_REG))])
(set (match_operand:SWI48 3 "general_reg_operand")
-   (match_operand:SWI48 4 "general_operand"))]
+   (match_operand:SWI48 4 "general_gr_operand"))]
   "peep2_reg_dead_p (0, operands[3])
&& peep2_reg_dead_p (1, operands[2])"
   [(parallel [(set (match_dup 0)

[Bug target/112332] [14 regression] ICE: internal compiler error: in extract_constrain_insn, at recog.cc:2705

2023-11-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112332

--- Comment #2 from Uroš Bizjak  ---
(In reply to Sergei Trofimovich from comment #1)
> Slightly shorter example:
> 
> typedef union {
>   double d;
>   int L[2];
> } U;
> void d2b(int*);
> void _Py_dg_dtoa(double dd) {
>   int be;
>   U u;
>   u.d = dd;
>   if ((&u)->L[1])
> d2b(&be);
> }

Let's put back those extra constraints...

[Bug target/110551] [11/12/13/14 Regression] an extra mov when doing 128bit multiply

2023-11-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110551

--- Comment #7 from Uroš Bizjak  ---
(In reply to CVS Commits from comment #5)
> The master branch has been updated by Roger Sayle :
> 
> https://gcc.gnu.org/g:89e5d902fc55ad375f149f25a84c516ad360a606
> 
> commit r14-4968-g89e5d902fc55ad375f149f25a84c516ad360a606
> Author: Roger Sayle 
> Date:   Fri Oct 27 10:03:53 2023 +0100

Looks like the patch regressed -march=cascadelake.

https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634660.html

[Bug target/111698] Narrow memory access of compare to byte width

2023-10-25 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111698

Uroš Bizjak  changed:

   What|Removed |Added

 Target|x86_64-*-*  |x86-*-*
   Target Milestone|--- |14.0
 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Uroš Bizjak  ---
Implemented for gcc-14.

[Bug target/111698] Narrow memory access of compare to byte width

2023-10-24 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111698

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Ever confirmed|0   |1
   Last reconfirmed||2023-10-24

--- Comment #3 from Uroš Bizjak  ---
Created attachment 56187
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56187&action=edit
Proposed patch

[Bug sanitizer/111736] New: Address sanitizer is not compatible with named address spaces

2023-10-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Bug ID: 111736
   Summary: Address sanitizer is not compatible with named address
spaces
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: sanitizer
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at gcc dot gnu.org
  Target Milestone: ---

From [1], gcc is doing a KASAN check on a percpu address (when percpu access is
implemented using named address spaces). This is not a "real" address, just an
offset from the segment register.

The testcase

--cut here--
int __seg_gs m;

int foo (void)
{
  return m;
}
--cut here--

does not show any special handling that would handle segment registers.

[1]
https://lore.kernel.org/lkml/CAHk-=wi6u-o1wdpoesuce6qo2oapu0hezaig0udou4l5cre...@mail.gmail.com/

[Bug target/111657] Memory copy with structure assignment from named address space should be improved

2023-10-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

Uroš Bizjak  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
   Target Milestone|--- |14.0
 Resolution|--- |FIXED

--- Comment #7 from Uroš Bizjak  ---
Fixed.

[Bug target/111698] Narrow memory access of compare to byte width

2023-10-05 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111698

--- Comment #2 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #1)
> I guess we could do this even on GIMPLE and in general to aligned sub-word
> accesses (where byte accesses are always aligned).
> 
> It might be also a good fit for RTL forwprop or that mem-offset pass in
> development.

I don't think this optimization should be universally enabled. According to
Agner Fog, older x86 cores suffer from a store-forwarding stall when a smaller
read doesn't start at the same address as the preceding store. Intel
Sandybridge and AMD Steamroller families relaxed this constraint.

[Bug target/111698] New: Narrow memory access of compare to byte width

2023-10-04 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111698

Bug ID: 111698
   Summary: Narrow memory access of compare to byte width
   Product: gcc
   Version: 12.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
  Target Milestone: ---

Following testcase:

--cut here--
int m;

_Bool foo (void)
{
  return m & 0x0f0000;
}
--cut here--

compiles to:

   0:   f7 05 00 00 00 00 00    testl  $0xf0000,0x0(%rip)
   7:   00 0f 00

The test instruction can be demoted to byte test from addr+2.

Currently, the demotion works for lowest byte, so the testcase:

--cut here--
int m;

_Bool foo (void)
{
  return m & 0x0f;
}
--cut here--

compiles to:

   0:   f6 05 00 00 00 00 0f    testb  $0xf,0x0(%rip)

which is three bytes shorter.

Any half-way modern Intel and AMD cores will forward any fully contained load,
so there is no danger of forwarding stall with recent CPU cores.
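
The equivalence behind the proposed narrowing can be checked in plain C. This is only a minimal sketch, assuming a little-endian target such as x86 and a mask confined to bits 16..19 (for example 0x0f0000, the shape of mask that a byte test at addr+2 implies). The helper names test_wide and test_byte are made up for illustration and are not part of GCC:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Wide form: the 32-bit test, as "testl $0xf0000, m" would compute it.  */
static bool test_wide (uint32_t m)
{
  return (m & 0x0f0000u) != 0;
}

/* Narrow form: a byte test at offset 2, as "testb $0xf, m+2" would
   compute it.  Assumes a little-endian layout (true on x86), where
   byte 2 of a 32-bit word holds bits 16..23.  */
static bool test_byte (uint32_t m)
{
  unsigned char bytes[sizeof m];

  memcpy (bytes, &m, sizeof m);
  return (bytes[2] & 0x0f) != 0;
}
```

Both helpers should agree for every input on a little-endian machine; the byte form merely needs a one-byte immediate instead of a four-byte one.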

[Bug target/111657] Memory copy with structure assignment from named address space should be improved

2023-10-02 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

--- Comment #5 from Uroš Bizjak  ---
I have tried to compile with -mtune=nocona, which has:

static stringop_algs nocona_memcpy[2] = {
  {libcall, {{12, loop_1_byte, false}, {-1, rep_prefix_4_byte, false}}},
  {libcall, {{32, loop, false}, {20000, rep_prefix_8_byte, false},
             {100000, unrolled_loop, false}, {-1, libcall, false}}}};

and the compiler produces the expected code in both cases (using unrolled_loop
when rep movsq is unavailable):

foo:
        movq    %fs:0, %rdx
        leaq    t@tpoff(%rdx), %rsi
        movl    $30, %ecx
        rep movsq
        ret

bar:
        xorl    %edx, %edx
.L4:
        movl    %edx, %eax
        movq    %gs:s(%rax), %r9
        movq    %gs:s+8(%rax), %r8
        movq    %gs:s+16(%rax), %rsi
        movq    %gs:s+24(%rax), %rcx
        movq    %r9, (%rdi,%rax)
        movq    %r8, 8(%rdi,%rax)
        movq    %rsi, 16(%rdi,%rax)
        movq    %rcx, 24(%rdi,%rax)
        addl    $32, %edx
        cmpl    $224, %edx
        jb      .L4
        addq    %rdx, %rdi
        movq    %gs:s(%rdx), %rax
        movq    %rax, (%rdi)
        movq    %gs:s+8(%rdx), %rax
        movq    %rax, 8(%rdi)
        ret

[Bug target/111657] Memory copy with structure assignment from named address space should be improved

2023-10-02 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
             Status|NEW                            |ASSIGNED

--- Comment #4 from Uroš Bizjak  ---
Created attachment 56030
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56030&action=edit
Proposed patch

The proposed patch declares the libcall algorithm unavailable for non-default
address spaces and falls back to a loop if everything else fails. The following
testcase:

--cut here--
struct a { long arr[30]; };

__thread struct a t;
void foo (struct a *dst) { *dst = t; }

__seg_gs struct a s;
void bar (struct a *dst) { *dst = s; }
--cut here--

now compiles (-O2 -mno-sse) to:

foo:
        movq    %fs:0, %rdx
        movl    $30, %ecx
        leaq    t@tpoff(%rdx), %rsi
        rep movsq
        ret

bar:
        xorl    %eax, %eax
.L4:
        movl    %eax, %edx
        addl    $8, %eax
        movq    %gs:s(%rdx), %rcx
        movq    %rcx, (%rdi,%rdx)
        cmpl    $240, %eax
        jb      .L4
        ret

(rep movsq copies only from the default ds: address space)
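
In C terms, the loop emitted for bar is a plain word-at-a-time copy. The following is only a rough sketch in portable C: it uses no named address space, so it models just the control flow of the fallback loop, and copy_words is a hypothetical helper rather than anything in GCC:

```c
#include <stddef.h>

struct a { long arr[30]; };

/* Model of the fallback loop: one word is loaded and one word is
   stored per iteration, as in the .L4 loop emitted for bar.  */
static void copy_words (struct a *dst, const struct a *src)
{
  for (size_t i = 0; i < sizeof dst->arr / sizeof dst->arr[0]; i++)
    dst->arr[i] = src->arr[i];
}
```

With an 8-byte long, the 30 iterations cover the same 240 bytes that the assembly compares against.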

[Bug middle-end/111657] Memory copy with structure assignment from named address space is not working

2023-10-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
         Depends on|                               |79649

--- Comment #1 from Uroš Bizjak  ---
Looks like another issue with IVopts (PR79649).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79649
[Bug 79649] Memset pattern in named address space crashes compiler or generates
wrong code

[Bug middle-end/111657] New: Memory copy with structure assignment from named address space is not working

2023-10-01 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111657

Bug ID: 111657
   Summary: Memory copy with structure assignment from named
address space is not working
   Product: gcc
   Version: 12.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
  Target Milestone: ---

Taken from [1]. Compile the following testcase with -O2 -mno-sse:

--cut here--
struct a
{
  long arr[30];
};

__seg_gs struct a m;

void
foo (struct a *dst)
{
  *dst = m;
}
--cut here--

the produced assembly:

foo:
.LFB0:
        xorl    %eax, %eax
        cmpq    $240, %rax
        jnb     .L5
.L2:
        movzbl  %gs:m(%rax), %edx
        movb    %dl, (%rdi,%rax)
        addq    $1, %rax
        cmpq    $240, %rax
        jb      .L2
.L5:
        ret

As rightfully said in [1]:

"...and look at the end result. It's complete and utter sh*t:

<...>

to the point that I can only go "WTF"?

I mean, it's not just that it does the copy one byte at a time. It
literally compares %rax to $240 just after it has cleared it. I look
at that code, and I go "a five-year old with a crayon could have done
better".

[1]
https://lore.kernel.org/lkml/CAHk-=wh+cfn58xxmlng6dh+eb9-2dyfabxjf2ftsz+vfqvv...@mail.gmail.com/

[Bug target/111340] gcc.dg/bitint-12.c fails on x86_64-apple-darwin or fails on x86_64-linux-gnu with -fPIE

2023-09-12 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111340

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
         Resolution|---                            |FIXED
   Target Milestone|---                            |11.5
             Status|ASSIGNED                       |RESOLVED

--- Comment #11 from Uroš Bizjak  ---
Fixed.

[Bug target/111340] gcc.dg/bitint-12.c fails on x86_64-apple-darwin or fails on x86_64-linux-gnu with -fPIE

2023-09-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111340

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|NEW                            |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
                 CC|uros at gcc dot gnu.org        |

--- Comment #5 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #4)
> Of course, what exactly falls under the "g" constraint is target specific.
> Though, because that constraint also allows the constant to be reloaded into a
> register,
> if such constant isn't valid, then RA should have reloaded it into register
> or memory.
> 
> Seems the failure is that i386.cc (output_pic_addr_const) doesn't have the
> CONST_WIDE_INT case unlike output_addr_const.

Indeed.  Patch in testing:

--cut here--
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1cef7ee8f1a..477e6cecc38 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -12344,8 +12344,8 @@ output_pic_addr_const (FILE *file, rtx x, int code)
   assemble_name (asm_out_file, buf);
   break;

-case CONST_INT:
-  fprintf (file, HOST_WIDE_INT_PRINT_DEC, INTVAL (x));
+CASE_CONST_SCALAR_INT:
+  output_addr_const (file, x);
   break;

 case CONST:
--cut here--

[Bug target/111165] [13 regression] builtin strchr miscompiles on Debian/x32 with dietlibc

2023-08-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111165

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
                 CC|                               |hjl.tools at gmail dot com

--- Comment #14 from Uroš Bizjak  ---
(In reply to Thorsten Glaser from comment #13)
> The interesting part is around the occurrence of…
> 
> # eval.c:399:   sp = cstrchr(sp, '\0') + 1;
> 
> … in the .s files (it occurs thrice, the first is the beginning of the setup
> part, the second and third surround the strlen call, so they’re all within a
> bunch of lines).

Unfortunately, a runtime bug requires a test that fails at runtime; the
attached dumps are not very useful. The fact that the compiler fails for a
not-so-common target makes things even harder.

I think that the best way forward is to create a minimized standalone testcase
(from Comment #11 it looks like the issue is independent of dietlibc) that can
be compiled with -mx32 in a kind of cross-compiler fashion. You can use
-maddress-mode=long with -mx32 to create a .s assembly file that is compatible
with x86_64, as far as stack handling is concerned.

The resulting .s assembly can then be compiled and linked with a C wrapper, so
a testcase that eventually fails on x86_64 can be produced.

IOW, does the testcase fail when -maddress-mode=long is used?

[Bug target/110762] [11/12/13 Regression] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-08-25 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                       |RESOLVED
   Target Milestone|13.3                           |14.0
         Resolution|---                            |FIXED

--- Comment #25 from Uroš Bizjak  ---
Let's keep this patch to gcc-14+. The compiler now sanitizes every partial
vector input to potentially trapping instructions. OTOH, the patch introduced
a noticeable runtime regression, so in a follow-up patch (PR110832)
-fno-trapping-math removes sanitization fixups (and the documentation documents
possible issues with assembler and builtins passing non-conformant FP values),
and the -m[no-]partial-vector-fp-math option is introduced to completely
disable potentially trapping instructions for partial vectors.

So, fixed for gcc-14+.

[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core

2023-08-25 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|NEW                            |RESOLVED
         Resolution|---                            |FIXED

--- Comment #13 from Uroš Bizjak  ---
Let's keep this patch to gcc-14+. The runtime regression is now due to strict
IEEE compliance, where the compiler sanitizes every partial vector input to
potentially trapping instructions. OTOH, -fno-trapping-math removes
sanitization fixups (and the documentation documents possible issues with
assembler and builtins passing non-conformant FP values), and the
-m[no-]partial-vector-fp-math option is introduced to completely disable
potentially trapping instructions for partial vectors.

So, fixed for gcc-14+.

[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq

2023-08-24 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
   Target Milestone|---                            |14.0
             Status|ASSIGNED                       |RESOLVED
         Resolution|---                            |FIXED

--- Comment #10 from Uroš Bizjak  ---
Implemented for gcc-14.

[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq

2023-08-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866

--- Comment #7 from Uroš Bizjak  ---
(In reply to Hongtao.liu from comment #6) 
> > So, the compiler still expects vec_concat/vec_select patterns to be present.
> 
> v2df foo_v2df (v2df x)
> {
>   return __builtin_shuffle (x, (v2df) { 0, 0 }, (v2di) { 0, 2 });
> }
> 
> The testcase is not a typical vec_merge case, for vec_merge, the shuffle
> index should be {0, 3}. Here it happened to be a vec_merge because the
> second vector is all zero. And yes for this case, we still need to
> vec_concat:vec_select pattern.

I guess the original patch is the way to go then.

[Bug target/111010] [13/14 regression] error: unable to find a register to spill compiling GCDAProfiling.c since r13-5092-g4e0b504f26f78f

2023-08-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111010

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                       |RESOLVED
         Resolution|---                            |FIXED

--- Comment #20 from Uroš Bizjak  ---
Fixed for gcc-13.3+

[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq

2023-08-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866

--- Comment #5 from Uroš Bizjak  ---
Created attachment 55778
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55778&action=edit
Failing patch,  for reference

Patch that converts vec_concat/vec_select sse2_movq128 patterns to vec_merge.

[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq

2023-08-23 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866

--- Comment #4 from Uroš Bizjak  ---
(In reply to Hongtao.liu from comment #3)
> in x86 backend expand_vec_perm_1, we always try vec_merge first for
> !one_operand_p, expand_vselect_vconcat is only tried when vec_merge failed
> which means we'd better use vec_merge instead of vec_select:vec_concat
> when available in our backend pattern match.

In fact, I tried to convert existing sse2_movq128 patterns to vec_merge, but
the patch regressed:

-FAIL: gcc.target/i386/sse2-pr94680-2.c scan-assembler movq
-FAIL: gcc.target/i386/sse2-pr94680-2.c scan-assembler-not pxor
-FAIL: gcc.target/i386/sse2-pr94680.c scan-assembler-not pxor
-FAIL: gcc.target/i386/sse2-pr94680.c scan-assembler-times
(?n)(?:mov|psrldq).*%xmm[0-9] 12

So, the compiler still expects vec_concat/vec_select patterns to be present.

[Bug target/94866] Failure to optimize pinsrq of 0 with index 1 into movq

2023-08-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94866

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|NEW                            |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #2 from Uroš Bizjak  ---
Created attachment 55776
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55776&action=edit
Proposed patch

Patch that introduces alternative MOVQ RTX definition.

[Bug target/111010] [13/14 regression] error: unable to find a register to spill compiling GCDAProfiling.c since r13-5092-g4e0b504f26f78f

2023-08-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111010

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|NEW                            |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #17 from Uroš Bizjak  ---
(In reply to r...@cebitec.uni-bielefeld.de from comment #16)
> >> Regtested on i386-pc-solaris2.11; compiles both the reduced and the full
> >> testcase with ICE.
> >
> > *WITH* ICE?
> 
> With*out* ICE.  Sorry for being too dumb to type ;-)

Oh, thanks. I'll take care of the bug later today/tomorrow.

[Bug target/111010] [13/14 regression] error: unable to find a register to spill compiling GCDAProfiling.c since r13-5092-g4e0b504f26f78f

2023-08-22 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111010

--- Comment #15 from Uroš Bizjak  ---
(In reply to r...@cebitec.uni-bielefeld.de from comment #13)
> > --- Comment #11 from Uroš Bizjak  ---
> > Created attachment 55772 [details]
> >   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55772&action=edit
> > The correct proposed patch
> >
> > Eh, sorry for wrong attachment.  This is the correct one.
> 
> Regtested on i386-pc-solaris2.11; compiles both the reduced and the full
> testcase with ICE.

*WITH* ICE?

[Bug target/111010] [13/14 regression] error: unable to find a register to spill compiling GCDAProfiling.c since r13-5092-g4e0b504f26f78f

2023-08-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111010

--- Comment #12 from Uroš Bizjak  ---
gcc-13 version:

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 5363b37d448..df476763f85 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -11527,7 +11527,8 @@ (define_insn_and_split "*concat3_3"
 {
   split_double_concat (mode, operands[0], operands[3], operands[1]);
   DONE;
-})
+}
+  [(set_attr "isa" "*,*,*,x64")])

 (define_insn_and_split "*concat3_4"
   [(set (match_operand: 0 "nonimmediate_operand" "=ro,r,r,&r")
@@ -11545,7 +11546,8 @@ (define_insn_and_split "*concat3_4"
 {
   split_double_concat (mode, operands[0], operands[1], operands[2]);
   DONE;
-})
+}
+  [(set_attr "isa" "*,*,*,x64")])

 (define_insn_and_split "*concat3_5"
   [(set (match_operand:DWI 0 "nonimmediate_operand" "=r,o,o")
--cut here--

[Bug target/111010] [13/14 regression] error: unable to find a register to spill compiling GCDAProfiling.c since r13-5092-g4e0b504f26f78f

2023-08-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111010

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
  Attachment #55771|0                              |1
        is obsolete|                               |

--- Comment #11 from Uroš Bizjak  ---
Created attachment 55772
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55772&action=edit
The correct proposed patch

Eh, sorry for wrong attachment.  This is the correct one.

[Bug target/111010] [13/14 regression] error: unable to find a register to spill compiling GCDAProfiling.c since r13-5092-g4e0b504f26f78f

2023-08-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111010

--- Comment #10 from Uroš Bizjak  ---
Created attachment 55771
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55771&action=edit
Proposed patch

This (untested) patch should solve the PR on trunk.

[Bug target/111010] [13/14 regression] error: unable to find a register to spill compiling GCDAProfiling.c since r13-5092-g4e0b504f26f78f

2023-08-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111010

--- Comment #9 from Uroš Bizjak  ---
(In reply to r...@cebitec.uni-bielefeld.de from comment #8)
> > --- Comment #7 from Richard Biener  ---
> >
> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index f3a3305ac4f..d38b9d764d8 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -11511,7 +11511,7 @@
> >  })
> >
> >  (define_insn_and_split "*concat3_3"
> > -  [(set (match_operand: 0 "nonimmediate_operand" "=ro,r,r,&r")
> > +  [(set (match_operand: 0 "nonimmediate_operand" "=ro,r,r,!&r")
> > (any_or_plus:
> >   (ashift:
> > (zero_extend:
> >
> > fixes the issue for me, this disparages the &r,m,m alternative since
> > that makes any reloading difficult(?) and the early-clobber output
> > makes register pressure even harder to deal with.
> 
> On the gcc-13 branch, it does indeed, both for the reduced testcase and
> the original one.  I've also successfully regtested the patch just in
> case.

I think you should add:

(set_attr "isa" "*,*,*,x64")

attribute to hard disable 32bit targets from having two memory operands.

[Bug target/111023] missing extendv4siv4hi (and friends)

2023-08-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111023

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
           Assignee|ubizjak at gmail dot com       |unassigned at gcc dot gnu.org
             Status|ASSIGNED                       |NEW
                 CC|                               |ubizjak at gmail dot com

--- Comment #7 from Uroš Bizjak  ---
The target part is now implemented (even for SSE2).

Should we keep this PR open as a tree-vectorizer enhancement?

[Bug target/111023] missing extendv4siv4hi (and friends)

2023-08-18 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111023

--- Comment #4 from Uroš Bizjak  ---
Created attachment 55753
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55753&action=edit
Proposed patch

Patch that implements zero/sign extend of <= 64byte vector modes to a wider
vector mode also for SSE2.

[Bug target/111023] missing extendv4siv4hi (and friends)

2023-08-18 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111023

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
             Status|UNCONFIRMED                    |ASSIGNED
   Last reconfirmed|                               |2023-08-18
     Ever confirmed|0                              |1

--- Comment #3 from Uroš Bizjak  ---
The idea of implementing some sign/zero extensions using PUNPCKL?? is quite
interesting. We can implement extensions for all <= 64byte vector modes that
extend to wider vector mode also for SSE2.

I have a patch.

[Bug target/111023] missing extendv4siv4hi (and friends)

2023-08-15 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111023

--- Comment #1 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #0)
> We could vectorize gcc.dg/vect/pr65947-7.c if we implement the
> extendv4siv4hi pattern (sign-extend V4HI to V4SI).  We can already do
> vec_unpacks_lo via
> 
> pcmpgtw %xmm0, %xmm1
> movdqa  %xmm0, %xmm2
> punpcklwd   %xmm1, %xmm2
> 
> and that would trivially extend to the required pattern - just the
> input is v4hi instead of v8hi.
> 
> Other related patterns are probably missing as well, where we can do
> vec_unpack[s]_lo we should be able to implement [zero_]extend.

We have:

(define_expand "<insn>v4hiv4si2"
  [(set (match_operand:V4SI 0 "register_operand")
(any_extend:V4SI
  (match_operand:V4HI 1 "nonimmediate_operand")))]
  "TARGET_SSE4_1"

in sse.md, so the testcase should be vectorized using -msse4.1. Is there any
other pattern missing for efficient vectorization?

[Bug tree-optimization/110991] [14 Regression] Dead Code Elimination Regression at -O2 since r14-1135-gc53f51005de

2023-08-11 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110991

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                    |NEW
     Ever confirmed|0                              |1
   Last reconfirmed|                               |2023-08-11

--- Comment #1 from Uroš Bizjak  ---
In gcc-13, the fre4 pass is able to simplify the scalar code, but in gcc-14
nothing simplifies the vectorized code.

[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core

2023-08-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
   Last reconfirmed|                               |2023-08-09
           Keywords|needs-bisection                |
     Ever confirmed|0                              |1
             Status|UNCONFIRMED                    |NEW

[Bug fortran/110957] New: -ffpe-trap and -ffpe-summary options issues

2023-08-09 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110957

Bug ID: 110957
   Summary: -ffpe-trap and -ffpe-summary options issues
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ubizjak at gmail dot com
  Target Milestone: ---

A couple of issues with -ffpe-trap and -ffpe-summary options:

a) Invalid argument report should be switched:

$ gfortran -ffpe-summary=aaa ac.f90
f951: Fatal Error: Argument to ‘-ffpe-trap’ is not valid: aaa
compilation terminated.

$ gfortran -ffpe-trap=aaa ac.f90
f951: Fatal Error: Argument to ‘-ffpe-summary’ is not valid: aaa
compilation terminated.


b) Specifying also -fno-trapping-math should be detected and handled

$ gfortran -ffpe-trap=invalid -fno-trapping-math ac.f90
[no diagnostics]

The issue b) should either report incompatibility between options, or force
-ftrapping-math (probably with a warning). Ideally, -ffpe-* should always set
flag_trapping_math, in case the compiler switches to no trapping math by
default in future.

[Bug rtl-optimization/110587] [14 regression] 96% pr28071.c compile time regression since r14-2337-g37a231cc7594d1

2023-08-02 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110587

--- Comment #20 from Uroš Bizjak  ---
Can we revert the Comment #13 kludge now?

[Bug target/110762] [11/12/13 Regression] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-31 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #22 from Uroš Bizjak  ---
It looks to me that partial vector half-float instructions have the same issue.

[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core

2023-07-31 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832

--- Comment #10 from Uroš Bizjak  ---
(In reply to Hongtao.liu from comment #9)
> for mov_internal, we can just set alternative (v,v) with mode DI, then
> it will use vmovq, for other alternatives which set sse_regs, the
> instructions has already cleared the upper bits.
Move instructions can be sanitized in ix86_expand_vector_move. If the target is
in V2SFmode and the source is a subreg register, then movq_v2sf_to_sse should
be emitted. However, we would still like to emit MOVAPS reg, reg for V2SF to
V2SF moves, because MOVAPS may be eliminated by hardware, while MOVQ won't be.

[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core

2023-07-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832

--- Comment #8 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #6)
> Do we know whether we could in theory improve the sanitizing by optimization
> without -funsafe-math-optimizations (I think -fno-trapping-math,
> -ffinite-math-only -fno-signalling-nans should be a better guard?)?

Regarding the sanitizing, we can remove all sanitizing MOVQ instructions
between trapping instructions (IOW, the result of ADDPS is guaranteed to have
zeros in the high part outside V2SF, so MOVQ is unnecessary in front of a
follow-up MULPS).

I think that some instruction back-walking pass on the RTL insn stream would be
able to identify these unnecessary instructions and remove them.
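
To make the idea concrete, here is a toy model of such a cleanup. It is only a sketch under loud assumptions: it scans forward over a straight-line sequence instead of walking backwards, it knows nothing about real RTL, and every name in it (struct insn, the kind values, remove_redundant_movq) is invented for illustration. ARITH_PS stands for add/sub/mul only, where zero upper lanes in both inputs give zero upper lanes in the result:

```c
#include <stdbool.h>
#include <stddef.h>

#define NREGS 4

enum kind { LOAD_Q, SANITIZE_MOVQ, ARITH_PS, OTHER };

struct insn
{
  enum kind kind;
  int dst;       /* destination register, 0..NREGS-1 */
  int src;       /* second source register (ARITH_PS only) */
  bool removed;  /* set when the insn is found to be redundant */
};

/* Mark sanitizing MOVQs whose register is already known to have a
   zero upper half.  Returns the number of insns marked.  */
static size_t remove_redundant_movq (struct insn *insns, size_t n)
{
  bool upper_zero[NREGS] = { false };
  size_t removed = 0;

  for (size_t i = 0; i < n; i++)
    {
      struct insn *p = &insns[i];

      switch (p->kind)
        {
        case LOAD_Q:
          /* A 64-bit load zeroes the upper half.  */
          upper_zero[p->dst] = true;
          break;
        case SANITIZE_MOVQ:
          if (upper_zero[p->dst])
            {
              p->removed = true;
              removed++;
            }
          upper_zero[p->dst] = true;
          break;
        case ARITH_PS:
          /* dst = dst op src; the upper half stays zero only if it
             was zero in both inputs.  */
          upper_zero[p->dst] = upper_zero[p->dst] && upper_zero[p->src];
          break;
        case OTHER:
          /* Unknown effect: forget what we knew about dst.  */
          upper_zero[p->dst] = false;
          break;
        }
    }
  return removed;
}
```

Fed a sequence shaped like sanitize, sanitize, multiply, sanitize, sanitize, subtract, the model marks only the MOVQ that re-sanitizes the multiply result.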

Also, as mentioned elsewhere, it is really hard to get a non-zero value into
the high part of an XMM register. The compiler takes great care to always load
values via MOVQ, so one has to craft special code that works around all these
fences. OTOH, in the two years since gcc-11 was released with V2SF support, not
a single PR involving spurious exceptions was reported. Even the capacita
benchmark enables:

Note: The following floating-point exceptions are signalling:
IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

without problems.

As an example here, it looks that polyhedron capacita greatly benefits from
V2SF vectors, and I was surprised that sanitizing MOVQ has such an effect here.

[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core

2023-07-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832

--- Comment #7 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #6)
> Do we know whether we could in theory improve the sanitizing by optimization
> without -funsafe-math-optimizations (I think -fno-trapping-math,
> -ffinite-math-only -fno-signalling-nans should be a better guard?)?

I was looking at -funsafe-math-optimizations because the compiler links in
crtfastmath.c which sets DAZ and FTZ flags, so eventual denormals won't bother
us. -fu-m-o also enables -fno-trapping-math, which assumes masked FP
exceptions, so we can still allow V2SF infinities and NaNs. FYI, clang enables
this optimization by default, since it defaults to -fno-trapping-math. It seems
to me that they don't care about denormals.

[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core

2023-07-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
                 CC|                               |ubizjak at gmail dot com

--- Comment #5 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #3)
> Maybe r14-2786-gade30fad6669e5

Yes. This is the cost to sanitize operands before every operation.

However, we can recover the performance for -funsafe-math-optimizations with
the patch, attached to the previous message, from:

  21,592075559 seconds time elapsed
to:
  20,047717312 seconds time elapsed

[Bug middle-end/110832] [14 Regression] 14% capacita -O2 regression between g:9fdbd7d6fa5e0a76 (2023-07-26 01:45) and g:ca912a39cccdd990 (2023-07-27 03:44) on zen3 and core

2023-07-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110832

--- Comment #4 from Uroš Bizjak  ---
Created attachment 55652
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55652&action=edit
Patch to recover performance for -funsafe-math-optimizations

This patch will recover performance with -funsafe-math-optimizations.

[Bug target/110762] [11/12/13 Regression] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-28 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

Uroš Bizjak  changed:

           What    |Removed                        |Added
----------------------------------------------------------------------------
                 CC|uros at gcc dot gnu.org        |
   Target Milestone|11.5                           |13.3

--- Comment #21 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #20)
> Thanks a lot.  So this should now be fully fixed in GCC 14.  The original
> testcase is also broken in GCC 11, 12 and 13 but not 10, but I'm not sure
> how far we'd want to backport this change - I'd consider the 13 branch but
> that's probably it.  After some time soaking, that is.

The issue can be triggered only with specially crafted code (such as the one
in Comment #0 / Comment #12) that deliberately exposes the problem. Otherwise,
the approach from PR 95046 is quite robust, and there have been no PRs in this
area reported, although V2SF is auto-vectorized by default.

The patch is written in such a way as to minimize exposure to subregs (the
temporary V4SFmode output register is used and later copied via subreg to
target V2SFmode operand) to avoid eventual problems in RA. GCC 13.2 was just
released, so I think the patch could be backported to gcc-13 branch in the
first week of August, but as you propose, only to gcc-13 branch, and not any
further.

[Bug rtl-optimization/91838] [8/9 Regression] incorrect use of shr and shrx to shift by 64, missed optimization of vector shift

2023-07-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91838

--- Comment #18 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #17)
> Interestingly even with -mno-sse we somehow have a shift for V2QImode.
This is implemented by a combination of shl rl,cl and shl rh,cl, so no XMM
registers are needed.
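
A scalar model of that sequence, as a hedged sketch only: the two byte lanes of a V2QI value are packed into a uint16_t, and each lane is shifted on its own so that no bits leak from the low lane into the high one. The function name v2qi_shl is made up for illustration, and the count is assumed to be in the range 0..7:

```c
#include <stdint.h>

/* Shift both byte lanes of a packed two-byte vector left by COUNT
   (assumed 0..7), one lane at a time.  */
static uint16_t v2qi_shl (uint16_t v, unsigned count)
{
  uint8_t lo = (uint8_t) (v & 0xff);
  uint8_t hi = (uint8_t) (v >> 8);

  lo = (uint8_t) (lo << count);   /* models: shl rl, cl */
  hi = (uint8_t) (hi << count);   /* models: shl rh, cl */

  return (uint16_t) (lo | (hi << 8));
}
```

A single 16-bit shift would not do: for the input 0x81ff and count 1 it carries the top bit of the low lane into the high lane, while the per-lane version drops it.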

[Bug target/110788] Spilling to mask register for GPR vec_duplicate

2023-07-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110788

--- Comment #3 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #0)
> I suppose it could also be a missed optimization in REE since I think
> the HImode regs should already be zero-extended?
No, only SImode moves have implicit zero extensions. Plain HImode and QImode
moves behave as inserts into the lowpart of the wide register.
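
The difference can be modelled on a 64-bit integer standing in for the full register. This is only an illustrative sketch of the x86-64 write semantics; the helper names write_si, write_hi and write_qi are invented here:

```c
#include <stdint.h>

/* SImode write: the upper 32 bits are cleared (implicit zero
   extension), so the old register contents do not matter.  */
static uint64_t write_si (uint64_t reg, uint32_t val)
{
  (void) reg;
  return (uint64_t) val;
}

/* HImode write: only the low 16 bits are replaced.  */
static uint64_t write_hi (uint64_t reg, uint16_t val)
{
  return (reg & ~(uint64_t) 0xffff) | val;
}

/* QImode write to the low byte: only the low 8 bits are replaced.  */
static uint64_t write_qi (uint64_t reg, uint8_t val)
{
  return (reg & ~(uint64_t) 0xff) | val;
}
```

Because the HImode and QImode forms keep the old upper bits, a value written that way cannot be assumed to be zero-extended.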

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-26 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #18 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #17)
> > compiles to:
> > 
> > movq    %xmm1, %xmm1    # 8     [c=4 l=4]  *vec_concatv4sf_0
> > movq    %xmm0, %xmm0    # 9     [c=4 l=4]  *vec_concatv4sf_0
> > movq    %xmm2, %xmm2    # 12    [c=4 l=4]  *vec_concatv4sf_0
> > mulps   %xmm1, %xmm0    # 10    [c=16 l=3]  *mulv4sf3/0
> > movq    %xmm0, %xmm0    # 13    [c=4 l=4]  *vec_concatv4sf_0
> 
> so this one is obviously redundant - I suppose at the RTL level we have
> no chance of noticing this.  I hope for integer vector operations we
> avoid these ops?  I think this will make epilog vectorization with V2SFmode
> a bad idea, we'd need to appropriately disqualify this in the costing
> hooks.

Yes, the redundant movq is emitted only in front of V2SFmode trapping
operations. So, all integer, V2SF logic and swizzling operations are still
implemented directly with "emulated" instructions.
> 
> I wonder if combine could for example combine a v2sf load with the
> upper half zeroing for the next use?  Likewise for arithmetics.

The patch already does that. We know that V2SF load zeroes the upper half, so
there is no additional MOVQ emitted. To illustrate, the testcase:

--cut here--
typedef float __attribute__((vector_size(8))) v2sf;

v2sf m;

v2sf test (v2sf a)
{
  return a - m;
}
--cut here--

compiles to:

movq    m(%rip), %xmm1  # 6 [c=4 l=8]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 7 [c=4 l=4]  *vec_concatv4sf_0
subps   %xmm1, %xmm0    # 8 [c=12 l=3]  *subv4sf3/0

As far as arithmetic is concerned, perhaps some back-walking RTL optimization
pass can figure out that the preceding trapping V2SFmode operation guarantees
zeros in the upper half and remove the clearing insn. However, MOVQ xmm,xmm is an
extremely fast instruction with latency of 1 and reciprocal throughput of 0.33,
so I guess it is not of much concern.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-26 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |ASSIGNED

--- Comment #16 from Uroš Bizjak  ---
Created attachment 55636
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55636&action=edit
Proposed patch

The proposed patch clears the upper half of a V4SFmode operand register before all
potentially trapping instructions. The testcase from comment #12 now compiles
to:

movq    %xmm1, %xmm1    # 9  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 10 [c=4 l=4]  *vec_concatv4sf_0
addps   %xmm1, %xmm0    # 11 [c=12 l=3]  *addv4sf3/0

This approach addresses issues with traps (Comment #0), as well as with
denormal/invalid values (Comment #14). An obvious exception to the rule is
division, where a value != 0.0 should be loaded into the upper half of the
denominator.
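
A rough C model of the idea follows. It is only a sketch under stated assumptions: v4sf_model is a hypothetical stand-in for an XMM register, the helpers are not GCC code, and "an unused lane became infinity or NaN" is used as a proxy for "the hardware would have raised an exception on data the program never asked to compute".

```c
#include <assert.h>
#include <float.h>

/* Hypothetical stand-in for a 128-bit XMM register holding four floats.
   Lanes 0-1 carry the V2SF value; lanes 2-3 are the unused upper half.  */
typedef struct { float lane[4]; } v4sf_model;

v4sf_model make_v4sf (float x0, float x1, float x2, float x3)
{
  v4sf_model v = { { x0, x1, x2, x3 } };
  return v;
}

float lane_of (v4sf_model v, int i) { return v.lane[i]; }

/* Model of MOVQ xmm, xmm: keep the low 64 bits, zero the upper 64 bits.  */
v4sf_model clear_upper (v4sf_model v)
{
  v.lane[2] = 0.0f;
  v.lane[3] = 0.0f;
  return v;
}

/* Model of putting a nonzero value into the upper half of a divisor.  */
v4sf_model set_upper_one (v4sf_model v)
{
  v.lane[2] = 1.0f;
  v.lane[3] = 1.0f;
  return v;
}

/* Full-width operations, as ADDPS and DIVPS would perform them.  */
v4sf_model add4 (v4sf_model a, v4sf_model b)
{
  for (int i = 0; i < 4; i++)
    a.lane[i] = a.lane[i] + b.lane[i];
  return a;
}

v4sf_model div4 (v4sf_model a, v4sf_model b)
{
  for (int i = 0; i < 4; i++)
    a.lane[i] = a.lane[i] / b.lane[i];
  return a;
}

/* Nonzero if an unused upper lane ended up as infinity or NaN.  */
int upper_half_misbehaves (v4sf_model r)
{
  for (int i = 2; i < 4; i++)
    {
      float x = r.lane[i];
      if (x != x || x > FLT_MAX || x < -FLT_MAX)
        return 1;
    }
  return 0;
}

/* Small self-check; call it from a test driver.  */
void check_upper_half_model (void)
{
  v4sf_model a = make_v4sf (1.0f, 2.0f, FLT_MAX, FLT_MAX);
  v4sf_model b = make_v4sf (3.0f, 4.0f, FLT_MAX, FLT_MAX);
  assert (upper_half_misbehaves (add4 (a, b)));
  assert (!upper_half_misbehaves (add4 (clear_upper (a), clear_upper (b))));
}
```

In this model, adding two registers with leftover data in the upper lanes can overflow there, while clearing both operands first makes the upper lanes compute 0 + 0. For division, clearing both operands would make the upper lanes compute 0 / 0, which is why the divisor gets a nonzero upper half instead.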

The patch effectively tightens the solution from PR95046 by clearing upper
halves of all operand registers before every potentially trapping instruction.
The testcase:

--cut here--
typedef float __attribute__((vector_size(8))) v2sf;

v2sf test (v2sf a, v2sf b, v2sf c)
{
  return a * b - c;
}
--cut here--

compiles to:

movq    %xmm1, %xmm1    # 8  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm0, %xmm0    # 9  [c=4 l=4]  *vec_concatv4sf_0
movq    %xmm2, %xmm2    # 12 [c=4 l=4]  *vec_concatv4sf_0
mulps   %xmm1, %xmm0    # 10 [c=16 l=3]  *mulv4sf3/0
movq    %xmm0, %xmm0    # 13 [c=4 l=4]  *vec_concatv4sf_0
subps   %xmm2, %xmm0    # 14 [c=12 l=3]  *subv4sf3/0

The implementation simply calls the V4SFmode operation, so we can remove all
"emulated" SSE2 V2SFmode instructions and SSE2 V2SFmode alternatives from
3dNOW! insn patterns.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #13 from Uroš Bizjak  ---
I think we should put all partial vector V2SF operations under
!flag_trapping_math.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #10 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #7)
> I guess for the specific usage we need to wrap this in an UNSPEC?

Probably. A MOVQ xmm, xmm insn should then be emitted for __builtin_ia32_storelps
(AKA _mm_storel_pi), so that the top 64 bits are cleared. There is already
*vec_concatv4sf_0 that looks appropriate to implement the move.

[Bug target/110762] inappropriate use of SSE (or AVX) insns for v2sf mode operations

2023-07-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110762

--- Comment #3 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #1)
> So what's the issue?  That this is wrong for -ftrapping-math?  Or that the
> return value has undefined contents in the upper half?  (I don't think the
> ABI specifies how V2SF is returned)

__m64 is classified as SSE class, returned in XMM register.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-20 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|ubizjak at gmail dot com   |unassigned at gcc dot gnu.org
 Status|ASSIGNED|NEW

--- Comment #8 from Uroš Bizjak  ---
(In reply to CVS Commits from comment #7)
> The master branch has been updated by Uros Bizjak :

The patch implements the transform for x86 targets only. Due to the eventual STV
transformation, x86 targets handle double-word operations in their own way.

I'll leave the target-independent implementation to someone else.

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-19 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
 CC|uros at gcc dot gnu.org|

--- Comment #6 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #5)
> Thanks.
> Shouldn't
> INTVAL (operands[2]) <  * BITS_PER_UNIT
> be
> UINTVAL (operands[2]) <  * BITS_PER_UNIT
> just to make sure it doesn't trigger for negative?

Ah, yes, I'll change it.
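
The point of the suggestion can be sketched in plain C. This is an illustration only: int64_t stands in for HOST_WIDE_INT, and limit_bits is a hypothetical parameter for the upper bound, whose exact expression is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Signed comparison, in the spirit of an INTVAL-based check: a negative
   count is smaller than any positive limit, so it slips through.  */
int count_ok_signed (int64_t count, int64_t limit_bits)
{
  return count < limit_bits;
}

/* Unsigned comparison, in the spirit of a UINTVAL-based check: a negative
   count converts to a huge unsigned value and is rejected.  */
int count_ok_unsigned (int64_t count, int64_t limit_bits)
{
  return (uint64_t) count < (uint64_t) limit_bits;
}

/* Small self-check; call it from a test driver.  */
void check_count_range (void)
{
  assert (count_ok_signed (-1, 64));      /* negative count accepted */
  assert (!count_ok_unsigned (-1, 64));   /* negative count rejected */
  assert (count_ok_unsigned (63, 64));
}
```

A single unsigned comparison therefore covers both the "not negative" and the "below the limit" conditions.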

[Bug rtl-optimization/110717] Double-word sign-extension missed-optimization

2023-07-19 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110717

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com

--- Comment #4 from Uroš Bizjak  ---
Created attachment 55578
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55578&action=edit
Proposed patch

Patch in testing.

[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246

2023-07-14 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

Uroš Bizjak  changed:

   What|Removed |Added

 Resolution|--- |FIXED
   Target Milestone|14.0|12.4
 Status|ASSIGNED|RESOLVED

--- Comment #20 from Uroš Bizjak  ---
Fixed for gcc-12.4+.

[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246

2023-07-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #16 from Uroš Bizjak  ---
v2 patch at [1].

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624491.html

[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246

2023-07-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #15 from Uroš Bizjak  ---
Created attachment 55537
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55537&action=edit
Proposed patch.

v2 patch in testing.

This version prevents emission of an invalid REG_EQUAL note in
cprop.cc/try_replace_reg when the original, non-simplified RTX contains a SUBREG.
The patch is in effect a one-liner:

@@ -795,7 +796,8 @@ try_replace_reg (rtx from, rtx to, rtx_insn *insn)
   /* If we've failed perform the replacement, have a single SET to
      a REG destination and don't yet have a note, add a REG_EQUAL note
      to not lose information.  */
-  if (!success && note == 0 && set != 0 && REG_P (SET_DEST (set)))
+  if (!success && note == 0 && set != 0 && REG_P (SET_DEST (set))
+      && !contains_paradoxical_subreg_p (SET_SRC (set)))
     note = set_unique_reg_note (insn, REG_EQUAL, copy_rtx (src));
 }

but we have to move contains_paradoxical_subreg_p to rtlanal.cc.

[Bug target/106966] [12/13/14 Regression] alpha cross build crashes gcc-12 "internal compiler error: in emit_move_insn"

2023-07-13 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106966

Uroš Bizjak  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ubizjak at gmail dot com
 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #17 from Uroš Bizjak  ---
Thanks for helping with tests!

Fixed for gcc-12.4+

[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246

2023-07-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #14 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #10)
> (In reply to Uroš Bizjak from comment #9)
> > and simplify_replace_rtx simplifies the above to:
> > 
> > (gdb) p debug_rtx (src)
> > (const_vector:V8HI [
> > (const_int 204 [0xcc]) repeated x8
> > ])
> 
> Patched compiler simplifies to:
> 
> (gdb) p debug_rtx (src)
> (const_vector:V8HI [
> (const_int 204 [0xcc]) repeated x4
> (const_int 0 [0]) repeated x4
> ])

The patched compiler puts the above in a REG_EQUAL note. While the value is "more
correct", I don't think the compiler has the right to set a REG_EQUAL note when
the top 4 bytes are actually undefined (as a result of an operation with an
undefined input, which is the case with a paradoxical subreg).

[Bug rtl-optimization/110206] [14 Regression] wrong code with -Os -march=cascadelake since r14-1246

2023-07-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110206

--- Comment #13 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #12)
> I can see cprop1 adds the REG_EQUAL note:
> 
> (insn 22 21 23 4 (set (reg:V8HI 100)
> (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
> (parallel [
> (const_int 0 [0])
> (const_int 1 [0x1])
> (const_int 2 [0x2])
> (const_int 3 [0x3])
> (const_int 4 [0x4])
> (const_int 5 [0x5])
>  (const_int 6 [0x6])
>  (const_int 7 [0x7])
>  ] "t.c":12:42 7557 {sse4_1_zero_extendv8qiv8hi2}
> - (expr_list:REG_DEAD (reg:V4QI 98)
> -(nil)))
> + (expr_list:REG_EQUAL (const_vector:V8HI [
> +(const_int 204 [0xcc]) repeated x8
> +])
> +(expr_list:REG_DEAD (reg:V4QI 98)
> +(nil
> 
> but I don't see yet what the actual wrong transform based on this REG_EQUAL
> note is?

We constant-fold a V4QImode const_vector to a V8HImode const_vector with 8
defined elements. We started with undefined top four bytes, but now we
magically define them.

> 
> It looks like we CSE the above with
> 
> -   46: r122:V8QI=[`*.LC3']
> -  REG_EQUAL const_vector
> -   48: r125:V8HI=zero_extend(vec_select(r122:V8QI#0,parallel))
> -  REG_EQUAL const_vector
> -  REG_DEAD r122:V8QI
> -   49: r126:V8HI=r124:V8HI*r125:V8HI
> -  REG_DEAD r125:V8HI
> +   49: r126:V8HI=r124:V8HI*r100:V8HI
> 
> but otherwise do nothing.  So the issue is that we rely on the "undefined"
> vals to have a specific value (from the earlier REG_EQUAL note) but actual
> code generation doesn't ensure this (it doesn't need to).  That said,
> the issue isn't the constant folding per-se but that we do not actually
> constant fold but register an equality that doesn't hold.

The above CSE is the consequence of the REG_EQUAL note that the compiler set on the
insn. The compiler claims that the value of (insn 22) equals an array of 8 consts
{ 204, ... , 204 }, but in reality (cf. Comment #3) the value in the register
%xmm4 before the VPMULLW insn is { 0, 0, 0, 0, 204, 204, 204, 204 }.
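
The mismatch can be modeled with a small C sketch. All names here are hypothetical, and the lane numbering in the sketch is an assumption that need not match the order in which register contents are printed in this report. The `fill` argument stands for whatever happens to sit in the bytes that the paradoxical subreg leaves undefined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a 16-byte vector register that holds a 4-byte
   (V4QI) value in its low bytes.  `fill` stands for the contents of the
   remaining bytes, which a paradoxical subreg leaves undefined.  */
void load_v4qi (uint8_t reg[16], uint8_t elt, uint8_t fill)
{
  memset (reg, fill, 16);
  memset (reg, elt, 4);
}

/* Model of zero_extend:V8HI (vec_select:V8QI ...): widen bytes 0..7.  */
void zero_extend_low8 (const uint8_t reg[16], uint16_t out[8])
{
  for (int i = 0; i < 8; i++)
    out[i] = reg[i];
}

/* Convenience wrapper: one lane of the widened result.  */
uint16_t widened_lane (uint8_t elt, uint8_t fill, int lane)
{
  uint8_t reg[16];
  uint16_t out[8];
  load_v4qi (reg, elt, fill);
  zero_extend_low8 (reg, out);
  return out[lane];
}

/* Small self-check; call it from a test driver.  */
void check_widened_lanes (void)
{
  assert (widened_lane (204, 0, 0) == 204);   /* defined lane */
  assert (widened_lane (204, 0, 4) == 0);     /* depends on fill */
  assert (widened_lane (204, 204, 4) == 204); /* equal only by chance */
}
```

Four of the eight result lanes depend on `fill`, so a note claiming that all eight lanes equal 204 holds only when the undefined bytes happen to contain 204.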
