[Bug rtl-optimization/110254] improve_allocation() routine does not update allocated_hardreg_p[] array

2024-08-09 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110254

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #4 from Surya Kumari Jangala  ---
Fixed in r14-3251-g02ecc9a26324d1

[Bug target/96017] Powerpc suboptimal register spill in likely path

2024-08-09 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96017

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #15 from Surya Kumari Jangala  ---
With r15-1619-g3b9b8d6cfdf593, this testcase gets shrink wrapped. Hence
closing the bug.

[Bug rtl-optimization/111673] assign_hard_reg() routine should scale save/restore costs of callee save registers with basic block frequency

2024-08-09 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111673

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #2 from Surya Kumari Jangala  ---
Fixed in r15-1619-g3b9b8d6cfdf593

[Bug rtl-optimization/116028] [15 regression] gcc.dg/pr10474.c test failure since r15-1619-g3b9b8d6cfdf593

2024-08-08 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116028

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from Surya Kumari Jangala  ---
Fixed in r15-2810-g3c67a0fa1dd39a

[Bug rtl-optimization/116028] [15 regression] gcc.dg/pr10474.c test failure since r15-1619-g3b9b8d6cfdf593

2024-08-01 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116028

--- Comment #7 from Surya Kumari Jangala  ---
I have posted a patch for the fix.

[Bug rtl-optimization/116028] [15 regression] gcc.dg/pr10474.c test failure since r15-1619-g3b9b8d6cfdf593

2024-07-24 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116028

--- Comment #6 from Surya Kumari Jangala  ---
The test gcc.dg/ira-shrinkwrap-prep-1.c has been marked as an XFAIL on powerpc.
The test fails as shrink wrapping does not happen as expected. The reason that
the shrink wrapping doesn't happen is the same as described in comment 2.

[Bug rtl-optimization/116028] [15 regression] gcc.dg/pr10474.c test failure since r15-1619-g3b9b8d6cfdf593

2024-07-23 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116028

--- Comment #4 from Surya Kumari Jangala  ---
(In reply to Sam James from comment #1)
> Yeah, I mentioned it when filing PR115673, but I wasn't sure if they were
> all the same cause so didn't want to file a bunch without knowing.

I am not able to reproduce the failure on x86_64.

[Bug rtl-optimization/116028] [15 regression] gcc.dg/pr10474.c test failure since r15-1619-g3b9b8d6cfdf593

2024-07-23 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116028

Surya Kumari Jangala  changed:

   What|Removed |Added

 Target|aarch64-*-* |aarch64-*-* powerpc*-*-*

--- Comment #3 from Surya Kumari Jangala  ---
On Power, the test gcc.dg/pr10474.c has been marked as an XFAIL.
I analyzed the testcase on Power and the reason the shrink wrapping doesn't
happen is the same as described in comment 2. The parameter register is saved
in a volatile register which is saved on stack in the entry bb. Instead, if the
volatile register is saved just before the call, then shrink wrapping will
happen.

[Bug rtl-optimization/116028] [15 regression] gcc.dg/pr10474.c test failure since r15-1619-g3b9b8d6cfdf593

2024-07-23 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116028

Surya Kumari Jangala  changed:

   What|Removed |Added

 Target||aarch64-*-*

--- Comment #2 from Surya Kumari Jangala  ---
Testcase pr10474.c:

void f(int *i)
{
  if (!i)
    return;
  else
    {
      __builtin_printf("Hi");
      *i=0;
    }
}

--
On aarch64:

Assembly w/o patch at r15-1619-g3b9b8d6cfdf593:
cbz x0, .L7
stp x29, x30, [sp, -32]!
mov x29, sp
str x19, [sp, 16]
mov x19, x0
adrp x0, .LC0
add x0, x0, :lo12:.LC0
bl  printf
str wzr, [x19]
ldr x19, [sp, 16]
ldp x29, x30, [sp], 32
ret
.L7:
ret

---

Assembly w/ patch:
stp x29, x30, [sp, -32]!
mov x29, sp
str x0, [sp, 24]
cbz x0, .L1
adrp x0, .LC0
add x0, x0, :lo12:.LC0
bl  printf
ldr x1, [sp, 24]
str wzr, [x1]
.L1:
ldp x29, x30, [sp], 32
ret


As we can see above, w/o patch the test case gets shrink wrapped.

Input RTL to the LRA pass (the RTL is the same both w/ and w/o patch):

BB2:
  set r95, x0
  set r92, r95
  if (r92 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set mem(r92), 0
BB4:
  ret


Register assignment by IRA:
w/o patch:
  r92-->x19
  r95-->x0
  r94-->x0

w/ patch:
  r92-->x1
  r95-->x0
  r94-->x0


RTL after LRA:

w/o patch:
BB2:
  set x19, x0
  if (x19 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set mem(x19), 0
BB4:
  ret


w/ patch:
BB2:
  set x1, x0
  set mem(sp+24), x1
  if (x1 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set x1, mem(sp+24)
  set mem(x1), 0
BB4:
  ret


The difference between w/o patch and w/ patch is that w/o patch, a callee-save
register (x19) is chosen to hold the value of x0 (input parameter register),
while w/ patch, a caller-save register (x1) is chosen.

W/o patch, during the shrink wrap pass, first copy propagation is done and
the 'if' insn in BB2 is changed as follows:
  set x19, x0
  if (x19 eq 0) jump BB4

changed to:
  set x19, x0
  if (x0 eq 0) jump BB4   

Next, the insn "set x19, x0" is moved down the cfg to BB3. Since x19 is a
callee-save register, prolog gets generated in BB3 thereby resulting in
successful shrink wrapping.

W/ patch, during the shrink wrap pass, copy propagation changes BB2 as follows:
  set x1, x0
  set mem(sp+24), x1
  if (x1 eq 0) jump BB4

changed to:
  set x1, x0
  set mem(sp+24), x0
  if (x0 eq 0) jump BB4

However, the store insn (set mem(sp+24), x0) cannot be moved down to BB3.
Hence the prolog gets generated in BB2 itself due to the use of 'sp', and
shrink wrapping fails.

The store insn (which basically saves x1 to the stack) is generated by the
LRA pass. This insn is needed because x1 is a caller-save register and we
have a call insn that will clobber this register. However, the store insn is
generated in the entry BB (BB2) instead of in BB3, which has the call insn. If
the store is generated in BB3, then the testcase will be shrink wrapped
successfully. In fact, it is more efficient if the store occurs only in the
path containing the printf call instead of occurring in the entry bb.

The reason why LRA generates the store insn in the entry bb is as follows:
LRA emits insns to save caller-save registers in the inheritance/splitting
pass. In this pass, LRA builds EBBs (Extended Basic Blocks) and traverses the
insns in the EBBs in reverse order from the last insn to the first insn. When
LRA sees a write to a pseudo (that has been assigned a caller-save register),
and there is a read following the write, with an intervening call insn between
the write and the read, then LRA generates a spill immediately after the write
and a restore immediately before the read. The spill is needed because the call
insn will clobber the caller-save register.

In the above testcase, LRA forms two EBBs: the first EBB contains BB2 & BB3
while the second EBB contains BB4.

In BB2, there is a write to x1 in the insn:
set r92, r95   // r92 is assigned x1 and r95 is assigned x0

In BB3, there is a read of x1 after the call insn:
set mem(r92), 0   // r92 is assigned x1

So LRA generates a spill in BB2 after the write to x1.

The fix to this issue would involve making changes in LRA to save caller-save
registers before a call instead of after the write to the caller-save register.
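
The difference between the two spill placements can be illustrated with a toy
model. This is only a sketch and not LRA code: the toy_insn type and both
helper functions are hypothetical names made up for this illustration, and the
model walks the insns forward for simplicity, whereas LRA walks the EBB in
reverse. It reports the basic block in which the spill would land under each
policy.

```c
#include <stddef.h>

/* Toy model only: none of these names exist in GCC.  */
enum toy_kind { TOY_WRITE, TOY_CALL, TOY_READ, TOY_OTHER };

struct toy_insn
{
  enum toy_kind kind;
  int bb;   /* Basic block that contains the insn.  */
};

/* Placement described above: the spill goes right after the write, so it
   lands in the basic block of the write.  Returns -1 if there is no
   write ... call ... read sequence.  */
int
toy_spill_bb_after_write (const struct toy_insn *insns, size_t n)
{
  int write_bb = -1;
  int seen_call = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (insns[i].kind == TOY_WRITE)
        {
          write_bb = insns[i].bb;
          seen_call = 0;
        }
      else if (insns[i].kind == TOY_CALL && write_bb >= 0)
        seen_call = 1;
      else if (insns[i].kind == TOY_READ && seen_call)
        return write_bb;
    }
  return -1;
}

/* Suggested placement: the spill goes right before the first call that
   follows the write, so it lands in the basic block of the call.  */
int
toy_spill_bb_before_call (const struct toy_insn *insns, size_t n)
{
  int have_write = 0;
  int call_bb = -1;
  for (size_t i = 0; i < n; i++)
    {
      if (insns[i].kind == TOY_WRITE)
        {
          have_write = 1;
          call_bb = -1;
        }
      else if (insns[i].kind == TOY_CALL && have_write && call_bb < 0)
        call_bb = insns[i].bb;
      else if (insns[i].kind == TOY_READ && call_bb >= 0)
        return call_bb;
    }
  return -1;
}
```

For the testcase above (write in BB2, call and read in BB3), the first helper
places the spill in BB2 and the second in BB3, which is the placement that
would let the prolog move out of the entry bb.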

[Bug rtl-optimization/116028] [15 regression] gcc.dg/pr10474.c test failure

2024-07-22 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116028

Surya Kumari Jangala  changed:

   What|Removed |Added

   Last reconfirmed||2024-07-22
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

[Bug rtl-optimization/116028] New: [15 regression] gcc.dg/pr10474.c test failure

2024-07-22 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116028

Bug ID: 116028
   Summary: [15 regression] gcc.dg/pr10474.c test failure
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jskumari at gcc dot gnu.org
  Target Milestone: ---

The test fails with r15-1619-g3b9b8d6cfdf593.

FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"

The testcase fails to shrink wrap and hence the failure.

[Bug testsuite/115894] [15 regression] gcc.target/arm/pr111235.c test failure

2024-07-17 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115894

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #2 from Surya Kumari Jangala  ---
Fixed in r15-2036-g60ba989220

[Bug testsuite/115892] [15 regression] gcc.target/aarch64/sve/acle/general/cpy_1.c test failure

2024-07-14 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115892

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #2 from Surya Kumari Jangala  ---
Fixed in r15-2034-g8b1492012e5a11

[Bug rtl-optimization/115894] [15 regression] gcc.target/arm/pr111235.c test failure

2024-07-12 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115894

Surya Kumari Jangala  changed:

   What|Removed |Added

   Target Milestone|--- |15.0
 CC||bergner at gcc dot gnu.org
   Keywords||testsuite-fail
 Target||arm

[Bug rtl-optimization/115894] [15 regression] gcc.target/arm/pr111235.c test failure

2024-07-12 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115894

Surya Kumari Jangala  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-07-12
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |jskumari at gcc dot gnu.org

[Bug rtl-optimization/115894] New: [15 regression] gcc.target/arm/pr111235.c test failure

2024-07-12 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115894

Bug ID: 115894
   Summary: [15 regression] gcc.target/arm/pr111235.c test failure
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jskumari at gcc dot gnu.org
  Target Milestone: ---

The test fails with r15-1619-g3b9b8d6cfdf593.

gcc.target/arm/pr111235.c: ldrexd\tr[0-9]+, r[0-9]+, \\[r[0-9]+\\] found 1 times
FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, r[0-9]+, \\[r[0-9]+\\] 2

The test fails because different registers are used in the ldrexd instruction.
The key part of this test is that the compiler generates LDREXD. The registers
used for that are pretty much irrelevant as they are not matched with any other
operations within the test.

As per the discussion at
https://gcc.gnu.org/pipermail/gcc/2023-November/242823.html and
https://gcc.gnu.org/pipermail/gcc/2023-November/242924.html the testcase can be
modified to avoid the failure.

[Bug rtl-optimization/115892] [15 regression] gcc.target/aarch64/sve/acle/general/cpy_1.c test failure

2024-07-12 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115892

Surya Kumari Jangala  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-07-12

[Bug rtl-optimization/115892] New: [15 regression] gcc.target/aarch64/sve/acle/general/cpy_1.c test failure

2024-07-12 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115892

Bug ID: 115892
   Summary: [15 regression] gcc.target/aarch64/sve/acle/general/cpy_1.c test failure
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jskumari at gcc dot gnu.org
  Target Milestone: ---

The test fails with r15-1619-g3b9b8d6cfdf593. 

body: .*\tadd   (x[0-9]+), x0, #?1
\tmov   (p[0-7])\.b, p15\.b
\tmov   z0\.d, \2/m, \1
.*\tret

against: addvl   sp, sp, #-1
str p15, [sp]
mov p3.b, p15.b
add x0, x0, 1
mov z0.d, p3/m, x0
ldr p15, [sp]
addvl   sp, sp, #1
ret

FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve
-moverride=tune=none  check-function-bodies dup_x0_m

As per the discussion at
https://gcc.gnu.org/pipermail/gcc/2023-November/242930.html and
https://gcc.gnu.org/pipermail/gcc/2023-November/242931.html , this is an
acceptable result and the testcase can be changed.

[Bug target/114004] GCC emits a superfluous instruction for simple test case on ppc

2024-02-27 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114004

Surya Kumari Jangala  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

[Bug rtl-optimization/110071] improve_allocation() routine should consider save/restore cost of callee-save registers

2024-02-01 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110071

Surya Kumari Jangala  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Surya Kumari Jangala  ---
Fixed by the commit in comment 1.

[Bug rtl-optimization/110071] improve_allocation() routine should consider save/restore cost of callee-save registers

2024-02-01 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110071

Surya Kumari Jangala  changed:

   What|Removed |Added

   Last reconfirmed||2024-02-01
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

[Bug target/96017] Powerpc suboptimal register spill in likely path

2023-11-24 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96017

--- Comment #14 from Surya Kumari Jangala  ---
Instead of using a non-volatile register to hold the value of foo, a volatile
register (r9) is assigned to hold foo. This avoids setting up the stack frame
in the fast path.

[Bug target/96017] Powerpc suboptimal register spill in likely path

2023-11-24 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96017

--- Comment #13 from Surya Kumari Jangala  ---
With the patch at
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html, the
testcase gets shrink wrapped. This is the assembly produced:

addis 2,12,.TOC.-.LCF0@ha
addi 2,2,.TOC.-.LCF0@l
lwz %r10,0(%r3)
addis %r9,%r2,.LC0@toc@ha
ld %r9,.LC0@toc@l(%r9)
cmpwi %cr0,%r10,0
lwz %r9,0(%r9)
bne %cr0,.L11
extsw %r3,%r9
blr
.L11:
mflr %r0
std %r0,16(%r1)
stdu %r1,-48(%r1)
stw %r9,32(%r1)
bl slowpath
nop
lwz %r9,32(%r1)
addi %r1,%r1,48
ld %r0,16(%r1)
extsw %r3,%r9
mtlr %r0
blr

[Bug rtl-optimization/111673] New: assign_hard_reg() routine should scale save/restore costs of callee save registers with basic block frequency

2023-10-03 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111673

Bug ID: 111673
   Summary: assign_hard_reg() routine should scale save/restore costs of callee save registers with basic block frequency
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jskumari at gcc dot gnu.org
  Target Milestone: ---

In assign_hard_reg(), when computing the costs of the hard registers, the cost
of saving/restoring a callee-save hard register in the prologue/epilogue is
taken into consideration. However, this cost is not scaled with the entry block
frequency. Without scaling, the cost of saving/restoring is quite small and
this can result in a callee-save register being chosen by assign_hard_reg()
even though there are free caller-save (volatile) registers available.

Consider the following test:

int f(int);

int advance(int dz)
{
  if (dz > 0)
    return (dz + dz) * dz;
  else
    return dz * f(dz);
}


Input RTL to IRA pass:

  set r127, r3
  set r121, r127
  set r122, compare(r121, 0)
  if (r122 le 0) jump BB4 else jump BB3

BB3:
  set r123, r121*r121
  set r119, r123<<1
  jump BB5

BB4:
  set r3, call f(r3)
  set r128, r3
  set r119, r128*r121

BB5:
  set r3, r119
  return r3


When assign_hard_reg() is called for allocno r121, the cost for r31 is 0
(obtained from ALLOCNO_UPDATED_HARD_REG_COSTS). Since r31 on PowerPC is a
callee-save register, we compute the cost of saving/restoring r31 in the
prolog/epilog, and this cost is 7. So the final cost for r31 is 7, and r31 is
assigned to allocno r121 since it has the lowest cost among the profitable
registers.
However, among the profitable registers for allocno r121, there are caller-save
registers (like r9) that could be assigned to allocno r121. r9 has a cost of
2040 (obtained from ALLOCNO_UPDATED_HARD_REG_COSTS), so it is not chosen
because the cost of r31 is lower.
But the computation of the save/restore cost for r31 is incorrect as it doesn't
take into consideration the frequency of the basic blocks in which the
save/restore instructions will be placed. If the frequency is taken into
consideration, then the cost of r31 is 7000 (the frequency of the entry bb is
1000), and this would result in r9 being assigned to r121.

Since r31 is assigned to allocno r121, the above test does not get shrink
wrapped.
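
The arithmetic above can be sketched as follows. This is only an illustration
of the comparison, not code from assign_hard_reg(); both function names are
hypothetical, and the numbers are the ones quoted in this report.

```c
/* Illustrative helpers only; these are not GCC functions.  */

/* Cost of a callee-save register: its usage cost plus the save/restore
   cost multiplied by the frequency of the block holding the save/restore.
   Passing a frequency of 1 models the current, unscaled behaviour.  */
int
toy_callee_save_cost (int usage_cost, int save_restore_cost, int entry_freq)
{
  return usage_cost + save_restore_cost * entry_freq;
}

/* Returns 1 if the callee-save register is cheaper than the caller-save
   register and would therefore be chosen, 0 otherwise.  */
int
toy_prefers_callee_save (int callee_cost, int caller_cost)
{
  return callee_cost < caller_cost;
}
```

Without scaling, r31 costs 0 + 7 * 1 = 7, which is below r9's 2040, so r31
wins. With the entry block frequency of 1000, r31 costs 0 + 7 * 1000 = 7000,
which is above 2040, so r9 would be preferred.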

[Bug target/103784] suboptimal code for returning bool value on target ppc

2023-07-20 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784

--- Comment #15 from Surya Kumari Jangala  ---
This is another test which has unnecessary zero extension:

#include <stdbool.h>

bool glob1;
bool glob2;

bool foo (int a, bool d)
{
  bool c;
  if (a > 2)
    c = glob1 & glob2;
  else
    c = glob1 | glob2;
  return c^d;
}


I am not sure if this is handled by the patch at
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624751.html as the RTL for
this test has a different CFG shape than what is mentioned in the patch.

[Bug rtl-optimization/109009] Shrink Wrap missed opportunity

2023-06-27 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

--- Comment #12 from Surya Kumari Jangala  ---
(In reply to Peter Bergner from comment #10)
> (In reply to Peter Bergner from comment #9)
> > Yes, you'll need to factor in the BB frequency.  Since the save/restore code
> > will go into (at this point modulo shrink-wrapping later) the prologue and
> > epilogue, you'll want something like:  * 2 *
> > "ira_memory_move_cost".
> 
> Thinking more, depending on the allocno/mode, you might also need to factor
> in calculate_saved_nregs(), in case the allocno represents a register pair
> or larger.

Yes, I am taking calculate_saved_nregs() into consideration.

[Bug rtl-optimization/109009] Shrink Wrap missed opportunity

2023-06-27 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

--- Comment #11 from Surya Kumari Jangala  ---
(In reply to Peter Bergner from comment #9)
> (In reply to Surya Kumari Jangala from comment #8)
> > However, while computing the save/restore cost, we are considering only the
> > memory move cost but not the BB frequency. I think it is important to
> > consider the frequency too. 
> 
> Yes, you'll need to factor in the BB frequency.  Since the save/restore code
> will go into (at this point modulo shrink-wrapping later) the prologue and
> epilogue, you'll want something like:  * 2 *
> "ira_memory_move_cost".

Thanks for confirming that we have to factor in the BB frequency. 

> I think the issue here, is that the "frequency" of the entry block isn't
> '1', but some larger value.  I'm not 100% sure, but maybe you can use:
>   REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun))

This works. It gives the frequency of the entry block.

> for the BB frequency of the prologue/epilogue?
> 
> 
> > We factor in the frequency when we compute the
> > savings on removed copies in allocno_copy_cost_saving(). In this routine, we
> > multiply the frequency with ira_register_move_cost. So why not factor in the
> > frequency for memory move cost?
> 
> allocno_copy_cost_saving() is dealing with an actual copy instruction(s), so
> a real instruction in a specific BB, so it knows the frequency of the
> copy(ies).  The ira_memory_move_cost is more of a HW cost of a generic
> load/store and it isn't tied to a specific instruction, so there is no
> frequency to scale by, so you'll need to do that manually here.

[Bug rtl-optimization/109009] Shrink Wrap missed opportunity

2023-06-23 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

--- Comment #8 from Surya Kumari Jangala  ---
(In reply to Surya Kumari Jangala from comment #7)
> There are a couple of issues in IRA:
> 
> 1. In improve_allocation() routine, we are not considering save/restore cost
> of using a callee save register (r31 in the failing case). Due to this, r31
> is being chosen instead of r3 for allocno r118.
> 

I made changes in the improve_allocation() routine to consider the save/restore
cost, but this still results in r31 being chosen for allocno r118.

Save/restore cost is computed by using the costs in the "ira_memory_move_cost"
array. But the memory move cost is quite small: it is 4. So even adding this
to r31 does not make the cost of r31 large enough for it to be not chosen.

However, while computing the save/restore cost, we are considering only the
memory move cost but not the BB frequency. I think it is important to consider
the frequency too. We factor in the frequency when we compute the savings on
removed copies in allocno_copy_cost_saving(). In this routine, we multiply the
frequency with ira_register_move_cost. So why not factor in the frequency for
memory move cost?

For the testcase we are considering:

BB2:
  set r123, r4
  set r122, r3
  set r120, compare(r123, 0)
  set r118, r122
  if r120 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r118+1
  return r3

In improve_allocation(), we compute the cost improvement of each hard register
for allocno r118. And for each hard register we compute the
allocno_copy_cost_saving() if that hard reg is assigned to r118. In
allocno_copy_cost_saving(), we check if there are copies involving r118. And we
have one copy:
set r118, r122

Since r122 is a copy of r3, the copy cost saving if r3 is assigned to r118 is:
(freq of the copy insn) * ira_register_move_cost

Here, freq of the copy insn is taken as 1000 and ira_register_move_cost is 2.
So the copy cost saving is 2000. Note that BB2 in which the copy insn occurs is
the first BB.

The save insn for r31 will be placed in the prolog, so the freq of prolog needs
to be taken into consideration.
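
To make the mismatch concrete, here is a small sketch using the figures above.
The helper names are hypothetical and are not the GCC routines. Multiplying
the memory move cost by the prolog frequency is the assumption being argued
for in this comment, not existing behaviour.

```c
/* Illustrative helpers only; these are not GCC functions.  */

/* Saving obtained by removing a copy insn: scaled by the frequency of the
   insn, as described for allocno_copy_cost_saving().  */
int
toy_copy_cost_saving (int copy_insn_freq, int register_move_cost)
{
  return copy_insn_freq * register_move_cost;
}

/* Save/restore cost of a callee-save register.  A frequency of 1 models
   the unscaled cost; passing the prolog frequency models the proposal.  */
int
toy_save_restore_cost (int memory_move_cost, int prolog_freq)
{
  return memory_move_cost * prolog_freq;
}
```

With a copy insn frequency of 1000 and a register move cost of 2, the copy
cost saving is 2000. The unscaled save/restore cost of 4 is negligible next to
that, while a cost scaled by a frequency of 1000 would be 4000 and hence on
the same scale as the saving.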

[Bug rtl-optimization/110254] New: improve_allocation() routine does not update allocated_hardreg_p[] array

2023-06-14 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110254

Bug ID: 110254
   Summary: improve_allocation() routine does not update allocated_hardreg_p[] array
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jskumari at gcc dot gnu.org
  Target Milestone: ---

"allocated_hardreg_p[]" is a boolean array whose element value is TRUE if the
corresponding hard register was already allocated for an allocno.
This array is used in calculate_saved_nregs().

The improve_allocation() function improves the allocation by spilling some
allocnos and assigning the freed hard registers to other allocnos if it
decreases the overall allocation cost.

If the register chosen in improve_allocation() is one that already has been
assigned to a conflicting allocno, then allocated_hardreg_p[] already has the
corresponding bit set to TRUE, so nothing needs to be done.

But improve_allocation() can also choose a register that has not been assigned
to a conflicting allocno, and also has not been assigned to any other allocno.
In this case, allocated_hardreg_p[] has to be updated. improve_allocation()
calls assign_hard_reg() to check if any of the spilled allocnos can get hard
registers, and assign_hard_reg() calls calculate_saved_nregs(), which uses the
array. Hence, the array needs to be updated.
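
A simplified sketch of why the stale array matters is given below. The names
are toy stand-ins and not the GCC definitions: toy_saved_nregs() only models
the idea that a callee-save register needs a new save/restore when no allocno
has been given that register yet, and it ignores register pairs and modes.

```c
#include <stdbool.h>

#define TOY_NUM_HARD_REGS 32

/* Toy stand-in for allocated_hardreg_p[]; not the GCC array itself.  */
static bool toy_allocated_hardreg_p[TOY_NUM_HARD_REGS];

/* Toy stand-in for calculate_saved_nregs(): returns 1 if assigning REGNO
   would require a new save/restore, i.e. the register is callee-save and
   has not been allocated to any allocno so far.  */
int
toy_saved_nregs (int regno, bool callee_saved)
{
  return (callee_saved && !toy_allocated_hardreg_p[regno]) ? 1 : 0;
}

/* The update that this report asks improve_allocation() to perform when
   it hands out a register that was not allocated before.  */
void
toy_record_assignment (int regno)
{
  toy_allocated_hardreg_p[regno] = true;
}
```

If the recording step is skipped after a first assignment of r31, a later
query for r31 still reports one register to save, so the save/restore cost is
charged again even though the register is already in use.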

[Bug rtl-optimization/110071] New: improve_allocation() routine should consider save/restore cost of callee-save registers

2023-06-01 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110071

Bug ID: 110071
   Summary: improve_allocation() routine should consider save/restore cost of callee-save registers
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jskumari at gcc dot gnu.org
  Target Milestone: ---

For the following test:

long
foo (long i, long cond)
{
  if (cond)
    bar ();
  return i+1;
}

Input RTL to IRA:

BB2:
  set r123, r4
  set r122, r3
  set r120, compare(r123, 0)
  set r118, r122
  if r120 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r118+1
  return r3

IRA assigns r31 to r118.
Since r31 is a callee save register in powerpc, we need to generate
spill/restore code.

This assignment of r31 to r118 causes shrink wrap to fail for this test.
Since r31 is assigned to r118, BB2 requires a prolog and shrink wrap fails.

In the IRA pass, after graph coloring, r118 gets assigned to r3.
The routine improve_allocation() is called after graph coloring. This routine
changes the assignment of r118 to r31.

In improve_allocation() routine, IRA checks for each allocno if spilling any
conflicting allocnos can improve the allocation of this allocno. This routine
computes the cost improvement for usage of each profitable hard register for a
given allocno.

The existing code in improve_allocation() does not consider the save/restore
costs of callee save registers while computing the cost improvement.

This bug is for adding save/restore costs while computing cost improvement.

Save/restore costs should be considered only for the first assignment of a
callee save register. Subsequent assignments of the same register do not need
to consider this cost.

[Bug rtl-optimization/109009] Shrink Wrap missed opportunity

2023-05-11 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

--- Comment #7 from Surya Kumari Jangala  ---
There are a couple of issues in IRA:

1. In improve_allocation() routine, we are not considering save/restore cost of
using a callee save register (r31 in the failing case). Due to this, r31 is
being chosen instead of r3 for allocno r118.

2. In find_costs_and_classes(), we are not computing the cost of register moves
when we have a 'set' insn that copies from one pseudo reg to another pseudo
reg. This is resulting in r118 having ALLOCNO_CLASS_COST of 0.

[Bug rtl-optimization/109009] Shrink Wrap missed opportunity

2023-05-10 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

--- Comment #6 from Surya Kumari Jangala  ---
Continuing with the analysis of the test cases specified in comment 5, here are
some findings:

After graph colouring, when we do improve_allocation(), we find that in the
failing test case, the hard_reg_cost[r31] for allocno r118 is 0. (Note: not
just r31, for several other registers the cost is 0). The cost for r3 is 2390.

I spent some time investigating why the hard_reg_cost is 0. In the failing
test, the hard_reg_cost[] array is initialized to ALLOCNO_CLASS_COST in the
routine setup_allocno_class_and_costs(). ALLOCNO_CLASS_COST for allocno r118 is
0.

In the passing test, the hard_reg_cost[] array is initialized to
ALLOCNO_CLASS_COST in the routine process_bb_node_for_hard_reg_moves().
ALLOCNO_CLASS_COST is 2000 for allocno r117.

The reason the initialization happens in a different routine in the passing and
failing tests is because the preferred register class is different from the
allocno class in the failing case (the former is BASE_REGS while the latter is
GENERAL_REGS), while in the passing case both are same (GENERAL_REGS).

ALLOCNO_CLASS_COST is minimal accumulated register class cost of the allocno
class.

In the failing case, while the initial value of hard_reg_cost[r3] is 0, it is
changed in the following routines:
1. ira_tune_allocno_costs() : 0->2640
 (this routine checks if the register is caller save and if it is live across a
call. r3 is caller save while r31 is not. So, only cost for r3 is changed.)
2. process_regs_for_copy() : 2640->2390
 (this routine processes the 'set r3, r118+1' insn and reduces the cost of r3
in order to make it more preferential for r118. Again, cost of r31 is not
changed as there is no insn involving r31 & r118)

Even though r31 has 0 cost, r3 is chosen to be assigned to r118 because r31 is
a callee save register and its use will have save/restore cost.

In the passing case, both r31 and r3 initially have a cost of 2000. 

The cost for r3 is then changed in the following routines:
1. process_bb_node_for_hard_reg_moves() : 2000->0
 (the cost is reduced in order to give preference to r3 for allocno r117.
This is due to the 'set r3, r117' insn)
2. ira_tune_allocno_costs () : 0->2640
3. process_regs_for_copy() : 2640->640

r3 is chosen for allocno r117 as it has a lower cost than r31.

Next, I checked why ALLOCNO_CLASS_COST is 0 in the failing case while in the
passing case it is 2000.

In the routine find_costs_and_classes(), we compute the cost of each register
class for each allocno. First we go through all the insns in the routine. Then for
each insn which involves a copy from/to a hard reg to/from a pseudo reg, we
compute the cost of the 'move' if the pseudo is assigned a register from a
specific register class. This cost is the register class cost of this allocno
for this specific insn. We add up all such costs to get the register class cost
for an allocno.

In the passing case, we have the insn 'set r3, r117' which is processed in
find_costs_and_classes(). For this insn, we check the cost of the move if r117
is assigned to different register classes. The minimum of these costs is the
ALLOCNO_CLASS_COST for r117.

But in the failing case, there is no 'set' insn involving a hard register and
the pseudo register r118. So the ALLOCNO_CLASS_COST is 0.
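
The accumulation described above can be sketched as follows. This is a toy
model of the description in this comment and not the code in
find_costs_and_classes(); the function name and the flat array layout are
assumptions made for the illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Toy model; not a GCC function.  MOVE_COSTS holds, for each insn that
   copies between a hard reg and the pseudo, the cost of that move for each
   candidate register class, laid out as move_costs[insn * nclasses + cls].
   The per-class costs are summed over the insns and the minimum over the
   classes is returned.  */
int
toy_allocno_class_cost (const int *move_costs, size_t ninsns, size_t nclasses)
{
  int min = INT_MAX;
  for (size_t c = 0; c < nclasses; c++)
    {
      int total = 0;
      for (size_t i = 0; i < ninsns; i++)
        total += move_costs[i * nclasses + c];
      if (total < min)
        min = total;
    }
  return nclasses ? min : 0;
}
```

With no qualifying insn (ninsns == 0), every class total stays at 0 and so
does the minimum, which matches the failing case. With one qualifying insn
whose cheapest class costs 2000, the result is 2000, as in the passing case.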

[Bug rtl-optimization/109009] Shrink Wrap missed opportunity

2023-04-14 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

--- Comment #5 from Surya Kumari Jangala  ---
I was analysing and comparing the following test cases:

Test1 (shrink wrapped)

long
foo (long i, long cond)
{
  i = i + 1;
  if (cond)
    bar ();
  return i;
}


Test2 (not shrink wrapped)

long
foo (long i, long cond)
{
  if (cond)
    bar ();
  return i+1;
}


There is a difference in register allocation by IRA in the two cases.

Input RTL to IRA (Test1: passing case)
BB2:
  set r123, r4
  set r122, r3
  set r120, compare(r123, 0)
  set r117, r122 + 1
  if r120 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r117
  return r3


Input RTL to IRA (Test2: failing case)

BB2:
  set r123, r4
  set r122, r3
  set r120, compare(r123, 0)
  set r118, r122
  if r120 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r118+1
  return r3


There is a difference in registers allocated for r117 (passing case) and r118
(failing case) by IRA.
r117 is allocated r3 while r118 is allocated r31.
Since r117 is allocated r3, r3 is spilled across the call to bar() by LRA. And
so only BB3 requires a prolog and shrink wrap is successful.
In the failing case, since r31 is assigned to r118, BB2 requires a prolog and
shrink wrap fails.
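
The reasoning above can be written down as a small model. This is a heavily
simplified sketch, not the logic of the shrink wrap pass: the struct and both
functions are hypothetical, and the only assumption is that a block needs the
prolog if it contains a call or touches a callee-saved register such as r31.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-block summary after register allocation.  */
struct bb_info
{
  bool has_call;               /* block contains a call insn */
  bool uses_callee_saved_reg;  /* block reads or writes e.g. r31 */
};

/* Assumption: a block needs the prolog if it makes a call or touches a
   callee-saved register.  */
bool
bb_needs_prologue (const struct bb_info *bb)
{
  return bb->has_call || bb->uses_callee_saved_reg;
}

/* In this model the prolog can be moved off the entry block (index 0)
   only if the entry block does not need it itself.  */
bool
can_shrink_wrap (const struct bb_info *bbs, int n_bbs)
{
  assert (n_bbs > 0);
  return !bb_needs_prologue (&bbs[0]);
}
```

For Test1 the blocks BB2, BB3, BB4 are {no call, no r31}, {call, no r31},
{no call, no r31}, so the entry block is clean. For Test2, BB2 and BB4 both
use r31, so the entry block already needs the prolog.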

In the IRA pass, after graph coloring, both r117 and r118 get assigned to r3.
The routine improve_allocation() is called after graph coloring. In this
routine, IRA checks for each allocno if spilling any conflicting allocnos can
improve the allocation of this allocno.

Going into more detail, improve_allocation() does the following:
1. We first compute the cost improvement for usage of each profitable hard
register for a given allocno A. The cost improvement is computed as follows:

costs[regno] = A->hard_reg_costs[regno];   // ‘hard_reg_costs’ is an array of
                                           // usage costs for each hard register
costs[regno] -= allocno_copy_cost_saving (A, regno);
costs[regno] -= base_cost;   // Say, ‘reg’ is assigned to A. Then ‘base_cost’ is
                             // the usage cost of ‘reg’ for A.

2. Then we process each conflicting allocno of A and update the cost
improvement for the profitable hard registers of A. Basically, we compute the
spill costs of the conflicting allocnos and update the cost (for A) of the
register that was assigned to the conflicting allocno. 
3. We then find the best register among the profitable registers, spill the
conflicting allocno that uses this best register and assign the best register
to A.


However, the initial hard register costs for some of the profitable hard
registers are different in the passing and failing cases. More specifically, the
costs in the hard_reg_costs[] array are 0 for regs 14-31 in the failing case. A
zero cost seems incorrect. If using a reg in the set [14..31] has zero cost,
then why wasn’t such a reg chosen for r118?
In the passing case, the costs in hard_reg_costs[] for regs 14-31 are 2000.
At the end of step 1, costs[r31] is -390 for the failing case (for allocno r118)
and 1610 for the passing case (for allocno r117).
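
The step 1 arithmetic can be checked with a small sketch. The function below
is hypothetical and only mirrors the three assignments listed in step 1. From
the figures above, the copy cost saving and base_cost together must amount to
390 for r31 (0 - 390 = -390 and 2000 - 390 = 1610); how the 390 splits between
the two terms is not known, so the split used in the example is an assumption.

```c
#include <assert.h>

/* Mirror of step 1 for every hard register: start from the usage cost,
   subtract the copy cost saving for that register, then subtract the
   cost of the register currently assigned to the allocno.  */
void
step1_costs (int *costs, const int *hard_reg_costs,
             const int *copy_cost_savings, int base_cost, int n_regs)
{
  assert (n_regs >= 0);
  for (int regno = 0; regno < n_regs; regno++)
    {
      costs[regno] = hard_reg_costs[regno];
      costs[regno] -= copy_cost_savings[regno];
      costs[regno] -= base_cost;
    }
}
```

Feeding in a hard_reg_costs[] entry of 0 (failing case) and one of 2000
(passing case), with an assumed saving of 0 and an assumed base_cost of 390,
gives -390 and 1610 respectively.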

Another issue(?) is that in step 2, the only conflicting allocno for r118 is the
allocno for r120 which is used to hold the value of the condition check. The
pseudo r120 has been assigned to r100 by the graph coloring step. But r100 is
not in the set of profitable hard registers for r118. (The profitable hard regs
are: [0, 3-12, 14-31]). So the allocno for r120 is not considered for spilling.
And finally in step 3, r31 is assigned to r118, though r31 has not been
assigned to any conflicting allocno. Perhaps improve_allocation() should only
consider registers that have been assigned to conflicting allocnos, and not
other registers, since its stated aim is to see if spilling conflicting
allocnos can result in a better allocation.
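
The suggested restriction can be sketched like this. It is a hypothetical
model of step 3 only, not the code in improve_allocation(): the function name,
the arrays and the only_conflict_regs flag are made up for illustration, and
it assumes that a negative cost means an improvement, which is consistent
with the -390 / 1610 figures above.

```c
#include <assert.h>
#include <stdbool.h>

/* Return the profitable register with the lowest negative cost, or -1 if
   no candidate improves the allocation.  If only_conflict_regs is set,
   registers that are not assigned to any conflicting allocno are skipped
   (the restriction suggested above).  */
int
pick_best_reg (const int *costs, const bool *profitable,
               const bool *assigned_to_conflict, int n_regs,
               bool only_conflict_regs)
{
  int best = -1;

  assert (n_regs >= 0);
  for (int regno = 0; regno < n_regs; regno++)
    {
      if (!profitable[regno])
        continue;
      if (only_conflict_regs && !assigned_to_conflict[regno])
        continue;
      if (costs[regno] < 0 && (best < 0 || costs[regno] < costs[best]))
        best = regno;
    }
  return best;
}
```

In a four-register example where index 2 stands in for r31 (cost -390,
profitable, not used by a conflicting allocno) and index 3 stands in for r100
(used by the conflicting allocno but not profitable), the unrestricted search
picks index 2 and the restricted one picks nothing.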

I am investigating why hard_reg_costs[] has 0 cost for r14..r31.

[Bug target/103784] suboptimal code for returning bool value on target ppc

2023-03-06 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784

--- Comment #13 from Surya Kumari Jangala  ---
Thanks David and Segher for your comments. I wanted to note down my analysis
and thoughts from when I had worked on this bug in January. Ajit is looking
into it now.

[Bug rtl-optimization/109009] Shrink Wrap missed opportunity

2023-03-05 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

--- Comment #3 from Surya Kumari Jangala  ---
For the working case:

 * Input RTL to the IRA pass:
BB2:
  set r123, r4
  set r122, r3
  set r120, compare(r123, 0)
  set r118, r122
  if r120 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r118
  return


 * RTL after the IRA pass:
same as the input

 * RTL after reload pass:
BB2:
  set r100, compare(r4, 0)
  if r100 jump BB4 else jump BB3
BB3:
  set mem(r1+32), r3
  call bar()
  set r3, mem(r1+32)

---

For the failing case:

 * Input RTL to the IRA pass:
BB2:
  set r123, r4
  set r122, r3
  set r120, compare(r123, 0)
  set r118, r122
  if r120 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r118+1
  return r3

  * RTL after IRA pass
same as the input

  * RTL after reload pass
BB2:
  set r100, compare(r4, 0)
  set r31, r3
  if r100 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r31+1
  return r3



After the IRA pass, the IR looks very similar in both the working and failing
testcases. But the reload pass produces different IR. 
The live ranges seem to be properly split after IRA. I will be checking why the
reload/LRA pass produces different IR.

[Bug target/103784] suboptimal code for returning bool value on target ppc

2023-03-05 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784

--- Comment #10 from Surya Kumari Jangala  ---
After the expand pass, we have a single return bb which first zero extends r117
(this reg holds the return value which has been set by predecessor blocks).
Zero extension is done because r117 is of mode QI and we want to change the
mode to DI. After zero extension, r117 is copied to r3.

The input RTL to the peephole2 pass is similar, i.e., the write of the value to
r3 occurs in predecessor BBs while the zero extension of r3 happens in the
return bb. So we cannot do any peephole optimization to get rid of the
unnecessary zero extension.
Note that when the return value is written into r3, it has mode QI. Later in the
return bb, r3 is zero extended to convert its mode into DI.

However, after the bbro (basic block reordering) pass, we have 2 return BBs.
And in each BB, the return value is copied into r3 (in QI mode), and then r3 is
zero extended. Note that bbro occurs after peephole2.
We can do another peephole after bbro, and get rid of the unnecessary zero
extension.
However, we may not always get an opportunity to do a peephole. That is, the
instructions that write into r3 and zero extend r3 can be in different BBs.

A possible solution to this issue would be to have a separate pass that can
remove the zero extends.

In brief, the new pass can do the following:

First create webs.
Then find definitions (that is, writes into registers) that reach zero extend
insns.
Mark such definitions (to indicate that the value is going to be zero extended
later on), and then at the time of assembly generation (final pass),
definitions which have been marked should be converted to assembly instructions
which work on the extended mode (for example, with -m64, the generated assembly
should work on the entire 64-bit register instead of just a part of it).
If we generate such assembly instructions, then the zero extend instruction can
be removed, i.e., no assembly need be generated for it.

Note that for definitions that reach zero extends as well as other uses, we
cannot remove the zero extends.
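
The condition in the last note can be sketched as a check on each definition.
This is only an illustration of the idea, not an existing GCC pass: the enum,
the struct and the function name are hypothetical, and a real pass would walk
def-use chains within a web rather than a flat array.

```c
#include <assert.h>
#include <stdbool.h>

enum use_kind { USE_ZERO_EXTEND, USE_OTHER };

/* Hypothetical summary of one definition: the kinds of uses it reaches.  */
struct def_info
{
  const enum use_kind *uses;
  int n_uses;
};

/* A definition can be marked (and later emitted in the extended mode) only
   if it reaches at least one zero extend and no other kind of use.  */
bool
def_can_be_widened (const struct def_info *def)
{
  assert (def->n_uses >= 0);
  if (def->n_uses == 0)
    return false;
  for (int i = 0; i < def->n_uses; i++)
    if (def->uses[i] != USE_ZERO_EXTEND)
      return false;
  return true;
}
```

A definition whose only use is the zero extend in the return bb qualifies; one
that also feeds, say, a compare does not, so its zero extend has to stay.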

[Bug rtl-optimization/109009] Shrink Wrap missed opportunity

2023-03-04 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

--- Comment #2 from Surya Kumari Jangala  ---
For the working testcase:

long
foo (long i, long cond)
{
  if (cond)
    bar ();
  return i;
}

The input RTL to the shrink wrap pass is:

BB2:
   set r100, compare(r4, 0)
   if r100 jump BB4 else jump BB3

BB3:
   set mem(r1+32), r3
   call bar()
   set r3, mem(r1+32)

BB4:
   return r3

The shrink wrap pass concludes that only BB3 requires a prolog and successfully
does shrink wrapping.

For the non-working case:
long
foo (long i, long cond)
{
  if (cond)
    bar ();
  return i+1;
}

The input rtl to the shrink wrap pass is:

BB2:
   set r100, compare(r4, 0)
   set r31, r3
   if r100 jump BB4 else jump BB3

BB3:
   call bar()

BB4:
   set r3, r31+1
   return r3

The shrink wrap pass decides that both BB2 and BB3 require a prolog, and
finally the prolog is placed in BB2. Thus shrink wrapping fails for this test.

[Bug rtl-optimization/109009] New: Shrink Wrap missed opportunity

2023-03-03 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109009

Bug ID: 109009
   Summary: Shrink Wrap missed opportunity
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: jskumari at gcc dot gnu.org
  Target Milestone: ---

The following test case does not get shrink wrapped:

void bar (void);

long
foo (long i, long cond)
{
  if (cond)
    bar ();
  return i+1;
}

However, if we change the return statement from
   return i+1;
to
   return i;

then shrink wrapping kicks in.

[Bug target/106770] powerpc64le: Unnecessary xxpermdi before mfvsrd

2023-03-02 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #12 from Surya Kumari Jangala  ---
(In reply to Jens Seifert from comment #6)
> The left part of VSX registers overlaps with floating point registers, that
> is why no register xxpermdi is required and mfvsrd can access all (left)
> parts of VSX registers directly.
> The xxpermdi x,y,y,3 indicates to me that gcc prefers right part of register
> which might also cause the xxpermdi at the beginning. At the end the mystery
> is why gcc adds 3 xxpermdi to the code.

The 3rd xxpermdi is incorrect.

[Bug target/106770] powerpc64le: Unnecessary xxpermdi before mfvsrd

2023-03-02 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #10 from Surya Kumari Jangala  ---
The swap pass analyzes vector computations and removes unnecessary doubleword
swaps (xxswapd instructions). The swap pass first constructs webs and removes
swap instructions if possible. If the web contains operations that are
sensitive to element order, such as an extract, then such instructions should
be modified. For example, the lane is changed on an extract operation.

As we saw in comment 9, the swap pass has incorrectly changed the lane of the
vec_extract. The swap pass modifies the extract operation even though there are
no swap instructions in the web. This is a bug in the swap pass. It should
modify only those operations which are present in webs having swap
instructions.
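
The intended behaviour can be stated as a one-line rule. This is a sketch for
a two-element vector only, with a hypothetical function name and a
hypothetical web_has_swaps flag; it is not the code of the swap pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Lane to use for a vec_extract on a two-element vector once the pass has
   processed its web.  The lane is flipped only when the web contained swap
   instructions; otherwise it must be left alone.  */
int
extract_lane_after_swap_pass (int lane, bool web_has_swaps)
{
  assert (lane == 0 || lane == 1);
  return web_has_swaps ? 1 - lane : lane;
}
```

For the web discussed here there are no swaps, so an extract of lane 1 should
stay at lane 1. The bug corresponds to the lane being flipped to 0 anyway.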

[Bug target/106770] powerpc64le: Unnecessary xxpermdi before mfvsrd

2023-03-01 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #9 from Surya Kumari Jangala  ---
RTL after dfinit pass for the vec_sub() and the vec_extract():

(insn 13 12 14 2 (set (reg:V2DI 132 [ vrD.3952 ])
(minus:V2DI (subreg:V2DI (reg:V2DF 117 [ _1 ]) 0)
(subreg:V2DI (reg:V2DF 118 [ _2 ]) 0))) "cpm2.c":9:29 1689
{subv2di3}
 (nil))
(insn 14 13 15 2 (set (reg:DI 133)
(vec_select:DI (reg:V2DI 132 [ vrD.3952 ])
(parallel [
(const_int 1 [0x1])
]))) "cpm2.c":11:12 1371 {*vsx_extract_v2di_0}
 (nil))
(insn 15 14 16 2 (set (reg:DI 119 [ _3 ])
(reg:DI 133)) "cpm2.c":11:12 679 {*movdi_internal64}
 (nil))
(insn 16 15 17 2 (set (reg:SI 134)
(subreg:SI (reg:DI 119 [ _3 ]) 0)) "cpm2.c":11:12 discrim 1 555
{*movsi_internal1}
 (nil))
(insn 17 16 18 2 (set (reg:DI 135)
(sign_extend:DI (reg:SI 134))) "cpm2.c":11:12 discrim 1 31
{extendsidi2}
 (nil))
(insn 18 17 22 2 (set (reg:DI 127 [  ])
(reg:DI 135)) "cpm2.c":11:12 discrim 1 679 {*movdi_internal64}
 (nil))
(insn 22 18 23 2 (set (reg/i:DI 3 3)
(reg:DI 127 [  ])) "cpm2.c":12:1 679 {*movdi_internal64}
 (nil))
(insn 23 22 0 2 (use (reg/i:DI 3 3)) "cpm2.c":12:1 -1
 (nil))

--

RTL after swaps pass:

(insn 13 12 14 2 (set (reg:V2DI 132 [ vrD.3952 ])
(minus:V2DI (subreg:V2DI (reg:V2DF 117 [ _1 ]) 0)
(subreg:V2DI (reg:V2DF 118 [ _2 ]) 0))) "cpm2.c":9:29 1689
{subv2di3}
 (nil))
(insn 14 13 15 2 (set (reg:DI 133)
(vec_select:DI (reg:V2DI 132 [ vrD.3952 ])
(parallel [
(const_int 0 [0])
]))) "cpm2.c":11:12 -1
 (nil))

--

'swaps' pass occurs after 'dfinit' pass.
After dfinit pass, we are extracting the correct element (1st element). See the
vec_select in insn 14. But after 'swaps' pass, the element being extracted
changes to 0.

[Bug target/106770] powerpc64le: Unnecessary xxpermdi before mfvsrd

2023-03-01 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #8 from Surya Kumari Jangala  ---
While the first two xxpermdi's are fine, the 3rd one is a bug. It is incorrect.
Here is the C code inlined into assembly:

_Z4cmp2dd:
.LFB1:
.cfi_startproc

   // vector double va = vec_promote(a, 1);
xxpermdi 1,1,1,0  

   // vector double vb = vec_promote(b, 1);
xxpermdi 2,2,2,0

   // vector long long vlt = (vector long long)vec_cmplt(va, vb);
xvcmpgtdp 33,2,1

   // vector long long vgt = (vector long long)vec_cmplt(vb, va);
xvcmpgtdp 32,1,2

   // vector signed long long vr = vec_sub(vlt, vgt);
vsubudm 0,1,0

   // return vec_extract(vr, 1);
xxpermdi 0,32,32,3
mfvsrd 3,0
extsw 3,3
blr

The code generated for vec_extract(vr, 1) is incorrect. We want to extract the
1st element but 'xxpermdi 0,32,32,3' copies the 0th element of vsx32 into the
0th and 1st elements of vsx0. The mfvsrd instruction does extract the 1st
element, but this is not the first element of the result of the vec_sub.
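
To make the element bookkeeping concrete, here is a small model of the two
instructions. It is a sketch only: the struct and function names are
hypothetical, doublewords are numbered as in the ISA (dw[0] is the left half
that mfvsrd reads), and le_element() assumes the little-endian mapping of
element i to doubleword 1 - i.

```c
#include <assert.h>
#include <stdint.h>

/* A VSX register as two doublewords in ISA numbering: dw[0] is the left
   half, dw[1] the right half.  */
struct vsr
{
  uint64_t dw[2];
};

/* xxpermdi XT,XA,XB,DM: the high DM bit selects the doubleword of XA that
   goes to the left half, the low DM bit selects the doubleword of XB that
   goes to the right half.  */
struct vsr
xxpermdi (struct vsr a, struct vsr b, unsigned dm)
{
  struct vsr t;

  assert (dm < 4);
  t.dw[0] = a.dw[(dm >> 1) & 1];
  t.dw[1] = b.dw[dm & 1];
  return t;
}

/* mfvsrd reads the left doubleword.  */
uint64_t
mfvsrd (struct vsr x)
{
  return x.dw[0];
}

/* Assumed little-endian element numbering for a 2 x 64-bit vector.  */
uint64_t
le_element (struct vsr x, int i)
{
  assert (i == 0 || i == 1);
  return x.dw[1 - i];
}
```

With vr = {111, 222}, element 1 is 111 and element 0 is 222. In this model
mfvsrd applied directly to vr yields element 1, while mfvsrd after
'xxpermdi t,vr,vr,3' yields element 0, which matches the wrong result
described above.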

[Bug target/103784] suboptimal code for returning bool value on target ppc

2023-02-28 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784

--- Comment #9 from Surya Kumari Jangala  ---
The same issue of an unnecessary rldicl instruction occurs if we change the
return type from bool to int.

int foo (int a, int b)
{
  if (a > 2)
    return 0;
  if (b < 10)
    return 1;
  return 0;
}


cmpwi 0,3,2
bgt 0,.L3
subfic 4,4,9
srdi 3,4,63
xori 3,3,0x1
rldicl 3,3,0,63
blr
.p2align 4,,15
.L3:
li 3,0
rldicl 3,3,0,63
blr
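
Why the rldicl is redundant can be seen from a small model of the instruction.
This is a sketch with a hypothetical C function, using IBM bit numbering
(bit 63 is the least significant bit): 'rldicl 3,3,0,63' rotates by 0 and
keeps only bit 63, i.e. it is an AND with 1.

```c
#include <assert.h>
#include <stdint.h>

/* rldicl RA,RS,SH,MB: rotate RS left by SH bits, then clear bits 0..MB-1
   (IBM numbering), keeping bits MB..63.  */
uint64_t
rldicl (uint64_t rs, unsigned sh, unsigned mb)
{
  assert (sh < 64 && mb < 64);
  uint64_t rotated = sh == 0 ? rs : (rs << sh) | (rs >> (64 - sh));
  uint64_t mask = mb == 0 ? ~UINT64_C (0) : (UINT64_C (1) << (64 - mb)) - 1;
  return rotated & mask;
}
```

In both paths of the assembly above, r3 already holds 0 or 1 before the rldicl
('li 3,0' on one path, 'srdi 3,4,63' followed by 'xori 3,3,0x1' on the other),
so the AND with 1 does not change the value.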

[Bug target/103784] suboptimal code for returning bool value on target ppc

2023-01-05 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784

--- Comment #8 from Surya Kumari Jangala  ---
Using -O3 with gcc13, I got (with the test in comment 2):

For P8:
cmpwi 0,3,2
bgt 0,.L3
subfic 4,4,9
srdi 3,4,63
xori 3,3,0x1
rldicl 3,3,0,63
blr
.p2align 4,,15
.L3:
li 3,0
rldicl 3,3,0,63
blr


For P10:
cmpwi 0,3,2
bgt 0,.L3
cmpwi 0,4,9
setbcr 3,1
rldicl 3,3,0,63
blr
.p2align 4,,15
.L3:
li 3,0
rldicl 3,3,0,63
blr

[Bug middle-end/108073] [rs6000] sub-optimal float member accessing on struct parameter

2022-12-20 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108073

Surya Kumari Jangala  changed:

   What|Removed |Added

 CC||jskumari at gcc dot gnu.org

--- Comment #1 from Surya Kumari Jangala  ---
Hi Jiu Fu Guo, are you working on this bug? If not, I would like to take this
up.

[Bug testsuite/107171] New test case gcc.target/powerpc/pr105586.c fails after its introduction in r13-2525-gbec35caafae8db

2022-11-29 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107171

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Surya Kumari Jangala  ---
Fixed the failing testcase.

[Bug rtl-optimization/106418] '-fcompare-debug' failure w/ -mcpu=e500mc -O2 -fnon-call-exceptions -fsched-stalled-insns -fno-reorder-blocks -fno-thread-jumps -fno-tree-dce

2022-11-22 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106418

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from Surya Kumari Jangala  ---
The issue reported here is the same as in bug 105586.
The test provided in the description passes with the patch that fixes bug
105586, and fails w/o the patch.
Hence closing this bug.

[Bug rtl-optimization/105586] [11/12 Regression] -fcompare-debug failure (length) with -O2 -fno-if-conversion -mtune=power4 -fno-guess-branch-probability

2022-11-09 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105586

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #13 from Surya Kumari Jangala  ---
Closing the bug as it is fixed on trunk and we don't plan to backport it.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-11-09 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Surya Kumari Jangala  changed:

   What|Removed |Added

 Status|ASSIGNED|WAITING

--- Comment #21 from Surya Kumari Jangala  ---
There are two options to resolve the issue:

1. Use the BIND(C) directive on the fortran callee (DGEBAL) to make it
interoperable with the caller which is written in C. As described in comment
19, using this directive removed accesses to the caller's frame.

2. As described in
(https://gcc.gnu.org/onlinedocs/gfortran/Argument-passing-conventions.html),
since the first parameter to DGEBAL is of type CHARACTER, there is an extra
hidden argument. Change the call to DGEBAL from dgebal (the flexiBLAS wrapper
routine) to take an extra argument. This causes the compiler to allocate a
parameter save area in dgebal's frame, as there are now 9 parameters but only 8
parameter registers.
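
The argument count behind option 2 can be sketched as below. Both functions
are hypothetical helpers and the model is simplified: it assumes every
argument occupies one general purpose register slot and that a parameter save
area is only needed once the slots no longer fit in r3-r10.

```c
#include <assert.h>

#define N_PARAM_GPRS 8   /* r3..r10 */

/* Number of argument slots the Fortran callee sees: the visible arguments
   plus one hidden length argument per CHARACTER argument.  */
int
fortran_arg_slots (int n_visible_args, int n_character_args)
{
  assert (n_visible_args >= 0);
  assert (n_character_args >= 0 && n_character_args <= n_visible_args);
  return n_visible_args + n_character_args;
}

/* Simplified rule: the caller has to provide a parameter save area when
   the slots do not all fit in the parameter registers.  */
int
needs_parameter_save_area (int n_slots)
{
  return n_slots > N_PARAM_GPRS;
}
```

DGEBAL has 8 visible arguments, one of which (JOB) is of type CHARACTER, so
the callee sees 9 slots. A C caller that declares only 8 arguments stays
within the 8 registers and allocates no parameter save area.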

[Bug rtl-optimization/105586] [11/12 Regression] -fcompare-debug failure (length) with -O2 -fno-if-conversion -mtune=power4 -fno-guess-branch-probability

2022-11-08 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105586

--- Comment #12 from Surya Kumari Jangala  ---
Richard has clarified here
(https://gcc.gnu.org/pipermail/gcc-patches/2022-November/605386.html) that
backporting is not required.

[Bug rtl-optimization/105586] [11/12 Regression] -fcompare-debug failure (length) with -O2 -fno-if-conversion -mtune=power4 -fno-guess-branch-probability

2022-11-08 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105586

--- Comment #10 from Surya Kumari Jangala  ---
(In reply to Segher Boessenkool from comment #9)
> I read 
> as approval to backport, fwiw :-)

I read that as: Since it is *not* a regression, no need to backport.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-10-17 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #19 from Surya Kumari Jangala  ---
There is a keyword called BIND(C) which can be specified on a Fortran procedure
to make it interoperable.
I tried this keyword on DGEBAL fortran routine which is a part of the openblas
library and it worked! I did not see any REG_EQUIV notes after the expand pass,
and the final assembly did not have accesses to the caller's frame.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-10-17 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #18 from Surya Kumari Jangala  ---
I cloned and built flexiblas to check the frame size and the assembly code
generated for the flexiblas C wrapper routine for dgebal.

The important assembly code snippets for dgebal.c :

// r23-r31 are saved in the callee frame
   std r23,-72(r1)
   std r24,-64(r1)
   ...
   ...
   std r31,-8(r1)

// allocate the stack frame
   stdu r1,-112(r1)

// save the parameter registers r3-r10 into r23-r30
   mr  r30,r3
   ...
   mr  r23,r10

// some of the param regs are used as temps
   ld  r3,0(r31)
   lwz r11,16(r3)

// populate the param registers appropriately
   mr  r3,r30
   ...
   mr  r10,r23

// make the call to the fortran dgebal routine
   bctrl

// restore r1
   addi r1,r1,112

// restore r23-r31
   ld  r23,-72(r1)
   ...
   ld  r31,-8(r1)

// return
   blr

As we can see, the frame size allocated is only 112, out of which 32 is for
things like LR, TOC etc. and 72 is needed to save r23-r31. So clearly, the
wrapper routine is not allocating any parameter save area in its frame.
Now, the dgebal fortran routine writes into the caller's frame, thereby
corrupting a callee-saved register (one of r23-r31). So when control returns
from the wrapper routine to the fortran routine dgeev, we see a corrupted
value.
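
The frame size arithmetic can be checked with a short sketch. The functions
are hypothetical; the 32-byte header and the 9 saved registers (r23-r31, 8
bytes each) come from the analysis above, while the rounding up to a multiple
of 16 is an assumption about stack alignment that accounts for the remaining
8 bytes.

```c
#include <assert.h>

/* Round n up to a multiple of align, which must be a power of two.  */
static unsigned
round_up (unsigned n, unsigned align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (n + align - 1) & ~(align - 1);
}

/* Frame size = header + 8 bytes per saved GPR + parameter save area,
   rounded up to 16 bytes (assumed alignment).  */
unsigned
frame_size (unsigned header_bytes, unsigned n_saved_gprs,
            unsigned param_save_bytes)
{
  return round_up (header_bytes + 8 * n_saved_gprs + param_save_bytes, 16);
}
```

With no parameter save area this gives 32 + 72 = 104, rounded up to 112, which
is the size seen in the stdu. Any parameter save area would make the frame
larger than 112; for example a hypothetical 64-byte area would give 176.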

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-10-17 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #17 from Surya Kumari Jangala  ---
I analysed the reduced test case specified in comment 15. In the .s file, the
callee decrements r1 by 224, i.e., the callee’s frame size is 224. But there is
an instruction in the callee that accesses the caller’s frame at (r1+272).
At first glance this looks odd, even incorrect, but after further analysis, I
am not sure if this is incorrect.
If we look at the RTL dumps, the offset 272 is introduced in ‘reload’. ‘Insn 4’
stores into (r1+272).

‘Insn 4’ after vregs:

(insn 4 3 5 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
(reg:DI 5 5 [ arrayD.2714 ])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil)))


‘Insn 4’ after IRA:

(insn 4 214 237 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
(reg:DI 262)) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_DEAD (reg:DI 262)
(expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil))))

‘Insn 4’ after reload:

(insn 4 214 19 2 (set (mem/f/c:DI (plus:DI (reg/f:DI 1 1)
(const_int 272 [0x110])) [3 arrayD.2714+0 S8 A64])
(reg:DI 5 5 [262])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil)))


As we can see, during vregs phase, we are moving r5 to r177 and r177 is equiv
to (ap+48). ‘ap’ (r99) is the base register for access to arguments of the
function.

In the gcc code:
#define ARG_POINTER_REGNUM 99

During vregs phase, not just r5, but all registers from r3-r10 are moved to
pseudo registers and these pseudo regs are equivalent to (ap+’offset’) with
‘offset’ starting from 32 for r3 and going on till 88 for r10. Note that ap
points to the beginning of the callee frame, hence to access the parameter save
area of the caller’s frame, 32 needs to be added to ap.

During LRA, in curr_insn_transform(), we make equivalence substitution and
change r177 to r1+272. (272 because r177 is equivalent to ap+48, and ap equals
r1+224, so ap+48 = r1+272). 
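
The offset arithmetic can be sketched as below. The two function names are
hypothetical; the constants 32 and 8 follow from the offsets quoted above
(32 for r3 up to 88 for r10), and the r1-relative offset simply adds the
callee's frame size because ap equals r1 plus the frame size.

```c
#include <assert.h>

/* Offset from ap of the save slot for parameter register rN (r3..r10).  */
int
param_slot_ap_offset (int regno)
{
  assert (regno >= 3 && regno <= 10);
  return 32 + 8 * (regno - 3);
}

/* The same slot expressed as an offset from r1 after the callee has
   allocated a frame of frame_size bytes.  */
int
param_slot_sp_offset (int regno, int frame_size)
{
  return frame_size + param_slot_ap_offset (regno);
}
```

For r5 and a frame size of 224 this gives 224 + 48 = 272, the offset seen in
‘Insn 4’ after reload.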

The argument registers r3-r10 are saved as they need to be reused to pass
parameters to functions called from the callee. But not all parameter registers
are spilled to the stack. For example, r6 is saved in r24. We can see this
after the “final” phase:

(insn 5 289 19 (set (reg/v/f:DI 24 %r24 [orig:178 ldaD.2715 ] [178])
(reg:DI 6 %r6 [263])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 56 [0x38])) [6 ldaD.2715+0 S8 A64])
(nil)))

I guess r5 had to be spilled to stack because there were no free registers.

Also, note that there is a load from (r1+272) in the reduced test case. This
shows that the value in r5 is needed, and hence it has to be saved somewhere.

I ran the test case with the options: -mcpu=power8 -O2 -fPIC

If the -fPIC option is removed, we do not see any access to the caller’s frame
in the generated assembly. But it does have instructions that save the parameter
registers into other registers. I suppose the parameter registers did not have
to be saved on stack (i.e., in the caller’s parameter save area) because there
were enough registers available. That is, perhaps there is less register
pressure without -fPIC.

After vregs:
(insn 4 3 5 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
(reg:DI 5 %r5 [ arrayD.2714 ])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil)))

After reload:
(insn 4 214 19 2 (set (reg/v/f:DI 17 %r17 [orig:177 arrayD.2714 ] [177])
(reg:DI 5 %r5 [262])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil)))


To summarise, the reduced testcase seems to be correctly compiled. So I shifted
my focus to the original fortran file dgebal.f in the openBLAS library.


In dgebal.f too we have some instructions accessing the caller’s parameter save
area. These are the interesting snippets of instructions from the assembly
code: 

   // The original contents of r23 are spilled.
std %r23,-192(%r1)
   // r3 is saved in r23
mr %r23,%r3
   // frame is allocated
stdu %r1,-400(%r1)

  // restore r3 contents before making call to lsame_. There are several calls
  // to lsame_ and each time, r3 is restored.
mr %r3,%r23
bl lsame_

   // save r23 to the stack because we are running out of registers and we need
   // a free reg.
   // Note that we are saving to the caller’s frame into the parameter save
   // area. And we are saving to (400+32) which is the
   // location that r3 would have been spilled

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-09-18 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #15 from Surya Kumari Jangala  ---
(In reply to Segher Boessenkool from comment #14)
> What is the exact command line (and relevant configuration!) required to
> reproduce this?

The reduced testcase is:

  SUBROUTINE DGEBAL( JOB, N, ARRAY, LDA, ILO, IHI, SCALE, INFO )
  CHARACTER  JOB
  DOUBLE PRECISION   ARRAY( LDA, * ), SCALE( * )
  LOGICAL    NOCONV
  140 CONTINUE
  DO 200 I = K, L
 C = DNRM2( L-K+1, ARRAY( K, I ), 1 )
 R = DNRM2( L-K+1, ARRAY( I, K ), LDA )
 ICA = IDAMAX( L, ARRAY( 1, I ), 1 )
 CA = ABS( ARRAY( ICA, I ) )
 IF( C.EQ.ZERO .OR. R.EQ.ZERO )
 $  GO TO 200
 IF( G.LT.R .OR. MAX( R, RA ).GE.SFMAX2 .OR.
 $   MIN( F, C, G, CA ).LE.SFMIN2 )GO TO 190
 F = F / SCLFAC
 G = G / SCLFAC
  190 CONTINUE
 CALL DSCAL( N-K+1, G, ARRAY( I, K ), LDA )
  200 CONTINUE
  IF( NOCONV )
 $   GO TO 140
  END


The options to use to reproduce: -mcpu=power8 -O2 -fPIC

[Bug rtl-optimization/105041] '-fcompare-debug' failure w/ -mcpu=power6 -O2 -fharden-compares -frename-registers

2022-06-15 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105041

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #9 from Surya Kumari Jangala  ---
Reopening the bug as we need to backport the fix.

[Bug rtl-optimization/105041] '-fcompare-debug' failure w/ -mcpu=power6 -O2 -fharden-compares -frename-registers

2022-06-14 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105041

Surya Kumari Jangala  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #8 from Surya Kumari Jangala  ---
Fixed.

[Bug debug/105586] [11/12/13 Regression] -fcompare-debug failure (length) with -O2 -fno-if-conversion -mtune=power4 -fno-guess-branch-probability

2022-05-19 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105586

Surya Kumari Jangala  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |jskumari at gcc dot gnu.org

--- Comment #4 from Surya Kumari Jangala  ---
I will look into the issue.

[Bug debug/105041] '-fcompare-debug' failure w/ -mcpu=power6 -O2 -fharden-compares -frename-registers

2022-04-06 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105041

--- Comment #6 from Surya Kumari Jangala  ---
I will be debugging the issue to figure out the root cause.