[gcc r15-2036] arm: Fix the expected output of the test pr111235.c [PR115894]

2024-07-15 Thread Surya Kumari Jangala via Gcc-cvs
https://gcc.gnu.org/g:60ba989220d9dec07d82009b0dafe684e652577f

commit r15-2036-g60ba989220d9dec07d82009b0dafe684e652577f
Author: Surya Kumari Jangala 
Date:   Mon Jul 15 00:03:06 2024 -0500

arm: Fix the expected output of the test pr111235.c  [PR115894]

With r15-1619-g3b9b8d6cfdf593, pr111235.c fails due to different
registers being used in the ldrexd instruction. The key part of this test is that
the compiler generates LDREXD. The registers used for that are pretty
much irrelevant as they are not matched with any other operations within
the test. This patch changes the test to test only for the mnemonic and
not for any of the operands.

2024-07-15  Surya Kumari Jangala  

gcc/testsuite:
PR testsuite/115894
* gcc.target/arm/pr111235.c: Update expected output.

Diff:
---
 gcc/testsuite/gcc.target/arm/pr111235.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/arm/pr111235.c b/gcc/testsuite/gcc.target/arm/pr111235.c
index b06a5bfb8e29..1f732cab983a 100644
--- a/gcc/testsuite/gcc.target/arm/pr111235.c
+++ b/gcc/testsuite/gcc.target/arm/pr111235.c
@@ -31,7 +31,7 @@ void t3 (long long *p, int x)
 atomic_store_explicit (p, x, memory_order_relaxed);
 }
 
-/* { dg-final { scan-assembler-times "ldrexd\tr\[0-9\]+, r\[0-9\]+, \\\[r\[0-9\]+\\\]" 2 } } */
+/* { dg-final { scan-assembler-times "ldrexd\t" 2 } } */
 /* { dg-final { scan-assembler-not "ldrgt" } } */
 /* { dg-final { scan-assembler-not "ldrdgt" } } */
 /* { dg-final { scan-assembler-not "ldrexdgt" } } */


[gcc r15-2034] aarch64: Fix the expected output of the test cpy_1.c [PR115892]

2024-07-14 Thread Surya Kumari Jangala via Gcc-cvs
https://gcc.gnu.org/g:8b1492012e5a11e9400e30ee4ae9195c08a2a81e

commit r15-2034-g8b1492012e5a11e9400e30ee4ae9195c08a2a81e
Author: Surya Kumari Jangala 
Date:   Thu Jul 11 11:02:17 2024 -0500

aarch64: Fix the expected output of the test cpy_1.c [PR115892]

The fix at r15-1619-g3b9b8d6cfdf593 results in a rearrangement of
instructions generated for cpy_1.c. This patch fixes the expected output.

2024-07-12  Surya Kumari Jangala  

gcc/testsuite:
PR testsuite/115892
* gcc.target/aarch64/sve/acle/general/cpy_1.c: Update expected
output.

Diff:
---
 gcc/testsuite/gcc.target/aarch64/sve/acle/general/cpy_1.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cpy_1.c b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cpy_1.c
index 57b56a7e256f..1d669913df2e 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cpy_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/general/cpy_1.c
@@ -11,9 +11,15 @@ extern "C" {
 /*
 ** dup_x0_m:
 ** ...
+** (
 ** add (x[0-9]+), x0, #?1
 ** mov (p[0-7])\.b, p15\.b
 ** mov z0\.d, \2/m, \1
+** |
+** mov (p[0-7])\.b, p15\.b
+** add (x[0-9]+), x0, #?1
+** mov z0\.d, \3/m, \4
+** )
 ** ...
 ** ret
 */


Re: [Linaro-TCWG-CI] gcc-15-1619-g3b9b8d6cfdf5: FAIL: 4 regressions on aarch64

2024-06-27 Thread Surya Kumari Jangala via Gcc-regression
Hi,
I am looking into these regressions.

Regards,
Surya

On 26/06/24 7:51 am, ci_not...@linaro.org wrote:
> Dear contributor, our automatic CI has detected problems related to your 
> patch(es).  Please find some details below.  If you have any questions, 
> please follow up on linaro-toolch...@lists.linaro.org mailing list, Libera's 
> #linaro-tcwg channel, or ping your favourite Linaro toolchain developer on 
> the usual project channel.
> 
> We appreciate that it might be difficult to find the necessary logs or 
> reproduce the issue locally. If you can't get what you need from our CI 
> within minutes, let us know and we will be happy to help.
> 
> We track this report status in https://linaro.atlassian.net/browse/GNU-1270 , 
> please let us know if you are looking at the problem and/or when you have a 
> fix.
> 
> In gcc_check master-aarch64 after:
> 
>   | commit gcc-15-1619-g3b9b8d6cfdf5
>   | Author: Surya Kumari Jangala 
>   | Date:   Tue Jun 25 08:37:49 2024 -0500
>   | 
>   | ira: Scale save/restore costs of callee save registers with block frequency
>   | 
>   | In assign_hard_reg(), when computing the costs of the hard registers, the
>   | cost of saving/restoring a callee-save hard register in prolog/epilog is
>   | taken into consideration. However, this cost is not scaled with the entry
>   | block frequency. Without scaling, the cost of saving/restoring is quite
>   | small and this can result in a callee-save register being chosen by
> ... 14 lines of the commit log omitted.
> 
> FAIL: 4 regressions
> 
> regressions.sum:
>   === g++ tests ===
> 
> Running g++:g++.target/aarch64/sve/acle/aarch64-sve-acle.exp ...
> FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve 
> -moverride=tune=none  check-function-bodies dup_x0_m
>   === gcc tests ===
> 
> Running gcc:gcc.dg/dg.exp ...
> FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing 
> shrink-wrapping"
> 
> Running gcc:gcc.target/aarch64/sve/acle/aarch64-sve-acle.exp ...
> ... and 5 more entries
> 
> You can find the failure logs in *.log.1.xz files in
>  - 
> https://ci.linaro.org/job/tcwg_gcc_check--master-aarch64-build/2235/artifact/artifacts/00-sumfiles/
> The full lists of regressions and progressions as well as configure and make 
> commands are in
>  - 
> https://ci.linaro.org/job/tcwg_gcc_check--master-aarch64-build/2235/artifact/artifacts/notify/
> The list of [ignored] baseline and flaky failures are in
>  - 
> https://ci.linaro.org/job/tcwg_gcc_check--master-aarch64-build/2235/artifact/artifacts/sumfiles/xfails.xfail
> 
> The configuration of this build is:
> CI config tcwg_gcc_check master-aarch64
> 
> -8<--8<--8<--
> The information below can be used to reproduce a debug environment:
> 
> Current build   : 
> https://ci.linaro.org/job/tcwg_gcc_check--master-aarch64-build/2235/artifact/artifacts
> Reference build : 
> https://ci.linaro.org/job/tcwg_gcc_check--master-aarch64-build/2234/artifact/artifacts
> 
> Reproduce last good and first bad builds: 
> https://git-us.linaro.org/toolchain/ci/interesting-commits.git/plain/gcc/sha1/3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b/tcwg_gcc_check/master-aarch64/reproduction_instructions.txt
> 
> Full commit : 
> https://github.com/gcc-mirror/gcc/commit/3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
> 
> List of configurations that regressed due to this commit :
> * tcwg_gcc_check
> ** master-aarch64
> *** FAIL: 4 regressions
> *** 
> https://git-us.linaro.org/toolchain/ci/interesting-commits.git/plain/gcc/sha1/3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b/tcwg_gcc_check/master-aarch64/details.txt
> *** 
> https://ci.linaro.org/job/tcwg_gcc_check--master-aarch64-build/2235/artifact/artifacts


[gcc r15-1619] ira: Scale save/restore costs of callee save registers with block frequency

2024-06-25 Thread Surya Kumari Jangala via Gcc-cvs
https://gcc.gnu.org/g:3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b

commit r15-1619-g3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
Author: Surya Kumari Jangala 
Date:   Tue Jun 25 08:37:49 2024 -0500

ira: Scale save/restore costs of callee save registers with block frequency

In assign_hard_reg(), when computing the costs of the hard registers, the
cost of saving/restoring a callee-save hard register in prolog/epilog is
taken into consideration. However, this cost is not scaled with the entry
block frequency. Without scaling, the cost of saving/restoring is quite
small and this can result in a callee-save register being chosen by
assign_hard_reg() even though there are free caller-save registers
available. Assigning a callee save register to a pseudo that is live
in the entire function and across a call will cause shrink wrap to fail.

2024-06-25  Surya Kumari Jangala  

gcc/
PR rtl-optimization/111673
* ira-color.cc (assign_hard_reg): Scale save/restore costs of
callee save registers with block frequency.

gcc/testsuite/
PR rtl-optimization/111673
* gcc.target/powerpc/pr111673.c: New test.

Diff:
---
 gcc/ira-color.cc                            |  4 +++-
 gcc/testsuite/gcc.target/powerpc/pr111673.c | 17 +++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
index b9ae32d1b4d..ca32a23a0c9 100644
--- a/gcc/ira-color.cc
+++ b/gcc/ira-color.cc
@@ -2178,7 +2178,9 @@ assign_hard_reg (ira_allocno_t a, bool retry_p)
 		add_cost = ((ira_memory_move_cost[mode][rclass][0]
 			     + ira_memory_move_cost[mode][rclass][1])
 			    * saved_nregs / hard_regno_nregs (hard_regno,
-							      mode) - 1);
+							      mode) - 1)
+			   * (optimize_size ? 1 :
+			      REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun)));
 		cost += add_cost;
 		full_cost += add_cost;
 	      }
diff --git a/gcc/testsuite/gcc.target/powerpc/pr111673.c b/gcc/testsuite/gcc.target/powerpc/pr111673.c
new file mode 100644
index 000..e0c0f85460a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr111673.c
@@ -0,0 +1,17 @@
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
+
+/* Verify there is an early return without the prolog and shrink-wrap
+   the function. */
+
+int f (int);
+int
+advance (int dz)
+{
+  if (dz > 0)
+return (dz + dz) * dz;
+  else
+return dz * f (dz);
+}
+
+/* { dg-final { scan-rtl-dump-times "Performing shrink-wrapping" 1 "pro_and_epilogue" } } */


Re: “ira_may_move_out_cost” vs “ira_register_move_cost”

2024-06-18 Thread Surya Kumari Jangala via Gcc
Hi Vladimir,

On 14/06/24 10:56 pm, Vladimir Makarov wrote:
> 
> On 6/13/24 00:34, Surya Kumari Jangala wrote:
>> Hi Vladimir,
>> With my patch for PR111673 (scale the spill/restore cost of callee-save
>> register with the frequency of the entry bb in the routine assign_hard_reg() 
>> :
>> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html), the
>> following Linaro aarch64 test failed due to an extra 'mov' instruction:
>>
>> __SVBool_t __attribute__((noipa))
>> callee_pred (__SVBool_t p0, __SVBool_t p1, __SVBool_t p2, __SVBool_t p3,
>>   __SVBool_t mem0, __SVBool_t mem1)
>> {
>>    p0 = svbrkpa_z (p0, p1, p2);
>>    p0 = svbrkpb_z (p0, p3, mem0);
>>    return svbrka_z (p0, mem1);
>> }
>>
>> With trunk:
>>  addvl   sp, sp, #-1
>>  str p14, [sp]
>>  str p15, [sp, #1, mul vl]
>>  ldr p14, [x0]
>>  ldr p15, [x1]
>>  brkpa   p0.b, p0/z, p1.b, p2.b
>>  brkpb   p0.b, p0/z, p3.b, p14.b
>>  brka    p0.b, p0/z, p15.b
>>  ldr p14, [sp]
>>  ldr p15, [sp, #1, mul vl]
>>  addvl   sp, sp, #1
>>  ret
>>
>> With patch:
>>    addvl   sp, sp, #-1
>>  str p14, [sp]
>>  str p15, [sp, #1, mul vl]
>>  mov p14.b, p3.b  // extra mov insn
>>  ldr p15, [x0]
>>  ldr p3, [x1]
>>  brkpa   p0.b, p0/z, p1.b, p2.b
>>  brkpb   p0.b, p0/z, p14.b, p15.b
>>  brka    p0.b, p0/z, p3.b
>>  ldr p14, [sp]
>>  ldr p15, [sp, #1, mul vl]
>>  addvl   sp, sp, #1
>>  ret
>>
>> p0-p15 are predicate registers on aarch64 where p0-p3 are caller-save while
>> p4-p15 are callee-save.
>>
>> The input RTL for ira pass:
>>
>> 1:   set r112, r68    #p0
>> 2:   set r113, r69    #p1
>> 3:   set r114, r70    #p2
>> 4:   set r115, r71    #p3
>> 5:   set r116, x0 #mem0, the 5th parameter
>> 6:   set r108, mem(r116)
>> 7:   set r117, x1 #mem1, the 6th parameter
>> 8:   set r110, mem(r117)
>> 9:   set r100, unspec_brkpa(r112, r113, r114)
>> 10:  set r101, unspec_brkpb(r100, r115, r108)
>> 11:  set r68,  unspec_brka(r101, r110)
>> 12:  ret r68
>>
>> Here, r68-r71 represent predicate hard regs p0-p3.
>> With my patch, r113 and r114 are being assigned memory by ira but with trunk 
>> they are
>> assigned registers. This in turn leads to a difference in decisions taken by 
>> LRA
>> ultimately leading to the extra mov instruction.
>>
>> Register assignment w/ patch:
>>
>>    Popping a5(r112,l0)  -- assign reg p0
>>    Popping a2(r100,l0)  -- assign reg p0
>>    Popping a0(r101,l0)  -- assign reg p0
>>    Popping a1(r110,l0)  -- assign reg p3
>>    Popping a3(r115,l0)  -- assign reg p2
>>    Popping a4(r108,l0)  -- assign reg p1
>>    Popping a6(r113,l0)  -- (memory is more profitable 8000 vs 9000) 
>> spill!
>>    Popping a7(r114,l0)  -- (memory is more profitable 8000 vs 9000) 
>> spill!
>>    Popping a8(r117,l0)  -- assign reg 1
>>    Popping a9(r116,l0)  -- assign reg 0
>>
>>
>> With patch, cost of memory is 8000 and it is less than the cost of
>> callee-save
>> register (9000) and hence memory is assigned to r113 and r114. It is 
>> interesting
>> to see that all the callee-save registers are free but none is chosen.
>>
>> The two instructions in which r113 is referenced are:
>> 2:  set r113, r69 #p1
>> 9:  set r100, unspec_brkpa(r112, r113, r114)
>>
>> IRA computes the memory cost of an allocno in find_costs_and_classes(). In 
>> this routine
>> IRA scans each insn and computes memory cost and cost of register classes 
>> for each
>> operand in the insn.
>>
>> So for insn 2, memory cost of r113 is set to 4000 because this is the cost 
>> of storing
>> r69 to memory if r113 is assigned to memory. The possible register classes 
>> of r113
>> are ALL_REGS, PR_REGS, PR_HI_REGS and PR_LO_REGS. The cost of moving r69
>> to r113 if r113 is assigned a register from each of the possible register 
>> classes is
>> computed. If r113 is assigned a reg in ALL_REGS, then the cost of the
>> move is 18000, while if r113 is assigned a register from any of the 
>> predicate register
>> classes, then the cost of the move is 2000. This cost is obtained from the 
>> array
>> “ira_register_move_cost”. After scanning insn 9, memory cost of r113
>> is increased to 8000 because if r113 is assigned memory, we need a load to 
>> read the
>> value before using it in the unspec_brkpa. But the register class cost is 
>> unchanged.
>>
>> Later in setup_allocno_class_and_costs(), the ALLOCNO_CLASS of r113 is set 
>> to PR_REGS.
>> The ALLOCNO_MEMORY_COST of r113 is set to 8000.
>> The ALLOCNO_HARD_REG_COSTS of each register in PR_REGS is set to 2000.
>>
>> During coloring, when r113 has to be assigned a register, the cost of 
>> callee-save
>> registers in PR_REGS is increased by the spill/restore cost. 

“ira_may_move_out_cost” vs “ira_register_move_cost”

2024-06-12 Thread Surya Kumari Jangala via Gcc
Hi Vladimir,
With my patch for PR111673 (scale the spill/restore cost of callee-save
registers with the frequency of the entry bb in the routine assign_hard_reg():
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html), the
following Linaro aarch64 test failed due to an extra 'mov' instruction:

__SVBool_t __attribute__((noipa))
callee_pred (__SVBool_t p0, __SVBool_t p1, __SVBool_t p2, __SVBool_t p3,
 __SVBool_t mem0, __SVBool_t mem1)
{
  p0 = svbrkpa_z (p0, p1, p2);
  p0 = svbrkpb_z (p0, p3, mem0);
  return svbrka_z (p0, mem1);
}

With trunk:
addvl   sp, sp, #-1
str p14, [sp]
str p15, [sp, #1, mul vl]
ldr p14, [x0]
ldr p15, [x1]
brkpa   p0.b, p0/z, p1.b, p2.b
brkpb   p0.b, p0/z, p3.b, p14.b
brka    p0.b, p0/z, p15.b
ldr p14, [sp]
ldr p15, [sp, #1, mul vl]
addvl   sp, sp, #1
ret

With patch:
   addvl   sp, sp, #-1
str p14, [sp]
str p15, [sp, #1, mul vl]
mov p14.b, p3.b  // extra mov insn
ldr p15, [x0]
ldr p3, [x1]
brkpa   p0.b, p0/z, p1.b, p2.b
brkpb   p0.b, p0/z, p14.b, p15.b
brka    p0.b, p0/z, p3.b
ldr p14, [sp]
ldr p15, [sp, #1, mul vl]
addvl   sp, sp, #1
ret

p0-p15 are predicate registers on aarch64 where p0-p3 are caller-save while
p4-p15 are callee-save.

The input RTL for ira pass:

1:   set r112, r68    #p0
2:   set r113, r69    #p1
3:   set r114, r70    #p2
4:   set r115, r71    #p3
5:   set r116, x0 #mem0, the 5th parameter
6:   set r108, mem(r116)
7:   set r117, x1 #mem1, the 6th parameter
8:   set r110, mem(r117)
9:   set r100, unspec_brkpa(r112, r113, r114)
10:  set r101, unspec_brkpb(r100, r115, r108)
11:  set r68,  unspec_brka(r101, r110)
12:  ret r68

Here, r68-r71 represent predicate hard regs p0-p3.
With my patch, r113 and r114 are being assigned memory by IRA, but with trunk
they are assigned registers. This in turn leads to a difference in the
decisions taken by LRA, ultimately leading to the extra mov instruction.

Register assignment w/ patch:

  Popping a5(r112,l0)  -- assign reg p0 
  Popping a2(r100,l0)  -- assign reg p0 
  Popping a0(r101,l0)  -- assign reg p0 
  Popping a1(r110,l0)  -- assign reg p3 
  Popping a3(r115,l0)  -- assign reg p2 
  Popping a4(r108,l0)  -- assign reg p1 
  Popping a6(r113,l0)  -- (memory is more profitable 8000 vs 9000) spill!
  Popping a7(r114,l0)  -- (memory is more profitable 8000 vs 9000) spill!
  Popping a8(r117,l0)  -- assign reg 1
  Popping a9(r116,l0)  -- assign reg 0


With the patch, the cost of memory is 8000, which is less than the cost of a
callee-save register (9000), and hence memory is assigned to r113 and r114.
It is interesting that all the callee-save registers are free but none is
chosen.

The two instructions in which r113 is referenced are:
2:  set r113, r69 #p1
9:  set r100, unspec_brkpa(r112, r113, r114)

IRA computes the memory cost of an allocno in find_costs_and_classes(). In
this routine, IRA scans each insn and computes the memory cost and the cost of
the register classes for each operand in the insn.

So for insn 2, the memory cost of r113 is set to 4000, because this is the
cost of storing r69 to memory if r113 is assigned to memory. The possible
register classes of r113 are ALL_REGS, PR_REGS, PR_HI_REGS and PR_LO_REGS.
The cost of moving r69 to r113 is computed for each of the possible register
classes: if r113 is assigned a reg in ALL_REGS, the cost of the move is 18000,
while if r113 is assigned a register from any of the predicate register
classes, the cost of the move is 2000. This cost is obtained from the array
“ira_register_move_cost”. After scanning insn 9, the memory cost of r113 is
increased to 8000, because if r113 is assigned memory, we need a load to read
the value before using it in the unspec_brkpa. The register class cost,
however, is unchanged.

Later, in setup_allocno_class_and_costs(), the ALLOCNO_CLASS of r113 is set to
PR_REGS. The ALLOCNO_MEMORY_COST of r113 is set to 8000. The
ALLOCNO_HARD_REG_COSTS of each register in PR_REGS is set to 2000.

During coloring, when r113 has to be assigned a register, the cost of the
callee-save registers in PR_REGS is increased by the spill/restore cost, so
the cost of the callee-save registers increases from 2000 to 9000. All the
caller-save registers have been assigned to other allocnos, so r113 is
assigned memory, as memory is cheaper than a callee-save register.

However, for r108, the cost is 0 for the register classes PR_REGS, PR_HI_REGS
and PR_LO_REGS.

References of r108:
6:   set r108, mem(r116)
10:  set r101, unspec_brkpb(r100, r115, r108)

It was surprising that while for r113, the cost of the predicate register 
classes 
was 2000, for 

Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

2024-01-28 Thread Surya Kumari Jangala via Gcc
then LRA generates a spill immediately after the write and a restore 
> immediately
> before the read. The spill is needed because the call insn will clobber the
> caller-save register.
> 
> In the above testcase, LRA forms two EBBs: the first EBB contains BB2 & BB3 
> while
> the second EBB contains BB4. 
> 
> In BB2, there is a write to x1 in the insn : 
> set r92, r95 //r92 is assigned x1 and r95 is assigned x0
> 
> In BB3, there is a read of x1 after the call
> insn.
> set mem(r92), 0   // r92 is assigned x1
> 
> So LRA generates a spill in BB2 after the write to x1.
> 
> I have a patch (bootstrapped and regtested on powerpc) that makes changes in
> LRA to save caller-save registers before a call instead of after the write to 
> the
> caller-save register. With this patch, both the above test gets successfully
> shrink wrapped. After committing the patch for PR111673, I plan to get the 
> LRA fix reviewed.
> 
> Please let me know if you need more information.
> 
> Regards,
> Surya
> 
> 
> On 14/12/23 9:41 pm, Richard Earnshaw (lists) wrote:
>> On 14/12/2023 07:17, Surya Kumari Jangala via Gcc wrote:
>>> Hi Richard,
>>> Thanks a lot for your response!
>>>
>>> Another failure reported by the Linaro CI is as follows:
>>>
>>> Running gcc:gcc.dg/dg.exp ...
>>> FAIL: gcc.dg/ira-shrinkwrap-prep-1.c scan-rtl-dump pro_and_epilogue 
>>> "Performing shrink-wrapping"
>>> FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing 
>>> shrink-wrapping"
>>>
>>> I analyzed the failures and the root cause is the same for both the 
>>> failures.
>>>
>>> The test pr10474.c is as follows:
>>>
>>> void f(int *i)
>>> {
>>> if (!i)
>>> return;
>>> else
>>> {
>>> __builtin_printf("Hi");
>>> *i=0;
>>> }
>>> }
>>>
>>>
>>> With the patch (for PR111673), x1 (volatile) is being assigned to hold 
>>> value of
>>> x0 (first parameter). Since it is a volatile, x1 is saved to the stack as 
>>> there
>>> is a call later on. The save to the stack is generated in the LRA pass. The 
>>> save
>>> is generated in the entry basic block. Due to the usage of the stack 
>>> pointer in
>>> the entry bb, the testcase fails to be shrink wrapped.
>>
>> I'm not entirely sure I understand what you mean from a quick glance.  Do 
>> you mean that X1 has the /v flag marked on it (ie it's printed in dumps as 
>> "reg/v")?  If so, that's not volatile, it just means that the register is 
>> associated with a user variable (as opposed to a compiler-generated 
>> temporary variable):
>>
>> From the manual:
>>
>> @item REG_USERVAR_P (@var{x})
>> In a @code{reg}, nonzero if it corresponds to a variable present in
>> the user's source code.  Zero for temporaries generated internally by
>> the compiler.  Stored in the @code{volatil} field and printed as
>> @samp{/v}.
>>
>> There are several other cases where we re-use this bit on different RTL 
>> constructs to mean things other than 'volatile': it pretty much only has the 
>> conventional meaning on MEM objects.
>>
>>>
>>> The reason why LRA generates the store insn in the entry bb is as follows:
>>> LRA emits insns to save volatile registers in the inheritance/splitting 
>>> pass.
>>> In this pass, LRA builds EBBs (Extended Basic Block) and traverses the 
>>> insns in
>>> the EBBs in reverse order from the last insn to the first insn. When LRA 
>>> sees a
>>> write to a pseudo (that has been assigned a volatile register), and there 
>>> is a
>>> read following the write, with an intervening call insn between the write 
>>> and read,
>>> then LRA generates a spill immediately after the write and a restore 
>>> immediately
>>> before the read. In the above test, there is an EBB containing the entry bb 
>>> and
>>> the bb with the printf call. In the entry bb, there is a write to x1 
>>> (basically
>>> a copy from x0 to x1) and in the printf bb, there is a read of x1 after the 
>>> call
>>> insn. So LRA generates a spill in the entry bb.
>>>
>>> Without patch, x19 is chosen to hold the value of x0. Since x19 is a 
>>> non-volatile,
>>> the input RTL to the shrink wrap pass does not have any code to save x19 to 
>>>

Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

2023-12-15 Thread Surya Kumari Jangala via Gcc
Hi Richard,
Here are more details about the testcase failure and my analysis/fix:

Testcase:

void f(int *i)
{
if (!i)
return;
else
{
__builtin_printf("Hi");
*i=0;
}
}

--

Assembly w/o patch:
cbz x0, .L7
stp x29, x30, [sp, -32]!
mov x29, sp
str x19, [sp, 16]
mov x19, x0
adrp    x0, .LC0
add x0, x0, :lo12:.LC0
bl  printf
str wzr, [x19]
ldr x19, [sp, 16]
ldp x29, x30, [sp], 32
ret
.p2align 2,,3
.L7:
ret

---

Assembly w/ patch:
stp x29, x30, [sp, -32]!
mov x29, sp
str x0, [sp, 24]
cbz x0, .L1
adrp    x0, .LC0
add x0, x0, :lo12:.LC0
bl  printf
ldr x1, [sp, 24]
str wzr, [x1]
.L1:
ldp x29, x30, [sp], 32
ret


As we can see above, w/o patch the test case gets shrink wrapped.

Input RTL to the LRA pass (the RTL is same both w/ and w/o patch):

BB2:
  set r95, x0
  set r92, r95
  if (r92 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set mem(r92), 0
BB4:
  ret


Register assignment by IRA:
w/o patch:
  r92-->x19
  r95-->x0
  r94-->x0

w/ patch:
  r92-->x1
  r95-->x0
  r94-->x0


RTL after LRA:

w/o patch:
BB2:
  set x19, x0
  if (x19 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set mem(x19), 0
BB4:
  ret


w/ patch:
BB2:
  set x1, x0
  set mem(sp+24), x1
  if (x1 eq 0) jump BB4
BB3:
  set x0, symbol-ref("Hi")
  x0 = call printf
  set x1, mem(sp+24)
  set mem(x1), 0
BB4:
  ret


The difference between w/o patch and w/ patch is that w/o patch, a callee-save
register (x19) is chosen to hold the value of x0 (the input parameter
register), while w/ patch a caller-save register (x1) is chosen.

W/o patch, during the shrink wrap pass, first copy propagation is done and
the 'if' insn in BB2 is changed as follows:
  set x19, x0
  if (x19 eq 0) jump BB4

changed to:
  set x19, x0
  if (x0 eq 0) jump BB4   

Next, the insn "set x19, x0" is moved down the CFG to BB3. Since x19 is a
callee-save register, the prolog gets generated in BB3, thereby resulting in
successful shrink wrapping.

W/ patch, during the shrink wrap pass, copy propagation changes BB2 as follows:
  set x1, x0
  set mem(sp+24), x1
  if (x1 eq 0) jump BB4

changed to:
  set x1, x0
  set mem(sp+24), x0
  if (x0 eq 0) jump BB4

However, the store insn (set mem(sp+24), x0) cannot be moved down to BB3;
hence the prolog gets generated in BB2 itself due to the use of 'sp'. Thereby
shrink wrapping fails.

The store insn (which basically saves x1 to the stack) is generated by the
LRA pass. This insn is needed because x1 is a caller-save register and we
have a call insn that will clobber this register. However, the store insn is
generated in the entry BB (BB2) instead of in BB3, which contains the call
insn. If the store were generated in BB3, then the testcase would be shrink
wrapped successfully. In fact, it is more efficient if the store occurs only
in the path containing the printf call instead of in the entry bb.

The reason why LRA generates the store insn in the entry bb is as follows:
LRA emits insns to save caller-save registers in the inheritance/splitting
pass. In this pass, LRA builds EBBs (Extended Basic Blocks) and traverses the
insns in the EBBs in reverse order, from the last insn to the first insn.
When LRA sees a write to a pseudo (that has been assigned a caller-save
register), and there is a read following the write with an intervening call
insn between the write and the read, then LRA generates a spill immediately
after the write and a restore immediately before the read. The spill is
needed because the call insn will clobber the caller-save register.

In the above testcase, LRA forms two EBBs: the first EBB contains BB2 & BB3
while the second EBB contains BB4.

In BB2, there is a write to x1 in the insn:
set r92, r95      // r92 is assigned x1 and r95 is assigned x0

In BB3, there is a read of x1 after the call insn:
set mem(r92), 0   // r92 is assigned x1

So LRA generates a spill in BB2 after the write to x1.

I have a patch (bootstrapped and regtested on powerpc) that makes changes in
LRA to save caller-save registers before a call instead of after the write to
the caller-save register. With this patch, both of the above tests get
successfully shrink wrapped. After committing the patch for PR111673, I plan
to get the LRA fix reviewed.

Please let me know if you need more information.

Regards,
Surya


On 14/12/23 9:41 pm, Richard Earnshaw (lists) wrote:
> On 14/12/2023 07:17, Surya Kumari Jangala via Gcc wrote:
>> Hi Richard,
>> Thanks a lot for your response!
>>
>> Another failure reported by the Linaro CI is as follows:
>>
>> Running gcc:gcc.dg/dg

Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

2023-12-13 Thread Surya Kumari Jangala via Gcc
  (p[0-7])\.b, p15\.b
> **   mov z0\.d, \2/m, \1
> ** |
> **   mov (p[0-7])\.b, p15\.b
> **   add (x[0-9]+), x0, #?1
> **   mov z0\.d, \1/m, \2
> ** )
> 
> Note, we need to swap the match names in the third insn to account for the 
> different order of the earlier instructions.
> 
> Neither is ideal, but the second is perhaps a little more bomb proof.
> 
> I don't really have a strong feeling either way, but perhaps the second is 
> slightly preferable.
> 
> Richard S: thoughts?
> 
> R.
> 
>> I believe that this is fine and the test can be modified to allow it to pass 
>> on
>> aarch64. Please let me know what you think.
>>
>> Regards,
>> Surya
>>
>>
>> On 24/11/23 4:18 pm, Richard Earnshaw wrote:
>>>
>>>
>>> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
>>>> Hi Richard,
>>>> Ping. Please let me know if the test failure that I mentioned in the mail 
>>>> below can be handled by changing the expected generated code. I am not 
>>>> conversant with arm, and hence would appreciate your help.
>>>>
>>>> Regards,
>>>> Surya
>>>>
>>>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>>>>> Hi Richard,
>>>>> I had submitted a patch for review 
>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>>>>> regarding scaling save/restore costs of callee save registers with block
>>>>> frequency in the IRA pass (PR111673).
>>>>>
>>>>> This patch has been approved by VMakarov
>>>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>>>>
>>>>> With this patch, we are seeing performance improvements with spec on x86
>>>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>>>>
>>>>> I received a mail from Linaro about some failures seen in the CI pipeline 
>>>>> with
>>>>> this patch. I have analyzed the failures and I wish to discuss the 
>>>>> analysis with you.
>>>>>
>>>>> One failure reported by the Linaro CI is:
>>>>>
>>>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, 
>>>>> r[0-9]+, \\[r[0-9]+\\] 2
>>>>>
>>>>> The diff in the assembly between trunk and patch is:
>>>>>
>>>>> 93c93
>>>>> <   push    {r4, r5}
>>>>> ---
>>>>>>     push    {fp}
>>>>> 95c95
>>>>> <   ldrexd  r4, r5, [r0]
>>>>> ---
>>>>>>     ldrexd  fp, ip, [r0]
>>>>> 99c99
>>>>> <   pop {r4, r5}
>>>>> ---
>>>>>>     ldr fp, [sp], #4
>>>>>
>>>>>
>>>>> The test fails with patch because the ldrexd insn uses fp & ip registers 
>>>>> instead
>>>>> of r[0-9]+
>>>>>
>>>>> But the code produced by patch is better because it is pushing and 
>>>>> restoring only
>>>>> one register (fp) instead of two registers (r4, r5). Hence, this test can 
>>>>> be
>>>>> modified to allow it to pass on arm. Please let me know what you think.
>>>>>
>>>>> If you need more information, please let me know. I will be sending 
>>>>> separate mails
>>>>> for the other test failures.
>>>>>
>>>
>>> Thanks for looking at this.
>>>
>>>
>>> The key part of this test is that the compiler generates LDREXD.  The 
>>> registers used for that are pretty much irrelevant as we don't match them 
>>> to any other operations within the test.  So I'd recommend just testing for 
>>> the mnemonic and not for any of the operands (ie just match "ldrexd\t").
>>>
>>> R.
>>>
>>>>> Regards,
>>>>> Surya
>>>>>
>>>>>
>>>>>


Re: Discussion about arm/aarch64 testcase failures seen with patch for PR111673

2023-11-28 Thread Surya Kumari Jangala via Gcc
Hi Richard,
Thanks a lot for your response!

Another failure reported by the Linaro CI is as follows:
(Note: I am planning to send a separate mail for each failure, as this will make
the discussion easy to track)

FAIL: gcc.target/aarch64/sve/acle/general/cpy_1.c -march=armv8.2-a+sve 
-moverride=tune=none  check-function-bodies dup_x0_m 

Expected code:

  ...
  add (x[0-9]+), x0, #?1
  mov (p[0-7])\.b, p15\.b
  mov z0\.d, \2/m, \1
  ...
  ret


Code obtained w/o patch:
addvl   sp, sp, #-1
str p15, [sp]
add x0, x0, 1
mov p3.b, p15.b
mov z0.d, p3/m, x0
ldr p15, [sp]
addvl   sp, sp, #1
ret

Code obtained w/ patch:
addvl   sp, sp, #-1
str p15, [sp]
mov p3.b, p15.b
add x0, x0, 1
mov z0.d, p3/m, x0
ldr p15, [sp]
addvl   sp, sp, #1
ret

As we can see, with the patch, the following two instructions are interchanged:
add x0, x0, 1
mov p3.b, p15.b

I believe that this is fine and the test can be modified to allow it to pass on
aarch64. Please let me know what you think.

Regards,
Surya


On 24/11/23 4:18 pm, Richard Earnshaw wrote:
> 
> 
> On 24/11/2023 08:09, Surya Kumari Jangala via Gcc wrote:
>> Hi Richard,
>> Ping. Please let me know if the test failure that I mentioned in the mail 
>> below can be handled by changing the expected generated code. I am not 
>> conversant with arm, and hence would appreciate your help.
>>
>> Regards,
>> Surya
>>
>> On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
>>> Hi Richard,
>>> I had submitted a patch for review 
>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
>>> regarding scaling save/restore costs of callee save registers with block
>>> frequency in the IRA pass (PR111673).
>>>
>>> This patch has been approved by VMakarov
>>> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
>>>
>>> With this patch, we are seeing performance improvements with spec on x86
>>> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
>>>
>>> I received a mail from Linaro about some failures seen in the CI pipeline 
>>> with
>>> this patch. I have analyzed the failures and I wish to discuss the analysis 
>>> with you.
>>>
>>> One failure reported by the Linaro CI is:
>>>
>>> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, 
>>> r[0-9]+, \\[r[0-9]+\\] 2
>>>
>>> The diff in the assembly between trunk and patch is:
>>>
>>> 93c93
>>> <   push    {r4, r5}
>>> ---
>>>>    push    {fp}
>>> 95c95
>>> <   ldrexd  r4, r5, [r0]
>>> ---
>>>>    ldrexd  fp, ip, [r0]
>>> 99c99
>>> <   pop {r4, r5}
>>> ---
>>>>    ldr fp, [sp], #4
>>>
>>>
>>> The test fails with patch because the ldrexd insn uses fp & ip registers 
>>> instead
>>> of r[0-9]+
>>>
>>> But the code produced by patch is better because it is pushing and 
>>> restoring only
>>> one register (fp) instead of two registers (r4, r5). Hence, this test can be
>>> modified to allow it to pass on arm. Please let me know what you think.
>>>
>>> If you need more information, please let me know. I will be sending 
>>> separate mails
>>> for the other test failures.
>>>
> 
> Thanks for looking at this.
> 
> 
> The key part of this test is that the compiler generates LDREXD.  The 
> registers used for that are pretty much irrelevant as we don't match them to 
> any other operations within the test.  So I'd recommend just testing for the 
> mnemonic and not for any of the operands (ie just match "ldrexd\t").
> 
> R.
> 
>>> Regards,
>>> Surya
>>>
>>>
>>>


Re: Discussion about arm testcase failures seen with patch for PR111673

2023-11-24 Thread Surya Kumari Jangala via Gcc
Hi Richard,
Ping. Please let me know if the test failure that I mentioned in the mail below 
can be handled by changing the expected generated code. I am not conversant 
with arm, and hence would appreciate your help.

Regards,
Surya

On 03/11/23 4:58 pm, Surya Kumari Jangala wrote:
> Hi Richard,
> I had submitted a patch for review 
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
> regarding scaling save/restore costs of callee save registers with block
> frequency in the IRA pass (PR111673).
> 
> This patch has been approved by VMakarov 
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).
> 
> With this patch, we are seeing performance improvements with spec on x86 
> (exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).
> 
> I received a mail from Linaro about some failures seen in the CI pipeline with
> this patch. I have analyzed the failures and I wish to discuss the analysis 
> with you.
> 
> One failure reported by the Linaro CI is:
> 
> FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, 
> r[0-9]+, \\[r[0-9]+\\] 2
> 
> The diff in the assembly between trunk and patch is:
> 
> 93c93
> <   push    {r4, r5}
> ---
>>   push    {fp}
> 95c95
> <   ldrexd  r4, r5, [r0]
> ---
>>   ldrexd  fp, ip, [r0]
> 99c99
> <   pop {r4, r5}
> ---
>>   ldr fp, [sp], #4
> 
> 
> The test fails with patch because the ldrexd insn uses fp & ip registers 
> instead
> of r[0-9]+
> 
> But the code produced by patch is better because it is pushing and restoring 
> only
> one register (fp) instead of two registers (r4, r5). Hence, this test can be
> modified to allow it to pass on arm. Please let me know what you think.
> 
> If you need more information, please let me know. I will be sending separate 
> mails
> for the other test failures.
> 
> Regards,
> Surya
> 
> 
> 


Discussion about arm testcase failures seen with patch for PR111673

2023-11-03 Thread Surya Kumari Jangala via Gcc
Hi Richard,
I had submitted a patch for review 
(https://gcc.gnu.org/pipermail/gcc-patches/2023-October/631849.html)
regarding scaling save/restore costs of callee save registers with block
frequency in the IRA pass (PR111673).

This patch has been approved by VMakarov 
(https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632089.html).

With this patch, we are seeing performance improvements with spec on x86 
(exchange: 5%, xalancbmk: 2.5%) and on Power (perlbench: 5.57%).

I received a mail from Linaro about some failures seen in the CI pipeline with
this patch. I have analyzed the failures and I wish to discuss the analysis 
with you.

One failure reported by the Linaro CI is:

FAIL: gcc.target/arm/pr111235.c scan-assembler-times ldrexd\tr[0-9]+, r[0-9]+, 
\\[r[0-9]+\\] 2

The diff in the assembly between trunk and patch is:

93c93
<   push    {r4, r5}
---
>   push    {fp}
95c95
<   ldrexd  r4, r5, [r0]
---
>   ldrexd  fp, ip, [r0]
99c99
<   pop {r4, r5}
---
>   ldr fp, [sp], #4


The test fails with patch because the ldrexd insn uses fp & ip registers instead
of r[0-9]+

But the code produced by patch is better because it is pushing and restoring 
only
one register (fp) instead of two registers (r4, r5). Hence, this test can be
modified to allow it to pass on arm. Please let me know what you think.

If you need more information, please let me know. I will be sending separate 
mails
for the other test failures.

Regards,
Surya





[PATCH] ira: Consider save/restore costs of callee-save registers [PR110071]

2023-09-14 Thread Surya Kumari Jangala via Gcc-patches
ira: Consider save/restore costs of callee-save registers [PR110071]

In improve_allocation() routine, IRA checks for each allocno if spilling
any conflicting allocnos can improve the allocation of this allocno.
This routine computes the cost improvement for usage of each profitable
hard register for a given allocno. The existing code in
improve_allocation() does not consider the save/restore costs of callee
save registers while computing the cost improvement.

This can result in a callee save register being assigned to a pseudo
that is live in the entire function and across a call, overriding a
non-callee save register assigned to the pseudo by graph coloring. So
the entry basic block requires a prolog, thereby causing shrink wrap to
fail.
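
The extra charge described above can be pictured with a small standalone model. This is only an illustrative sketch of the adjustment (made-up cost values, not the IRA data structures); it mirrors the shape of the formula the patch adds below:

```cpp
#include <cassert>

// Simplified model of the cost adjustment: a callee-saved register must
// be saved in the prologue and restored in the epilogue, so its use is
// penalized by the memory save/restore cost scaled by the frequency of
// the entry basic block (where the prologue is placed).  All parameter
// values used here are illustrative, not taken from any real target.
// The caller is expected to apply this only when saved_nregs != 0,
// matching the guard in the patch.
int callee_save_cost (int mem_store_cost, int mem_load_cost,
                      int saved_nregs, int hard_regno_nregs,
                      int entry_bb_freq)
{
  return ((mem_store_cost + mem_load_cost)
          * saved_nregs / hard_regno_nregs - 1)
         * entry_bb_freq;
}
```

With illustrative memory-move costs of 4 each and an entry-block frequency of 1000, a single-register save/restore adds (8*1/1 - 1) * 1000 = 7000 cost units, making a caller-saved alternative comparatively cheaper.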

2023-09-14  Surya Kumari Jangala  

gcc/
PR rtl-optimization/110071
* ira-color.cc (improve_allocation): Consider cost of callee
save registers.

gcc/testsuite/
PR rtl-optimization/110071
* gcc.target/powerpc/pr110071.c: New test.
---

diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
index 5807d6d26f6..f2e8ea34152 100644
--- a/gcc/ira-color.cc
+++ b/gcc/ira-color.cc
@@ -3150,13 +3150,15 @@ improve_allocation (void)
   int j, k, n, hregno, conflict_hregno, base_cost, class_size, word, nwords;
   int check, spill_cost, min_cost, nregs, conflict_nregs, r, best;
   bool try_p;
-  enum reg_class aclass;
+  enum reg_class aclass, rclass;
   machine_mode mode;
   int *allocno_costs;
   int costs[FIRST_PSEUDO_REGISTER];
   HARD_REG_SET conflicting_regs[2], profitable_hard_regs;
   ira_allocno_t a;
   bitmap_iterator bi;
+  int saved_nregs;
+  int add_cost;
 
   /* Don't bother to optimize the code with static chain pointer and
  non-local goto in order not to spill the chain pointer
@@ -3194,6 +3196,7 @@ improve_allocation (void)
 					 conflicting_regs,
 					 &profitable_hard_regs);
   class_size = ira_class_hard_regs_num[aclass];
+  mode = ALLOCNO_MODE (a);
   /* Set up cost improvement for usage of each profitable hard
 register for allocno A.  */
   for (j = 0; j < class_size; j++)
@@ -3207,6 +3210,22 @@ improve_allocation (void)
  costs[hregno] = (allocno_costs == NULL
   ? ALLOCNO_UPDATED_CLASS_COST (a) : allocno_costs[k]);
  costs[hregno] -= allocno_copy_cost_saving (a, hregno);
+
+ if ((saved_nregs = calculate_saved_nregs (hregno, mode)) != 0)
+ {
+   /* We need to save/restore the hard register in
+  epilogue/prologue.  Therefore we increase the cost.
+  Since the prolog is placed in the entry BB, the frequency
+  of the entry BB is considered while computing the cost.  */
+   rclass = REGNO_REG_CLASS (hregno);
+	   add_cost = ((ira_memory_move_cost[mode][rclass][0]
+			+ ira_memory_move_cost[mode][rclass][1])
+		       * saved_nregs / hard_regno_nregs (hregno, mode) - 1)
+		      * REG_FREQ_FROM_BB (ENTRY_BLOCK_PTR_FOR_FN (cfun));
+   costs[hregno] += add_cost;
+ }
+
  costs[hregno] -= base_cost;
  if (costs[hregno] < 0)
try_p = true;
diff --git a/gcc/testsuite/gcc.target/powerpc/pr110071.c 
b/gcc/testsuite/gcc.target/powerpc/pr110071.c
new file mode 100644
index 000..ec03fecfb15
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr110071.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
+
+/* Verify there is an early return without the prolog and shrink-wrap
+   the function. */
+void bar ();
+long
+foo (long i, long cond)
+{
+  if (cond)
+bar ();
+  return i+1;
+}
+
+/* { dg-final { scan-rtl-dump-times "Performing shrink-wrapping" 1 
"pro_and_epilogue" } } */


[PATCH v2] swap: Fix incorrect lane extraction by vec_extract() [PR106770]

2023-09-10 Thread Surya Kumari Jangala via Gcc-patches
swap: Fix incorrect lane extraction by vec_extract() [PR106770]

In the routine rs6000_analyze_swaps(), special handling of swappable
instructions is done even if the webs that contain the swappable
instructions are not optimized, i.e., the webs do not contain any
permuting load/store instructions along with the associated register
swap instructions. Doing special handling in such webs will result in
the extracted lane being adjusted unnecessarily for vec_extract.

Another issue is that existing code treats non-permuting loads/stores
as special swappables. Non-permuting loads/stores (that have not yet
been split into a permuting load/store and a swap) are handled by
converting them into a permuting load/store (which effectively removes
the swap). As a result, if special swappables are handled only in webs
containing permuting loads/stores, then non-optimal code is generated
for non-permuting loads/stores.

Hence, in this patch, all webs containing either permuting loads/
stores or non-permuting loads/stores are marked as requiring special
handling of swappables. Swaps associated with permuting loads/stores
are marked for removal, and non-permuting loads/stores are converted to
permuting loads/stores. Then the special swappables in the webs are
fixed up.

Another issue with always handling swappable instructions is that it is
incorrect to do so in webs where loads/stores on quad word aligned
addresses are changed to lvx/stvx. Similarly, in webs where
swap(load(vector constant)) instructions are replaced with
load(swapped vector constant), the swappable instructions should not be
modified.
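
On little-endian Power8, the doubleword swap exchanges the two halves of a vector register, so an extract inside a web where the swaps are removed must select the lane from the other half. A minimal sketch of that remapping (an illustrative model of the lane adjustment, not the actual rs6000-p8swap code):

```cpp
#include <cassert>

// Illustrative lane remapping for a vec_extract inside an optimized
// web: once the doubleword swap is removed, elements that were in the
// low half of the register are found in the high half and vice versa.
unsigned adjust_lane (unsigned lane, unsigned nelts)
{
  unsigned half = nelts / 2;
  return lane >= half ? lane - half : lane + half;
}
```

The bug addressed here is that an adjustment of this kind was being applied even in webs where no swap was actually removed, so the extracted lane ended up wrong.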

2023-09-10  Surya Kumari Jangala  

gcc/
PR rtl-optimization/106770
* config/rs6000/rs6000-p8swap.cc (non_permuting_mem_insn): New
function.
(handle_non_permuting_mem_insn): New function.
(rs6000_analyze_swaps): Handle swappable instructions only in
certain webs.
(web_requires_special_handling): New instance variable.
(handle_special_swappables): Remove handling of non-permuting
load/store instructions.

gcc/testsuite/
PR rtl-optimization/106770
* gcc.target/powerpc/pr106770.c: New test.
---

diff --git a/gcc/config/rs6000/rs6000-p8swap.cc 
b/gcc/config/rs6000/rs6000-p8swap.cc
index 0388b9bd736..3a695aa1318 100644
--- a/gcc/config/rs6000/rs6000-p8swap.cc
+++ b/gcc/config/rs6000/rs6000-p8swap.cc
@@ -179,6 +179,13 @@ class swap_web_entry : public web_entry_base
   unsigned int special_handling : 4;
   /* Set if the web represented by this entry cannot be optimized.  */
   unsigned int web_not_optimizable : 1;
+  /* Set if the swappable insns in the web represented by this entry
+ have to be fixed. Swappable insns have to be fixed in :
+   - webs containing permuting loads/stores and the swap insns
+in such webs have been marked for removal
+   - webs where non-permuting loads/stores have been converted
+to permuting loads/stores  */
+  unsigned int web_requires_special_handling : 1;
   /* Set if this insn should be deleted.  */
   unsigned int will_delete : 1;
 };
@@ -1468,14 +1475,6 @@ handle_special_swappables (swap_web_entry *insn_entry, 
unsigned i)
   if (dump_file)
fprintf (dump_file, "Adjusting subreg in insn %d\n", i);
   break;
-case SH_NOSWAP_LD:
-  /* Convert a non-permuting load to a permuting one.  */
-  permute_load (insn);
-  break;
-case SH_NOSWAP_ST:
-  /* Convert a non-permuting store to a permuting one.  */
-  permute_store (insn);
-  break;
 case SH_EXTRACT:
   /* Change the lane on an extract operation.  */
   adjust_extract (insn);
@@ -2401,6 +2400,25 @@ recombine_lvx_stvx_patterns (function *fun)
   free (to_delete);
 }
 
+/* Return true if insn is a non-permuting load/store.  */
+static bool
+non_permuting_mem_insn (swap_web_entry *insn_entry, unsigned int i)
+{
+  return (insn_entry[i].special_handling == SH_NOSWAP_LD ||
+ insn_entry[i].special_handling == SH_NOSWAP_ST);
+}
+
+/* Convert a non-permuting load/store insn to a permuting one.  */
+static void
+handle_non_permuting_mem_insn (swap_web_entry *insn_entry, unsigned int i)
+{
+  rtx_insn *insn = insn_entry[i].insn;
+  if (insn_entry[i].special_handling == SH_NOSWAP_LD)
+permute_load (insn);
+  else if (insn_entry[i].special_handling == SH_NOSWAP_ST)
+permute_store (insn);
+}
+
 /* Main entry point for this pass.  */
 unsigned int
 rs6000_analyze_swaps (function *fun)
@@ -2624,25 +2642,56 @@ rs6000_analyze_swaps (function *fun)
   dump_swap_insn_table (insn_entry);
 }
 
-  /* For each load and store in an optimizable web (which implies
- the loads and stores are permuting), find the associated
- register swaps and mark them for removal.  Due to various
- optimizations we may mark the same swap more than once.  Also
- perform special handling for swappable insns that require it.  */
+  /* There are two kinds of optimizations 

[PING][PATCH] ira: update allocated_hardreg_p[] in improve_allocation() [PR110254]

2023-07-31 Thread Surya Kumari Jangala via Gcc-patches
Ping

On 21/07/23 3:43 pm, Surya Kumari Jangala via Gcc-patches wrote:
> The improve_allocation() routine does not update the
> allocated_hardreg_p[] array after an allocno is assigned a register.
> 
> If the register chosen in improve_allocation() is one that already has
> been assigned to a conflicting allocno, then allocated_hardreg_p[]
> already has the corresponding bit set to TRUE, so nothing needs to be
> done.
> 
> But improve_allocation() can also choose a register that has not been
> assigned to a conflicting allocno, and also has not been assigned to any
> other allocno. In this case, allocated_hardreg_p[] has to be updated.
> 
> 2023-07-21  Surya Kumari Jangala  
> 
> gcc/
>   PR rtl-optimization/110254
>   * ira-color.cc (improve_allocation): Update array
> ---
> 
> diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
> index 1fb2958bddd..5807d6d26f6 100644
> --- a/gcc/ira-color.cc
> +++ b/gcc/ira-color.cc
> @@ -3340,6 +3340,10 @@ improve_allocation (void)
>   }
>/* Assign the best chosen hard register to A.  */
>ALLOCNO_HARD_REGNO (a) = best;
> +
> +  for (j = nregs - 1; j >= 0; j--)
> + allocated_hardreg_p[best + j] = true;
> +
>if (internal_flag_ira_verbose > 2 && ira_dump_file != NULL)
>   fprintf (ira_dump_file, "Assigning %d to a%dr%d\n",
>best, ALLOCNO_NUM (a), ALLOCNO_REGNO (a));


[PATCH] ira: update allocated_hardreg_p[] in improve_allocation() [PR110254]

2023-07-21 Thread Surya Kumari Jangala via Gcc-patches
The improve_allocation() routine does not update the
allocated_hardreg_p[] array after an allocno is assigned a register.

If the register chosen in improve_allocation() is one that already has
been assigned to a conflicting allocno, then allocated_hardreg_p[]
already has the corresponding bit set to TRUE, so nothing needs to be
done.

But improve_allocation() can also choose a register that has not been
assigned to a conflicting allocno, and also has not been assigned to any
other allocno. In this case, allocated_hardreg_p[] has to be updated.
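
The effect of the missing update can be pictured with a toy model (illustrative only, not the actual IRA data structures): when a pseudo spanning several consecutive hard registers is assigned, every covered hard register must be flagged as allocated.

```cpp
#include <cassert>

// Toy model of the fix: after assigning hard register 'best' to an
// allocno occupying 'nregs' consecutive hard registers, mark every
// covered register in allocated_hardreg_p[], mirroring the loop the
// patch adds to improve_allocation().
const int N_HARD_REGS = 32;            // illustrative register count
bool allocated_hardreg_p[N_HARD_REGS]; // zero-initialized: all free

void assign_hard_reg (int best, int nregs)
{
  for (int j = nregs - 1; j >= 0; j--)
    allocated_hardreg_p[best + j] = true;
}
```

Without the loop, a two-register allocno assigned to, say, registers 4-5 would leave both entries false, and later consumers of allocated_hardreg_p[] would wrongly treat those registers as never allocated.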

2023-07-21  Surya Kumari Jangala  

gcc/
PR rtl-optimization/110254
* ira-color.cc (improve_allocation): Update array
---

diff --git a/gcc/ira-color.cc b/gcc/ira-color.cc
index 1fb2958bddd..5807d6d26f6 100644
--- a/gcc/ira-color.cc
+++ b/gcc/ira-color.cc
@@ -3340,6 +3340,10 @@ improve_allocation (void)
}
   /* Assign the best chosen hard register to A.  */
   ALLOCNO_HARD_REGNO (a) = best;
+
+  for (j = nregs - 1; j >= 0; j--)
+   allocated_hardreg_p[best + j] = true;
+
   if (internal_flag_ira_verbose > 2 && ira_dump_file != NULL)
fprintf (ira_dump_file, "Assigning %d to a%dr%d\n",
 best, ALLOCNO_NUM (a), ALLOCNO_REGNO (a));


Re: [PATCH v2] rs6000: fmr gets used instead of faster xxlor [PR93571]

2023-06-14 Thread Surya Kumari Jangala via Gcc-patches



On 25/02/23 3:20 pm, Ajit Agarwal via Gcc-patches wrote:
> Hello All:
> 
> Here is the patch that uses xxlor instead of fmr where possible.
> Performance results shows that fmr is better in power9 and 
> power10 architectures whereas xxlor is better in power7 and
> power 8 architectures. fmr is the only option before p7.
> 
> Bootstrapped and regtested on powerpc64-linux-gnu
> 
> Thanks & Regards
> Ajit
> 
>   rs6000: Use xxlor instead of fmr where possible
> 
>   Replaces fmr with xxlor instruction for power7 and power8
>   architectures whereas for power9 and power10 keep fmr
>   instruction.
> 
>   Perf measurement results:
> 
>   Power9 fmr:  201,847,661 cycles.
>   Power9 xxlor: 201,877,78 cycles.
>   Power8 fmr: 200,901,043 cycles.
>   Power8 xxlor: 201,020,518 cycles.

'fmr' is better than 'xxlor' for power8 according to the above numbers. Should 
we then replace fmr with xxlor?

-Surya


Re: [PATCH v4 3/4] ree: Main functionality to improve ree pass for rs6000 target.

2023-04-24 Thread Surya Kumari Jangala via Gcc-patches



On 21/04/23 8:51 pm, Ajit Agarwal via Gcc-patches wrote:

> +/* Return TRUE if the cfg has following properties.
> + bb1
> + |\
> + | \
> + |  bb2
> + |  /
> + bb3
> +
> +   whereas bb1 has IF_THEN_ELSE  and bb2 has the definition and bb3 has
> +   zero/sign/AND extensions.  */
> +

Any specific reason for requiring CFGs to have only this particular shape? The 
patch should be generic enough to work for all CFGs.

Regards,
Surya

> +static bool
> +feasible_cfg (ext_cand *cand, rtx_insn *def_insn)
> +{
> +  basic_block bb = BLOCK_FOR_INSN (cand->insn);
> +  edge fallthru_edge;
> +  edge e;
> +  edge_iterator ei;
> +
> +  FOR_EACH_EDGE (e, ei, bb->preds)
> +{
> +  rtx_insn *insn = BB_END (e->src) ? PREV_INSN (BB_END (e->src)) : NULL;
> +
> +  if (insn == NULL)
> + continue;
> +
> +  if (DEBUG_INSN_P (insn))
> + continue;
> +
> +  rtx set = single_set (insn);
> +
> +  /* Block has IF_THEN_ELSE  */
> +  if (insn && set
> +   && GET_CODE (set) == SET && SET_SRC (set)
> +   && GET_CODE (SET_SRC (set)) == IF_THEN_ELSE)
> + {
> +   if (e->dest == bb)
> + {
> +   basic_block jump_block = e->dest;
> +   if (jump_block != bb)
> + return false;
> +  }
> +  else
> +{
> +  /* def_insn block has single successor and fall through
> + edge target are the block for cand insn.  */
> +  if (single_succ_p (e->dest))
> +{
> +  fallthru_edge = single_succ_edge (e->dest);
> +  if (BB_END (fallthru_edge->dest)
> +  && bb != fallthru_edge->dest)
> +return false;
> +}
> + }
> +   }
> +}
> +
> +  /* def_insn block has single successor and fall through
> + edge target are the block for cand insn.  */
> +  if (single_succ_p (BLOCK_FOR_INSN (def_insn)))
> +{
> +  fallthru_edge = single_succ_edge (BLOCK_FOR_INSN (def_insn));
> +  if (BB_END (fallthru_edge->dest)
> +   && bb != fallthru_edge->dest)
> + return false;
> +}
> +   else
> + return false;
> +
> +  return true;
> +}
> +
> +/* Return TRUE if the candidate extension INSN and def_insn are
> +   feasible for extension elimination.
> +
> +   Things to consider:
> +
> +   cfg properties are feasible for extension elimination.
> +
> +   sign_extend with def insn as PLUS and the reaching definition
> +   of def_insn are not ASHIFT and LSHIFTRT.
> +
> +   zero_extend with def insn as XOR/IOR and the reaching definition
> +   of def_insn are not ASHIFT and LSHIFTRT.
> +
> +   The destination register of the extension insn must not be
> +   used or set between the def_insn and cand->insn exclusive.
> +
> +   AND with zero extension properties has USE and the register
> +   of cand insn are same as register of USE operand.  */
> +
> +static bool
> +eliminate_across_bbs_p (ext_cand *cand, rtx_insn *def_insn)
> +{
> +  basic_block bb = BLOCK_FOR_INSN (cand->insn);
> +
> +  if (!feasible_cfg (cand, def_insn))
> +return false;
> +
> +  rtx cand_set = single_set(cand->insn);
> +  /* The destination register of the extension insn must not be
> +  used or set between the def_insn and cand->insn exclusive.  */
> +  if (INSN_CHAIN_CODE_P (GET_CODE (def_insn))
> +  && INSN_CHAIN_CODE_P (cand->code))
> +if ((cand->code == ZERO_EXTEND)
> +  && REG_P (SET_DEST (cand_set)) && NEXT_INSN (def_insn)
> +  && reg_used_set_between_p(SET_DEST (cand_set), def_insn, cand->insn))
> +  return false;
> +
> +  if (cand->code == ZERO_EXTEND
> +  && (bb != BLOCK_FOR_INSN (def_insn)
> +  || DF_INSN_LUID (def_insn) > DF_INSN_LUID (cand->insn)))
> +return false;
> +
> +  if (rtx_is_zext_p (cand->insn))
> +{
> +  if (GET_CODE (PATTERN (BB_END (bb))) != USE)
> + return false;
> +
> +  if (REGNO (XEXP (PATTERN (BB_END (bb)), 0)) != REGNO (SET_DEST 
> (cand->expr)))
> + return false;
> +}
> +
> +  rtx set = single_set (def_insn);
> +
> +  if (!set)
> +return false;
> +
> +  if (cand->code == SIGN_EXTEND
> +  && GET_CODE (set) == SET)
> +{
> +  rtx orig_src = SET_SRC (set);
> +  machine_mode ext_src_mode;
> +
> +  ext_src_mode = GET_MODE (XEXP (SET_SRC (cand->expr), 0));
> +
> +  if (GET_MODE (SET_DEST (set)) != ext_src_mode)
> + return false;
> +
> +  if (GET_CODE (orig_src) != PLUS)
> + return false;
> +
> +  if (!REG_P (XEXP (orig_src, 0)))
> + return false;
> +
> +  if (!REG_P (XEXP (orig_src,1)))
> + return false;
> +
> +  if (GET_CODE (orig_src) == PLUS)
> + {
> +   bool def_src1
> + = def_arith_p (def_insn,
> +XEXP (SET_SRC (set), 0));
> +   bool def_src2
> + = def_arith_p (def_insn,
> +XEXP (SET_SRC (set), 1));
> +
> +   if (def_src1 || def_src2)
> + return false;
> + }
> +  

[IRA] Query about improve_allocation() routine in IRA

2023-04-19 Thread Surya Kumari Jangala via Gcc
Hi Vladimir,
I have been analyzing a test case for which shrink wrapping fails 
(on powerpc, 64bit LE). But if the same test case is slightly modified,
shrink wrapping kicks in.

Here are the two tests:

Test1 (shrink wrapped)

long
foo (long i, long cond)
{
  i = i + 1;
  if (cond)
bar ();
  return i;
}


Test2 (not shrink wrapped)

long
foo (long i, long cond)
{
  if (cond)
bar ();
  return i+1;
}

The difference between the two tests is when ‘i’ is incremented.

There is a difference in register allocation by IRA in the two cases.

Input RTL to IRA (Test1: passing case)

BB2:
  set r123, r4
  set r122, r3
  set r120, compare(r123, 0)
  set r117, r122 + 1
  if r120 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r117
  return r3

   
Input RTL to IRA (Test2: failing case)

BB2:
  set r123, r4
  set r122, r3
  set r120, compare(r123, 0)
  set r118, r122
  if r120 jump BB4 else jump BB3
BB3:
  call bar()
BB4:
  set r3, r118+1
  return r3


There is a difference in registers allocated for the pseudo register
r117 (passing case) and pseudo register r118 (failing case) by IRA.
r117 is allocated r3 while r118 is allocated r31.
Since r117 is allocated r3, r3 is spilled across the call to bar() by LRA.
And so only BB3 requires a prolog and shrink wrap is successful.
In the failing case, since r31 is assigned to r118, BB2 requires a prolog
and shrink wrap fails.

In the IRA pass, after graph coloring, both r117 and r118 get assigned
to r3. However, the routine improve_allocation() assigns r31 to r118.
The routine improve_allocation() is called after graph coloring. In this
routine, IRA checks for each allocno if spilling any conflicting allocnos
can improve the allocation of this allocno.

Going into more detail, improve_allocation() does the following:
Step 1: We first compute the cost improvement for usage of each profitable
hard register for a given allocno A. The cost improvement is computed as
follows:

costs[regno] = A->hard_reg_costs[regno];  // ‘hard_reg_costs’ is an array of
                                          // usage costs for each hard register
costs[regno] -= allocno_copy_cost_saving (A, regno);
costs[regno] -= base_cost;  // Say ‘reg’ is assigned to A; then ‘base_cost’
                            // is the usage cost of ‘reg’ for A.

Step 2: Then we process each conflicting allocno of A and update the cost
improvement for the profitable hard registers of A. Basically, we compute
the spill costs of the conflicting allocnos and update the cost (for A) of
the register that was assigned to the conflicting allocno. 

Step 3: We then find the best register among the profitable registers, spill
the conflicting allocno that uses this best register and assign the best
register to A.


However, the initial hard register costs for some of the profitable hard
registers is different in the passing and failing test cases. More
specifically, the costs in hard_reg_costs[] array are 0 for regs 14-31
in the failing case. In the passing case, the costs in hard_reg_costs[]
for regs 14-31 is 2000. 

At the end of step 1, costs[r31] is -390 for failing case(for allocno r118)
and 1610 for passing case (for allocno r117).

In step 2 for the failing test, the only conflicting allocno for r118 is
the allocno for r120 which is used to hold the value of the compare operation.
The pseudo r120 has been assigned to r100 by the graph coloring step. But
r100 is not in the set of profitable hard registers for r118. (The profitable
hard regs are: [0, 3-12, 14-31]). So the allocno for r120 is not considered
for spilling. And finally in step 3, r31 is assigned to r118.

I have a few queries:

1. A zero cost seems strange for the regs r14-r31. If using a reg in the
set [14..31] has zero cost, then why wasn’t such a reg chosen for r118
in the first place, instead of r3? 

2. In step 3, shouldn’t we restrict the register chosen to be a register
that has been assigned to a conflicting allocno? This is not the case for
the failing test. The allocno for r118 is assigned r31, but there is no
conflicting allocno that has been assigned r31.

3. In steps 1 & 2, shouldn’t we consider the register save and restore
cost too? ’r31’ being a callee-save (non-volatile) register has to be
saved before being used, whereas this is not required for r3 which is a
caller-save (volatile) register. 

Thanks in advance,
Surya


Re: [PATCH] rs6000: suboptimal code for returning bool value on target ppc

2023-03-16 Thread Surya Kumari Jangala via Gcc-patches
The issue of suboptimal code exists even for integer return values, not just 
bool return values. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784#c9
So the patch would need to take care of integer return values too.
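
The kind of source pattern under discussion can be illustrated with a hypothetical example (not taken from the PR): the comparison result is already a clean 0/1 value, yet the generated powerpc code may still apply an extra zero-extension to it, which is what the proposed pass aims to eliminate.

```cpp
#include <cassert>

// Hypothetical example of the pattern discussed: the boolean result of
// the comparison is already 0 or 1, so zero-extending it again before
// returning is redundant.
bool is_answer (long x)
{
  return x == 42;   // result is already 0 or 1
}
```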

On 16/03/23 10:50 am, Ajit Agarwal via Gcc-patches wrote:
> Hello All:
> 
> 
> This patch eliminates unnecessary zero extension instruction from power 
> generated assembly.
> Bootstrapped and regtested on powerpc64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
>   rs6000: suboptimal code for returning bool value on target ppc.
> 
>   New pass to eliminate unnecessary zero extension. This pass
>   is registered after cse rtl pass.
> 
>   2023-03-16  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000-passes.def: Registered zero elimination
>   pass.
>   * config/rs6000/rs6000-zext-elim.cc: Add new pass.
>   * config.gcc: Add new executable.
>   * config/rs6000/rs6000-protos.h: Add new prototype for zero
>   elimination pass.
>   * config/rs6000/rs6000.cc: Add new prototype for zero
>   elimination pass.
>   * config/rs6000/t-rs6000: Add new rule.
>   * expr.cc: Modified gcc assert.
>   * explow.cc: Modified gcc assert.
>   * optabs.cc: Modified gcc assert.
> ---
>  gcc/config.gcc|   4 +-
>  gcc/config/rs6000/rs6000-passes.def   |   2 +
>  gcc/config/rs6000/rs6000-protos.h |   1 +
>  gcc/config/rs6000/rs6000-zext-elim.cc | 361 ++
>  gcc/config/rs6000/rs6000.cc   |   2 +
>  gcc/config/rs6000/t-rs6000|   5 +
>  gcc/explow.cc |   3 +-
>  gcc/expr.cc   |   4 +-
>  gcc/optabs.cc |   3 +-
>  9 files changed, 379 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/config/rs6000/rs6000-zext-elim.cc
> 
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index da3a6d3ba1f..e8ac9d882f0 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -503,7 +503,7 @@ or1k*-*-*)
>   ;;
>  powerpc*-*-*)
>   cpu_type=rs6000
> - extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> + extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-zext-elim.o 
> rs6000-logue.o"
>   extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>   extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
>   extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
> @@ -538,7 +538,7 @@ riscv*)
>   ;;
>  rs6000*-*-*)
>   extra_options="${extra_options} g.opt fused-madd.opt 
> rs6000/rs6000-tables.opt"
> - extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
> + extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-zext-elim.o 
> rs6000-logue.o"
>   extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>   target_gtfiles="$target_gtfiles 
> \$(srcdir)/config/rs6000/rs6000-logue.cc 
> \$(srcdir)/config/rs6000/rs6000-call.cc"
>   target_gtfiles="$target_gtfiles 
> \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
> diff --git a/gcc/config/rs6000/rs6000-passes.def 
> b/gcc/config/rs6000/rs6000-passes.def
> index ca899d5f7af..d7500feddf1 100644
> --- a/gcc/config/rs6000/rs6000-passes.def
> +++ b/gcc/config/rs6000/rs6000-passes.def
> @@ -28,6 +28,8 @@ along with GCC; see the file COPYING3.  If not see
>   The power8 does not have instructions that automaticaly do the byte 
> swaps
>   for loads and stores.  */
>INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
> +  INSERT_PASS_AFTER (pass_cse, 1, pass_analyze_zext);
> +
>  
>/* Pass to do the PCREL_OPT optimization that combines the load of an
>   external symbol's address along with a single load or store using that
> diff --git a/gcc/config/rs6000/rs6000-protos.h 
> b/gcc/config/rs6000/rs6000-protos.h
> index 1a4fc1df668..f6cf2d673d4 100644
> --- a/gcc/config/rs6000/rs6000-protos.h
> +++ b/gcc/config/rs6000/rs6000-protos.h
> @@ -340,6 +340,7 @@ namespace gcc { class context; }
>  class rtl_opt_pass;
>  
>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
> +extern rtl_opt_pass *make_pass_analyze_zext (gcc::context *);
>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
> diff --git a/gcc/config/rs6000/rs6000-zext-elim.cc 
> b/gcc/config/rs6000/rs6000-zext-elim.cc
> new file mode 100644
> index 000..777c7a5a387
> --- /dev/null
> +++ b/gcc/config/rs6000/rs6000-zext-elim.cc
> @@ -0,0 +1,361 @@
> +/* Subroutine to eliminate redundant zero extend for power architecture.
> +   Copyright (C) 1991-2023 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published
> +   by the Free Software Foundation; either version 3, or (at your
> +   option) any later version.

Re: [PATCH] swap: Fix incorrect lane extraction by vec_extract() [PR106770]

2023-03-03 Thread Surya Kumari Jangala via Gcc-patches



On 27/02/23 9:58 pm, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Jan 04, 2023 at 01:58:19PM +0530, Surya Kumari Jangala wrote:
>> In the routine rs6000_analyze_swaps(), special handling of swappable
>> instructions is done even if the webs that contain the swappable
>> instructions are not optimized, i.e., the webs do not contain any
>> permuting load/store instructions along with the associated register
>> swap instructions. Doing special handling in such webs will result in
>> the extracted lane being adjusted unnecessarily for vec_extract.
>>
>> Modifying swappable instructions is also incorrect in webs where
>> loads/stores on quad word aligned addresses are changed to lvx/stvx.
>> Similarly, in webs where swap(load(vector constant)) instructions are
>> replaced with load(swapped vector constant), the swappable
>> instructions should not be modified.
>>
>> 2023-01-04  Surya Kumari Jangala  
>>
>> gcc/
>>  PR rtl-optimization/106770
>>  * rs6000-p8swap.cc (rs6000_analyze_swaps): .
> 
> Please add an entry?  Or multiple ones, actually.  Describe all changes.
Ok

> 
>> --- a/gcc/config/rs6000/rs6000-p8swap.cc
>> +++ b/gcc/config/rs6000/rs6000-p8swap.cc
>> @@ -179,6 +179,9 @@ class swap_web_entry : public web_entry_base
>>unsigned int special_handling : 4;
>>/* Set if the web represented by this entry cannot be optimized.  */
>>unsigned int web_not_optimizable : 1;
>> +  /* Set if the web represented by this entry has been optimized, ie,
> 
> s/ie/i.e./
> 
>> + register swaps of permuting loads/stores have been removed.  */
> 
> If it really means only exactly this, then the name isn't so good.

There is another bit in this class named "web_not_optimizable", which stands for
webs that cannot be optimized. I modeled the name of the new bit on it and
named the new bit "web_is_optimized".

> 
>> +  unsigned int web_is_optimized : 1;
> 
> And if it is as general as the name suggests, then the comment is no
> good.  Which is it?  :-)
> 
>>/* For each load and store in an optimizable web (which implies
>>   the loads and stores are permuting), find the associated
>>   register swaps and mark them for removal.  Due to various
>> - optimizations we may mark the same swap more than once.  Also
>> - perform special handling for swappable insns that require it.  */
>> + optimizations we may mark the same swap more than once. Fix up
>> + the non-permuting loads and stores by converting them into
>> + permuting ones.  */
> 
> Two spaces after a full stop is correct.  Please put that back.
Ok

> 
> Is it a good idea to convert from/to swapping load/stores in this pass at
> all?  Doesn't that belong elsewhere?  Like, in combine, where we
> already should do this.  Why does that not work?
>> -if (!root_entry->web_not_optimizable)
>> +if (!root_entry->web_not_optimizable) {
> 
> Blocks start on a new line, indented.
Ok

> 
>>mark_swaps_for_removal (insn_entry, i);
>> +  root_entry->web_is_optimized = true;
> 
> Indent using tabs where possible.
Ok

> 
>> +swap_web_entry* root_entry
>> +  = (swap_web_entry*)((&insn_entry[i])->unionfind_root ());
> 
> Space before *, in all cases. Space before the second (.  There are too
> many brackets here, too.
Ok

> 
>> +  /* Perform special handling for swappable insns that require it. 
> 
> No trailing spaces.
Ok

> 
>> + Note that special handling should be done only for those 
>> + swappable insns that are present in webs optimized above.  */
>> +  for (i = 0; i < e; ++i)
>> +if (insn_entry[i].is_swappable && insn_entry[i].special_handling &&
>> +!(insn_entry[i].special_handling == SH_NOSWAP_LD || 
>> +  insn_entry[i].special_handling == SH_NOSWAP_ST))
>>{
>>  swap_web_entry* root_entry
>>= (swap_web_entry*)((&insn_entry[i])->unionfind_root ());
>> -if (!root_entry->web_not_optimizable)
>> +if (root_entry->web_is_optimized)
>>handle_special_swappables (insn_entry, i);
>>}
> 
> Why this change?

The swap pass analyzes vector computations and removes unnecessary doubleword
swaps (xxswapdi instructions). The swap pass first constructs webs and removes
swap instructions if possible. If the web contains operations that are
sensitive to element order (i.e., insns requiring special handling), such as an
extract, then such instructions should be modified. For example, for an extract
operation the lane is changed.
However, the swap pass is changing the element order of an extract operation
that is present in unoptimized webs. Unoptimized webs are those for which
register swaps were not removed, one of the reasons being that there are no
loads/stores present in the web. For such webs, the element order of an extract
operation should not be changed.
Hence, we first mark webs that can be optimized, and only for such webs do we
call the routine handle_special_swappables() to modify operations sensitive to
element order.

[PING 3] [PATCH] swap: Fix incorrect lane extraction by vec_extract() [PR106770]

2023-02-27 Thread Surya Kumari Jangala via Gcc-patches
Hello,

Ping https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609374.html

Thanks,

Surya

On 04/01/23 1:58 pm, Surya Kumari Jangala via Gcc-patches wrote:
> swap: Fix incorrect lane extraction by vec_extract() [PR106770]
> 
> In the routine rs6000_analyze_swaps(), special handling of swappable
> instructions is done even if the webs that contain the swappable
> instructions are not optimized, i.e., the webs do not contain any
> permuting load/store instructions along with the associated register
> swap instructions. Doing special handling in such webs will result in
> the extracted lane being adjusted unnecessarily for vec_extract.
> 
> Modifying swappable instructions is also incorrect in webs where
> loads/stores on quad word aligned addresses are changed to lvx/stvx.
> Similarly, in webs where swap(load(vector constant)) instructions are
> replaced with load(swapped vector constant), the swappable
> instructions should not be modified.
> 
> 2023-01-04  Surya Kumari Jangala  
> 
> gcc/
>   PR rtl-optimization/106770
>   * rs6000-p8swap.cc (rs6000_analyze_swaps): .
> 
> gcc/testsuite/
>   PR rtl-optimization/106770
>   * gcc.target/powerpc/pr106770.c: New test.
> ---
> 
> diff --git a/gcc/config/rs6000/rs6000-p8swap.cc 
> b/gcc/config/rs6000/rs6000-p8swap.cc
> index 19fbbfb67dc..7ed39251df9 100644
> --- a/gcc/config/rs6000/rs6000-p8swap.cc
> +++ b/gcc/config/rs6000/rs6000-p8swap.cc
> @@ -179,6 +179,9 @@ class swap_web_entry : public web_entry_base
>unsigned int special_handling : 4;
>/* Set if the web represented by this entry cannot be optimized.  */
>unsigned int web_not_optimizable : 1;
> +  /* Set if the web represented by this entry has been optimized, ie,
> + register swaps of permuting loads/stores have been removed.  */
> +  unsigned int web_is_optimized : 1;
>/* Set if this insn should be deleted.  */
>unsigned int will_delete : 1;
>  };
> @@ -2627,22 +2630,43 @@ rs6000_analyze_swaps (function *fun)
>/* For each load and store in an optimizable web (which implies
>   the loads and stores are permuting), find the associated
>   register swaps and mark them for removal.  Due to various
> - optimizations we may mark the same swap more than once.  Also
> - perform special handling for swappable insns that require it.  */
> + optimizations we may mark the same swap more than once. Fix up
> + the non-permuting loads and stores by converting them into
> + permuting ones.  */
>for (i = 0; i < e; ++i)
>  if ((insn_entry[i].is_load || insn_entry[i].is_store)
>   && insn_entry[i].is_swap)
>{
>   swap_web_entry* root_entry
> = (swap_web_entry*)((&insn_entry[i])->unionfind_root ());
> - if (!root_entry->web_not_optimizable)
> + if (!root_entry->web_not_optimizable) {
> mark_swaps_for_removal (insn_entry, i);
> +  root_entry->web_is_optimized = true;
> +}
>}
> -else if (insn_entry[i].is_swappable && insn_entry[i].special_handling)
> +else if (insn_entry[i].is_swappable
> + && (insn_entry[i].special_handling == SH_NOSWAP_LD ||
> + insn_entry[i].special_handling == SH_NOSWAP_ST))
> +  {
> +swap_web_entry* root_entry
> +  = (swap_web_entry*)((&insn_entry[i])->unionfind_root ());
> +if (!root_entry->web_not_optimizable) {
> +  handle_special_swappables (insn_entry, i);
> +  root_entry->web_is_optimized = true;
> +}
> +  }
> +
> +  /* Perform special handling for swappable insns that require it. 
> + Note that special handling should be done only for those 
> + swappable insns that are present in webs optimized above.  */
> +  for (i = 0; i < e; ++i)
> +if (insn_entry[i].is_swappable && insn_entry[i].special_handling &&
> +!(insn_entry[i].special_handling == SH_NOSWAP_LD || 
> +  insn_entry[i].special_handling == SH_NOSWAP_ST))
>{
>   swap_web_entry* root_entry
> = (swap_web_entry*)((&insn_entry[i])->unionfind_root ());
> - if (!root_entry->web_not_optimizable)
> + if (root_entry->web_is_optimized)
> handle_special_swappables (insn_entry, i);
>}
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr106770.c 
> b/gcc/testsuite/gcc.target/powerpc/pr106770.c
> new file mode 100644
> index 000..84e9aead975
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106770.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-options "-mdejagnu-cpu=power8 -O3 " } 

Re: [PING 2] [PATCH] swap: Fix incorrect lane extraction by vec_extract() [PR106770]

2023-02-16 Thread Surya Kumari Jangala via Gcc-patches
Ping. Please review the patch.

On 12/01/23 10:21 pm, Surya Kumari Jangala via Gcc-patches wrote:
> Ping
> 

[PING] [PATCH] swap: Fix incorrect lane extraction by vec_extract() [PR106770]

2023-01-12 Thread Surya Kumari Jangala via Gcc-patches
Ping



[PATCH] swap: Fix incorrect lane extraction by vec_extract() [PR106770]

2023-01-04 Thread Surya Kumari Jangala via Gcc-patches
swap: Fix incorrect lane extraction by vec_extract() [PR106770]

In the routine rs6000_analyze_swaps(), special handling of swappable
instructions is done even if the webs that contain the swappable
instructions are not optimized, i.e., the webs do not contain any
permuting load/store instructions along with the associated register
swap instructions. Doing special handling in such webs will result in
the extracted lane being adjusted unnecessarily for vec_extract.

Modifying swappable instructions is also incorrect in webs where
loads/stores on quad word aligned addresses are changed to lvx/stvx.
Similarly, in webs where swap(load(vector constant)) instructions are
replaced with load(swapped vector constant), the swappable
instructions should not be modified.

2023-01-04  Surya Kumari Jangala  

gcc/
PR rtl-optimization/106770
* rs6000-p8swap.cc (rs6000_analyze_swaps): .

gcc/testsuite/
PR rtl-optimization/106770
* gcc.target/powerpc/pr106770.c: New test.
---

diff --git a/gcc/config/rs6000/rs6000-p8swap.cc 
b/gcc/config/rs6000/rs6000-p8swap.cc
index 19fbbfb67dc..7ed39251df9 100644
--- a/gcc/config/rs6000/rs6000-p8swap.cc
+++ b/gcc/config/rs6000/rs6000-p8swap.cc
@@ -179,6 +179,9 @@ class swap_web_entry : public web_entry_base
   unsigned int special_handling : 4;
   /* Set if the web represented by this entry cannot be optimized.  */
   unsigned int web_not_optimizable : 1;
+  /* Set if the web represented by this entry has been optimized, ie,
+ register swaps of permuting loads/stores have been removed.  */
+  unsigned int web_is_optimized : 1;
   /* Set if this insn should be deleted.  */
   unsigned int will_delete : 1;
 };
@@ -2627,22 +2630,43 @@ rs6000_analyze_swaps (function *fun)
   /* For each load and store in an optimizable web (which implies
  the loads and stores are permuting), find the associated
  register swaps and mark them for removal.  Due to various
- optimizations we may mark the same swap more than once.  Also
- perform special handling for swappable insns that require it.  */
+ optimizations we may mark the same swap more than once. Fix up
+ the non-permuting loads and stores by converting them into
+ permuting ones.  */
   for (i = 0; i < e; ++i)
 if ((insn_entry[i].is_load || insn_entry[i].is_store)
&& insn_entry[i].is_swap)
   {
swap_web_entry* root_entry
  = (swap_web_entry*)((&insn_entry[i])->unionfind_root ());
-   if (!root_entry->web_not_optimizable)
+   if (!root_entry->web_not_optimizable) {
  mark_swaps_for_removal (insn_entry, i);
+  root_entry->web_is_optimized = true;
+}
   }
-else if (insn_entry[i].is_swappable && insn_entry[i].special_handling)
+else if (insn_entry[i].is_swappable
+ && (insn_entry[i].special_handling == SH_NOSWAP_LD ||
+ insn_entry[i].special_handling == SH_NOSWAP_ST))
+  {
+swap_web_entry* root_entry
+  = (swap_web_entry*)((&insn_entry[i])->unionfind_root ());
+if (!root_entry->web_not_optimizable) {
+  handle_special_swappables (insn_entry, i);
+  root_entry->web_is_optimized = true;
+}
+  }
+
+  /* Perform special handling for swappable insns that require it. 
+ Note that special handling should be done only for those 
+ swappable insns that are present in webs optimized above.  */
+  for (i = 0; i < e; ++i)
+if (insn_entry[i].is_swappable && insn_entry[i].special_handling &&
+!(insn_entry[i].special_handling == SH_NOSWAP_LD || 
+  insn_entry[i].special_handling == SH_NOSWAP_ST))
   {
swap_web_entry* root_entry
  = (swap_web_entry*)((&insn_entry[i])->unionfind_root ());
-   if (!root_entry->web_not_optimizable)
+   if (root_entry->web_is_optimized)
  handle_special_swappables (insn_entry, i);
   }
 
diff --git a/gcc/testsuite/gcc.target/powerpc/pr106770.c 
b/gcc/testsuite/gcc.target/powerpc/pr106770.c
new file mode 100644
index 000..84e9aead975
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr106770.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mdejagnu-cpu=power8 -O3 " } */
+/* { dg-final { scan-assembler-times "xxpermdi" 2 } } */
+
+/* Test case to resolve PR106770  */
+
+#include <altivec.h>
+
+int cmp2(double a, double b)
+{
+vector double va = vec_promote(a, 1);
+vector double vb = vec_promote(b, 1);
+vector long long vlt = (vector long long)vec_cmplt(va, vb);
+vector long long vgt = (vector long long)vec_cmplt(vb, va);
+vector signed long long vr = vec_sub(vlt, vgt);
+
+return vec_extract(vr, 1);
+}
+


Re: [PATCH] sched1: Fix -fcompare-debug issue in schedule_region [PR105586]

2022-11-08 Thread Surya Kumari Jangala via Gcc-patches
Hi Richard,

On 21/09/22 1:03 pm, Richard Biener wrote:
> On Tue, Sep 20, 2022 at 9:18 AM Surya Kumari Jangala via Gcc-patches
>  wrote:
>>
>> Hi Jeff, Richard,
>> Thank you for reviewing the patch!
>> I have committed the patch to the gcc repo.
>> Can I backport this patch to prior versions of gcc, as this is an easy patch 
>> to backport and the issue exists in prior versions too?
> 
> It doesn't seem to be a regression so I'd error on the safe side here.

Can you please clarify whether this patch should be backported? It is not very
clear what "safe side" means here.

Thanks!
Surya

> 
> Richard.
> 
>> Regards,
>> Surya
>>
>>
>> On 31/08/22 9:09 pm, Jeff Law via Gcc-patches wrote:
>>>
>>>
>>> On 8/23/2022 5:49 AM, Surya Kumari Jangala via Gcc-patches wrote:
>>>> sched1: Fix -fcompare-debug issue in schedule_region [PR105586]
>>>>
>>>> In schedule_region(), a basic block that does not contain any real insns
>>>> is not scheduled and the dfa state at the entry of the bb is not copied
>>>> to the fallthru basic block. However a DEBUG insn is treated as a real
>>>> insn, and if a bb contains non-real insns and a DEBUG insn, it's dfa
>>>> state is copied to the fallthru bb. This was resulting in
>>>> -fcompare-debug failure as the incoming dfa state of the fallthru block
>>>> is different with -g. We should always copy the dfa state of a bb to
>>>> it's fallthru bb even if the bb does not contain real insns.
>>>>
>>>> 2022-08-22  Surya Kumari Jangala  
>>>>
>>>> gcc/
>>>> PR rtl-optimization/105586
>>>> * sched-rgn.cc (schedule_region): Always copy dfa state to
>>>> fallthru block.
>>>>
>>>> gcc/testsuite/
>>>> PR rtl-optimization/105586
>>>> * gcc.target/powerpc/pr105586.c: New test.
>>> Interesting.  We may have stumbled over this bug internally a little
>>> while ago -- not from a compare-debug standpoint, but from a "why isn't the 
>>> processor state copied to the fallthru block" point of view.   I had it on 
>>> my to investigate list, but hadn't gotten around to it yet.
>>>
>>> I think there were requests for ChangeLog updates and a function comment 
>>> for save_state_for_fallthru_edge.  OK with those updates.
>>>
>>> jeff
>>>


[PATCH] testsuite: Fix failure in test pr105586.c [PR107171]

2022-10-13 Thread Surya Kumari Jangala via Gcc-patches
testsuite: Fix failure in test pr105586.c [PR107171]

The test pr105586.c fails on a big endian system when run in 32bit
mode. The failure occurs as the test case does not guard against
unsupported __int128.

2022-10-13  Surya Kumari Jangala  

gcc/testsuite/
PR testsuite/107171
* gcc.target/powerpc/pr105586.c: Guard against unsupported
__int128.


diff --git a/gcc/testsuite/gcc.target/powerpc/pr105586.c 
b/gcc/testsuite/gcc.target/powerpc/pr105586.c
index bd397f5..3f88a09 100644
--- a/gcc/testsuite/gcc.target/powerpc/pr105586.c
+++ b/gcc/testsuite/gcc.target/powerpc/pr105586.c
@@ -1,4 +1,5 @@
 /* { dg-options "-mdejagnu-tune=power4 -O2 -fcompare-debug -fno-if-conversion -fno-guess-branch-probability" } */
+/* { dg-require-effective-target int128 } */
 
 extern int bar(int i);


Re: [PATCH] sched1: Fix -fcompare-debug issue in schedule_region [PR105586]

2022-09-20 Thread Surya Kumari Jangala via Gcc-patches
Hi Jeff, Richard,
Thank you for reviewing the patch!
I have committed the patch to the gcc repo.
Can I backport this patch to prior versions of gcc, as this is an easy patch to 
backport and the issue exists in prior versions too?

Regards,
Surya


On 31/08/22 9:09 pm, Jeff Law via Gcc-patches wrote:
> 
> 
> On 8/23/2022 5:49 AM, Surya Kumari Jangala via Gcc-patches wrote:
>> sched1: Fix -fcompare-debug issue in schedule_region [PR105586]
>>
>> In schedule_region(), a basic block that does not contain any real insns
>> is not scheduled and the dfa state at the entry of the bb is not copied
>> to the fallthru basic block. However a DEBUG insn is treated as a real
>> insn, and if a bb contains non-real insns and a DEBUG insn, its dfa
>> state is copied to the fallthru bb. This was resulting in
>> -fcompare-debug failure as the incoming dfa state of the fallthru block
>> is different with -g. We should always copy the dfa state of a bb to
>> its fallthru bb even if the bb does not contain real insns.
>>
>> 2022-08-22  Surya Kumari Jangala  
>>
>> gcc/
>> PR rtl-optimization/105586
>> * sched-rgn.cc (schedule_region): Always copy dfa state to
>> fallthru block.
>>
>> gcc/testsuite/
>> PR rtl-optimization/105586
>> * gcc.target/powerpc/pr105586.c: New test.
> Interesting.    We may have stumbled over this bug internally a little while 
> ago -- not from a compare-debug standpoint, but from a "why isn't the 
> processor state copied to the fallthru block" point of view.   I had it on my 
> to investigate list, but hadn't gotten around to it yet.
> 
> I think there were requests for ChangeLog updates and a function comment for 
> save_state_for_fallthru_edge.  OK with those updates.
> 
> jeff
> 


Re: [PATCH] sched1: Fix -fcompare-debug issue in schedule_region [PR105586]

2022-08-24 Thread Surya Kumari Jangala via Gcc-patches
Hi Peter, Segher,
Thanks for going thru the patch!
I will make the proposed changes to the Changelog.

Regards,
Surya


On 23/08/22 6:58 pm, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Aug 23, 2022 at 07:55:22AM -0500, Peter Bergner wrote:
>> It looks good to me, but I cannot approve it.
> 
> Same here (both statements).
> 
>> That said, you're missing
>> a ChangeLog entry for your new helper function.  The ChangeLog mentions
>> what changed, not why it changed.
> 
> And that is correct!  Changelogs should not say that, that isn't their
> purpose (in GCC), not what they are used for.  Explanations like that go
> in the email and/or the commit message.
> 
> The main remaining usefulness of changelogs is to spot unintended
> commmits.
> 
>> Maybe something like the following?
>>
>> gcc/
>>  PR rtl-optimization/105586
>>  * sched-rgn.cc (save_state_for_fallthru_edge): New function.
>>  (schedule_region): Use it for all blocks.
> 
> That looks perfect, it doesn't say "why" either :-)
> 
> 
> Segher


[PATCH] sched1: Fix -fcompare-debug issue in schedule_region [PR105586]

2022-08-23 Thread Surya Kumari Jangala via Gcc-patches
sched1: Fix -fcompare-debug issue in schedule_region [PR105586]

In schedule_region(), a basic block that does not contain any real insns
is not scheduled and the dfa state at the entry of the bb is not copied
to the fallthru basic block. However a DEBUG insn is treated as a real
insn, and if a bb contains non-real insns and a DEBUG insn, its dfa
state is copied to the fallthru bb. This was resulting in
-fcompare-debug failure as the incoming dfa state of the fallthru block
is different with -g. We should always copy the dfa state of a bb to
its fallthru bb even if the bb does not contain real insns.

2022-08-22  Surya Kumari Jangala  

gcc/
PR rtl-optimization/105586
* sched-rgn.cc (schedule_region): Always copy dfa state to
fallthru block.

gcc/testsuite/
PR rtl-optimization/105586
* gcc.target/powerpc/pr105586.c: New test.


diff --git a/gcc/sched-rgn.cc b/gcc/sched-rgn.cc
index 0dc2a8f2851..6c9202c0b2b 100644
--- a/gcc/sched-rgn.cc
+++ b/gcc/sched-rgn.cc
@@ -3082,6 +3082,24 @@ free_bb_state_array (void)
   bb_state = NULL;
 }
 
+static void
+save_state_for_fallthru_edge (basic_block last_bb, state_t state)
+{
+  edge f = find_fallthru_edge (last_bb->succs);
+  if (f
+  && (!f->probability.initialized_p ()
+ || (f->probability.to_reg_br_prob_base () * 100
+ / REG_BR_PROB_BASE
+ >= param_sched_state_edge_prob_cutoff)))
+  {
+memcpy (bb_state[f->dest->index], state,
+   dfa_state_size);
+if (sched_verbose >= 5)
+  fprintf (sched_dump, "saving state for edge %d->%d\n",
+  f->src->index, f->dest->index);
+  }
+}
+
 /* Schedule a region.  A region is either an inner loop, a loop-free
subroutine, or a single basic block.  Each bb in the region is
scheduled after its flow predecessors.  */
@@ -3155,6 +3173,7 @@ schedule_region (int rgn)
   if (no_real_insns_p (head, tail))
{
  gcc_assert (first_bb == last_bb);
+ save_state_for_fallthru_edge (last_bb, bb_state[first_bb->index]);
  continue;
}
 
@@ -3173,26 +3192,13 @@ schedule_region (int rgn)
   curr_bb = first_bb;
   if (dbg_cnt (sched_block))
 {
- edge f;
  int saved_last_basic_block = last_basic_block_for_fn (cfun);
 
  schedule_block (_bb, bb_state[first_bb->index]);
  gcc_assert (EBB_FIRST_BB (bb) == first_bb);
  sched_rgn_n_insns += sched_n_insns;
  realloc_bb_state_array (saved_last_basic_block);
- f = find_fallthru_edge (last_bb->succs);
- if (f
- && (!f->probability.initialized_p ()
- || (f->probability.to_reg_br_prob_base () * 100
- / REG_BR_PROB_BASE
- >= param_sched_state_edge_prob_cutoff)))
-   {
- memcpy (bb_state[f->dest->index], curr_state,
- dfa_state_size);
- if (sched_verbose >= 5)
-   fprintf (sched_dump, "saving state for edge %d->%d\n",
-f->src->index, f->dest->index);
-   }
+ save_state_for_fallthru_edge (last_bb, curr_state);
 }
   else
 {
diff --git a/gcc/testsuite/gcc.target/powerpc/pr105586.c b/gcc/testsuite/gcc.target/powerpc/pr105586.c
new file mode 100644
index 000..bd397f58bc0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr105586.c
@@ -0,0 +1,19 @@
+/* { dg-options "-mdejagnu-tune=power4 -O2 -fcompare-debug -fno-if-conversion -fno-guess-branch-probability" } */
+
+extern int bar(int i);
+
+typedef unsigned long u64;
+int g;
+
+__int128 h;
+
+void
+foo(int a, int b) {
+  int i;
+  char u8_1 = g, u8_3 = a;
+  u64 u64_1 = bar(0), u64_3 = u8_3 * u64_1;
+  __int128 u128_1 = h ^ __builtin_expect(i, 0);
+  u64 u64_4 = u64_1 << ((__builtin_sub_overflow_p(0, u8_1, (u64)0) ^ u128_1) & 8);
+  u64 u64_r = b + u64_3 + u64_4;
+  bar((short)u64_r + u8_3);
+}


Re: [RFC] Analysis of PR105586 and possible approaches to fix issue

2022-07-27 Thread Surya Kumari Jangala via Gcc
Hi Richard,

On 27/07/22 12:28 pm, Richard Biener wrote:
> On Tue, Jul 26, 2022 at 8:55 PM Surya Kumari Jangala via Gcc
>  wrote:

>> To fix the issue of insns being assigned different cycles, there are two 
>> possible solutions:
>>
>> 1. Modify no_real_insns_p() to treat a DEBUG insn as a non-real insn 
>> (similar to NOTE and LABEL). With this change, bb 3 will not be scheduled in 
>> debug mode (as it contains only NOTE and DEBUG insns). If scheduling is 
>> skipped, then bb 3’s state is not copied to bb 4, and the initial dfa state 
>> of bb 4 will be the same in both debug and non-debug modes.
>> 2. Copy the dfa state of a basic block to its fall-thru block only if the 
>> basic block contains ‘real’ insns (i.e., at least one insn which is not a 
>> LABEL, NOTE or DEBUG). This will prevent the dfa state from being copied 
>> from bb 3 to bb 4 in debug mode.
> 
> Do you know why the DFA state is not always copied to the fallthru
> destination and then copied further even if the block does not contain

I am not sure why the DFA state is not always copied to the fallthru 
destination. It is not very apparent from the code.

> any (real) insns?  It somewhat sounds like premature optimization
> breaking things here...
> 

Now that you mention it, yes, it does seem suboptimal that the DFA state is not 
always copied. I don’t see any reason why the DFA state shouldn’t always be 
copied, and I think this is the fix for this bug. '-g' mode is behaving 
correctly; it is the non-debug mode which is incorrect.

Thanks,
Surya



[RFC] Analysis of PR105586 and possible approaches to fix issue

2022-07-26 Thread Surya Kumari Jangala via Gcc
Hi,
I am working on PR105586. This is a -fcompare-debug failure, with the 
differences starting during sched1 pass. The sequence of two instructions in a 
basic block (block 4) is flipped with -g.
In addition to this, another difference is that an insn is assigned a different 
cycle in debug vs non-debug modes.
More specifically, the 2nd instruction in basic block 4 is assigned to cycle 0 
w/o -g but to cycle 1 w/ -g. I felt that this could be resulting in the 
flipping of the later insns in the bb, so I started to investigate the 
difference in cycle assignment.

In the routine schedule_block(), after scheduling an insn (schedule_insn()), 
prune_ready_list() is called if the ready list is not empty. This routine goes 
through all the insns in the ready list, and for each insn it checks if there is 
a state transition. If there is no state transition, then INSN_TICK(insn) is set 
to current_cycle+1.

After scheduling the first insn in bb 4, when prune_ready_list() is called, we 
see that in the debug-mode run there is no state transition for the second 
insn and hence its INSN_TICK is updated. In the non-debug run, a state 
transition exists and the INSN_TICK is not updated. This was resulting in the 
second insn being scheduled in cycle 1 in debug mode, but in cycle 0 in 
non-debug mode.

It turned out that the initial dfa state of the basic block (‘init_state’ 
parameter of schedule_block()) was different in debug and non-debug modes. 

After scheduling a basic block, its current dfa state is copied to the 
fall-thru basic block. In other words, the initial dfa state of the fall-thru 
bb is the current state of the bb that was just scheduled. 

Basic block 4 is the fall-thru bb for basic block 3. In non-debug mode, bb 3 
has only a NOTE insn and hence scheduling of bb 3 is skipped. Since bb 3 is not 
scheduled, its state is not copied to bb 4. Whereas in debug mode, bb 3 has a 
NOTE insn and a DEBUG insn, so bb 3 is “scheduled” and its dfa state is copied 
to bb 4. [The dfa state of bb 3 is obtained from its parent bb, i.e., bb 2.] 
Hence the initial dfa state of bb 4 is different in debug and non-debug modes 
due to the difference in the insns in the predecessor bb (bb 3).

The routine no_real_insns_p() is called to check if scheduling can be skipped 
for a basic block. This routine checks for NOTE and LABEL insns and it returns 
‘true’ if a basic block contains only NOTE/LABEL insns. Hence, any basic block 
which has only NOTE or LABEL insns is not scheduled.

To fix the issue of insns being assigned different cycles, there are two 
possible solutions:

1. Modify no_real_insns_p() to treat a DEBUG insn as a non-real insn (similar 
to NOTE and LABEL). With this change, bb 3 will not be scheduled in debug 
mode (as it contains only NOTE and DEBUG insns). If scheduling is skipped, then 
bb 3’s state is not copied to bb 4, and the initial dfa state of bb 4 will be 
the same in both debug and non-debug modes.
2. Copy the dfa state of a basic block to its fall-thru block only if the basic 
block contains ‘real’ insns (i.e., at least one insn which is not a LABEL, 
NOTE or DEBUG). This will prevent the dfa state from being copied from bb 3 
to bb 4 in debug mode.

I decided to take approach 1 and I changed no_real_insns_p() to check for DEBUG 
insns in addition to NOTE and LABEL insns.

But this resulted in an internal compiler error on the bug's testcase. The 
reason was that the dependency nodes and lists of the insns in a basic block 
are not freed after the bb is scheduled. The routine sched_analyze() allocates 
dependency nodes and lists for each insn in an extended basic block only if the 
insn is a NONDEBUG_INSN or DEBUG_INSN. So in debug mode the scheduler allocates 
dependencies for bb 3, but in non-debug mode no dependencies are allocated. The 
dependencies are freed after all the blocks in a region are scheduled, but the 
routine to free the dependencies is called for a bb only if no_real_insns_p() 
returns false for that bb. With approach 1, no_real_insns_p() returns true for 
bb 3 and hence the dependencies are not freed.

I added some code to not create dependencies if a bb contains only NOTE, LABEL 
and DEBUG insns. This makes the test pass but I am hitting an internal compiler 
error during bootstrapping.

I wish to get some inputs/feedback if approach 1 is the correct way to fix the 
issue, or should I take approach 2.

Thanks,
Surya


[PATCH] regrename: Fix -fcompare-debug issue in check_new_reg_p [PR105041]

2022-06-10 Thread Surya Kumari Jangala via Gcc-patches
regrename: Fix -fcompare-debug issue in check_new_reg_p [PR105041]

In check_new_reg_p, the nregs of a du chain is computed by obtaining the MODE
of the first element in the chain and then calling hard_regno_nregs() with that
MODE. But the first element of the chain can be a DEBUG_INSN, whose mode need
not be the same as the rest of the elements in the du chain. This was resulting
in a -fcompare-debug failure, as check_new_reg_p was returning a different
result with -g for the same candidate register. We can instead obtain nregs
from the du chain itself.

2022-06-10  Surya Kumari Jangala  

gcc/
PR rtl-optimization/105041
* regrename.cc (check_new_reg_p): Use nregs value from du chain.

gcc/testsuite/
PR rtl-optimization/105041
* gcc.target/powerpc/pr105041.c: New test.


diff --git a/gcc/regrename.cc b/gcc/regrename.cc
index 10271e1..f651351 100644
--- a/gcc/regrename.cc
+++ b/gcc/regrename.cc
@@ -324,8 +324,7 @@ static bool
 check_new_reg_p (int reg ATTRIBUTE_UNUSED, int new_reg,
 class du_head *this_head, HARD_REG_SET this_unavailable)
 {
-  machine_mode mode = GET_MODE (*this_head->first->loc);
-  int nregs = hard_regno_nregs (new_reg, mode);
+  int nregs = this_head->nregs;
   int i;
   struct du_chain *tmp;
 
diff --git a/gcc/testsuite/gcc.target/powerpc/pr105041.c b/gcc/testsuite/gcc.target/powerpc/pr105041.c
new file mode 100644
index 000..89eed1c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr105041.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target be } */
+/* { dg-options "-m32 -mdejagnu-cpu=power4 -O2 -fcompare-debug -fharden-compares -frename-registers" } */
+
+double m;
+int n;
+
+unsigned int
+foo (unsigned int x, int y)
+{
+  long long int a = y, b = !a;
+  int c = 0;
+
+  if (b != x)
+while ((int) m == a)
+  {
+c = a;
+a = 0;
+  }
+
+  n = b = y;
+
+  return x + c;
+}


[COMMITTED] MAINTAINERS: Add myself for write after approval

2022-05-13 Thread Surya Kumari Jangala via Gcc-patches

2022-05-13  Surya Kumari Jangala 

    * MAINTAINERS: Add myself to write after approval.

diff --git a/MAINTAINERS b/MAINTAINERS
index a1b84ac5646..8bca7a636b7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -464,6 +464,7 @@ Daniel Jacobowitz 
 Andreas Jaeger 
 Harsha Jagasia 
 Fariborz Jahanian 
+Surya Kumari Jangala 
 Qian Jianhua 
 Janis Johnson 
 Teresa Johnson