------- Comment #4 from vmakarov at redhat dot com  2009-04-26 17:38 -------
The first test case is just an example showing that RA is a heuristic
solution.  Even a heuristic algorithm that works worse on average can
sometimes generate a better solution than one that works better on
average.

Here is the IRA log of pseudo assignments.  Where the old RA made a
different assignment, it is shown in parentheses:

      Popping a0(r133,l0)  -- assign reg 4 
      Popping a4(r138,l0)  -- assign reg 5
      Popping a3(r145,l0)  -- assign reg 6
      Popping a7(r137,l0)  -- assign reg 7  (4) <-- key point
      Popping a2(r146,l0)  -- spill         (7)
      Popping a9(r139,l0)  -- assign reg 2
      Popping a10(r147,l0)  -- assign reg 6 (1)
      Popping a1(r134,l0)  -- assign reg 0
      Popping a5(r143,l0)  -- assign reg 7  (4)
      Popping a6(r135,l0)  -- assign reg 0
      Popping a8(r140,l0)  -- assign reg 5  (2)

If IRA assigned hard reg 4 instead of 7 to pseudo 137, all pseudos
would get hard registers, including r139.  The old RA assigns 4 to
r137 because it prefers the first hard reg in register allocation
order when all other conditions are equal.  IRA assigns 7 because it
has a smaller cost than 4.  The cost is smaller because IRA hopes that
r135 could get 4 later (which did not happen).  That is because there
is a copy connecting the two allocnos:

  cp4:a0(r133)<->a6(r135)@125:shuffle

This copy (called a shuffle copy) originates from situations like the
following:

r2 = op (..., r1), where r1 becomes dead.

Assigning the hard register of r1 to r2 in this situation usually
results in better RA (see the literature on RA).
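
To make the shuffle situation concrete, here is a small hypothetical C
fragment (not taken from the PR testcases; the names are only for
illustration) in which such a copy would be recorded:

  int
  f (int r1)
  {
    int r2 = r1 + 1;  /* last use of r1: r1 dies in the insn setting r2 */
    return r2 * 3;    /* if r2 inherits r1's hard register, no extra
                         hard register and no move are needed here */
  }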

The following patch, which ignores the shuffle copies, results in the
same RA for this test:

Index: gcc/ira-color.c
===================================================================
--- gcc/ira-color.c     (revision 146788)
+++ gcc/ira-color.c     (working copy)
@@ -298,16 +298,18 @@ update_copy_costs (ira_allocno_t allocno
            (&ALLOCNO_UPDATED_HARD_REG_COSTS (another_allocno), cover_class,
             ALLOCNO_UPDATED_COVER_CLASS_COST (another_allocno),
             ALLOCNO_HARD_REG_COSTS (another_allocno));
-         ira_allocate_and_set_or_copy_costs
-           (&ALLOCNO_UPDATED_CONFLICT_HARD_REG_COSTS (another_allocno),
-            cover_class, 0,
-            ALLOCNO_CONFLICT_HARD_REG_COSTS (another_allocno));
          i = ira_class_hard_reg_index[cover_class][hard_regno];
          ira_assert (i >= 0);
          ALLOCNO_UPDATED_HARD_REG_COSTS (another_allocno)[i] += update_cost;
-         ALLOCNO_UPDATED_CONFLICT_HARD_REG_COSTS (another_allocno)[i]
-           += update_cost;
-
+         if (cp->insn != NULL_RTX || cp->constraint_p)
+           {
+             ira_allocate_and_set_or_copy_costs
+               (&ALLOCNO_UPDATED_CONFLICT_HARD_REG_COSTS (another_allocno),
+                cover_class, 0,
+                ALLOCNO_CONFLICT_HARD_REG_COSTS (another_allocno));
+             ALLOCNO_UPDATED_CONFLICT_HARD_REG_COSTS (another_allocno)[i]
+               += update_cost;
+           }
          queue_update_cost (another_allocno, divisor * COST_HOP_DIVISOR);
        }
     }

But this patch does not improve RA on average (today I checked code
size and performance of SPEC2000 on Core i7 in 32-bit mode), so I see
no point in committing it.

As for the second test, I did not find a difference worth worrying
about (I checked mainline and gcc-4.3.1 using -march=armv5te
-mthumb-interwork -Os -mthumb):

--- a0.s        2009-04-24 17:40:54.000000000 -0400
+++ a1.s        2009-04-24 17:40:36.000000000 -0400
@@ -1,8 +1,5 @@
        .code   16
        .file   "a.i"
-       .section        .rodata
-.LC0:
-       .ascii  "abc\000"
        .text
        .align  2
        .global test
@@ -11,31 +8,35 @@
        .type   test, %function
 test:
        push    {r4, r5, r6, r7, lr}
-       ldr     r5, [r0]
-       mov     r7, r0
-       ldrb    r3, [r5]
+       ldr     r4, [r0]
+       mov     r5, r0
+       ldrb    r3, [r4]
        cmp     r3, #91
        bne     .L2
-       mov     r0, r5
-       ldrb    r4, [r5, #1]
+       mov     r0, r4
+       ldrb    r7, [r4, #1]
        bl      func
-       add     r6, r5, #1
-       strb    r4, [r5, #1]
+       add     r6, r4, #1
+       strb    r7, [r4, #1]
        cmp     r0, #0
        bne     .L3
-       mov     r5, r6
+       mov     r4, r6
 .L2:
        ldr     r0, .L5
        mov     r1, #0
        bl      func2
-       mov     r6, r5
+       mov     r6, r4
 .L3:
-       str     r6, [r7]
+       str     r6, [r5]
        @ sp needed for prologue
        pop     {r4, r5, r6, r7, pc}
 .L6:
        .align  2
 .L5:
-       .word   .LC0
+       .word   .LANCHOR0
        .size   test, .-test
-       .ident  "GCC: (GNU) 4.3.4 20090424 (prerelease)"
+       .section        .rodata
+       .set    .LANCHOR0,. + 0
+.LC0:
+       .ascii  "abc\000"
+       .ident  "GCC: (GNU) 4.5.0 20090423 (experimental)"

--------------------------------------------------------------------

Once again, RA is a heuristic algorithm (optimal solutions for some RA
models, e.g. those based on ILP or on algorithms for the quadratic
assignment problem, are not useful in practice).  It is possible to
find a lot of tests where the old RA works better than IRA.  Analyzing
such tests takes a lot of time (although I did a lot of that during
the last 2 years of working on IRA).  Alexander, I'd really appreciate
it if you did more analysis and proposed solutions for the problems
you find instead of just posting the tests.  Posting tests is useful
too, but doing only that will make such PRs lower and lower priority
for me.  I have checked several times that IRA generates smaller code
on bigger tests for ARM, so there is already progress in RA compared
with the old RA.  But of course, there is no limit to perfection.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39836
