Re: Register Allocation issues with microblaze-elf

Vladimir Makarov Wed, 13 Feb 2013 17:28:26 -0800

On 13-02-13 6:36 PM, Michael Eager wrote:

On 02/13/2013 02:38 PM, Vladimir Makarov wrote:

On 13-02-13 1:36 AM, Michael Eager wrote:

Hi --


I'm seeing register allocation problems and code size increases
with gcc-4.6.2 (and gcc-head) compared with older (gcc-4.1.2).
Both are compiled using -O3.

One test case that I have has a long series of nested if's
each with the same comparison and similar computation.

        if (n<max_no){
          n+=*(cp-*p++);
          if (n<max_no){
            n+=*(cp-*p);
              if (n<max_no){
        . . .          ~20 levels of nesting
               <more computations with 'cp' and 'p'>
                . . . }}}
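
For anyone wanting to reproduce the shape of this code, a minimal sketch follows. It is not the original test case: the function name nested_sum, the unsigned char types, and the depth of four levels (instead of ~20) are all assumptions made for illustration.

```c
/* Hypothetical reduction of the nested-if pattern described above.
   Names, types, and the nesting depth are assumed, not taken from
   the original test case. */
static int nested_sum(const unsigned char *cp, const unsigned char *p,
                      int n, int max_no)
{
    if (n < max_no) {
        n += *(cp - *p++);          /* byte at cp minus the next offset */
        if (n < max_no) {
            n += *(cp - *p++);
            if (n < max_no) {
                n += *(cp - *p++);
                if (n < max_no) {
                    n += *(cp - *p++);
                }
            }
        }
    }
    return n;
}
```

Compiling a function of this shape with -O3 and inspecting the assembly for each block is one way to check whether max_no stays in a register or is reloaded from the stack; a reduction this shallow may well not show the problem.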

Gcc-4.6.2 generates many blocks like the following:
    lwi    r28,r1,68    -- load into dead reg
    lwi    r31,r1,140    -- load p from stack
    lbui    r28,r31,0
    rsubk    r31,r28,r19
    lbui    r31,r31,0
    addk    r29,r29,r31
    swi    r31,r1,308
    lwi    r31,r1,428    -- load of max_no from stack
    cmp    r28,r31,r29    -- n in r29
    bgeid    r28,$L46

gcc-4.1.2 generates the following:
    lbui    r3,r26,3
    rsubk    r3,r3,r19
    lbui    r3,r3,0
    addk    r30,r30,r3
    swi    r3,r1,80
    cmp    r18,r9,r30    -- max_no in r9, n in r30
    bgei    r18,$L6

gcc-4.6.2 (and gcc-head) load max_no from the stack in each block.
There also are extra loads into r28 (which is not used) and r31 at
the start of each block.  Only r28, r29, and r31 are used.

I'm having a hard time telling what is happening or why.  The
IRA dump has this line:
   Ignoring reg 772, has equiv memory
where pseudo 772 is loaded with max_no early in the function.

The reload dump has
Reloads for insn # 254
Reload 0: reload_in (SI) = (reg/v:SI 722 [ max_no ])
    GR_REGS, RELOAD_FOR_INPUT (opnum = 1)
    reload_in_reg: (reg/v:SI 722 [ max_no ])
    reload_reg_rtx: (reg:SI 31 r31)
and similar for each of the other insns using 722.

This is followed by
  Spilling for insn 254.
  Using reg 31 for reload 0
for each insn using pseudo 722.

Any idea what is going on?

So many changes have happened since then (7 years ago) that it is very
hard for me to say anything definite. I also have no gcc-4.1 for
microblaze (as far as I can see, microblaze was added to public gcc in
the 4.6 version), which makes it even more difficult for me to say
something useful.

First of all, the new RA (IRA) was introduced in gcc4.4, and it uses
different heuristics (Chaitin-Briggs graph coloring vs Chow's priority RA).

We could blame IRA if the RA had the same starting conditions in gcc4.1
and gcc4.6-gcc4.8. But I am sure they are not the same. More aggressive
optimizations create higher register pressure. I compared the peak reg
pressure in the test for gcc4.6 and gcc4.8. It became higher (from 102
to 106). I guess the increase since gcc4.1 was even bigger.

I thought about register pressure causing this, but I think that should
cause spilling of one of the registers which were not used in this long
sequence, rather than causing a large number of additional loads.

Longer-living pseudos have a higher probability of being spilled, as they
usually conflict with more pseudos during their lives, and spilling them
frees a hard reg for many conflicting pseudos. That is how RA heuristics
work (in the old RA, log (live range span) was used: the bigger the
value, the higher the probability of spilling).
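
The intuition can be shown with a toy sketch. This is not GCC's actual heuristic: the struct and function names are hypothetical, and the rule (spill the pseudo with the most conflicts, break ties by longer span) ignores use frequency and cost, which a real RA takes into account.

```c
#include <stddef.h>

/* Toy live range: the pseudo is live over program points [start, end]. */
struct live_range {
    int start;
    int end;
};

/* Two ranges conflict when they overlap at any program point. */
static int ranges_conflict(struct live_range a, struct live_range b)
{
    return a.start <= b.end && b.start <= a.end;
}

/* Number of other pseudos whose live ranges overlap that of pseudo i. */
static int conflict_count(const struct live_range *r, size_t n, size_t i)
{
    int count = 0;
    for (size_t j = 0; j < n; j++)
        if (j != i && ranges_conflict(r[i], r[j]))
            count++;
    return count;
}

/* Toy rule (an assumption, not GCC's formula): pick the pseudo with the
   most conflicts, breaking ties by the longer span.
   Returns the index of the chosen pseudo, or n if the array is empty. */
static size_t pick_spill_candidate(const struct live_range *r, size_t n)
{
    size_t best = n;
    int best_conflicts = -1;
    int best_span = -1;
    for (size_t i = 0; i < n; i++) {
        int c = conflict_count(r, n, i);
        int span = r[i].end - r[i].start;
        if (c > best_conflicts || (c == best_conflicts && span > best_span)) {
            best = i;
            best_conflicts = c;
            best_span = span;
        }
    }
    return best;
}
```

With one range covering the whole function (as max_no does in the test case) and several short ranges inside it, the long range conflicts with every short one, so this toy rule selects it for spilling.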

Perhaps the cost analysis has a problem.

I've checked it and it looks ok to me.

RA is focused on generating faster code. Looking at the fragment you
provided, it is hard to say anything about it. I tried -Os with gcc4.8
and it generates the desired code for the fragment in question (by the
way, the peak register pressure decreased to 66 in this case).

It's both larger and slower, since the additional loads take much
longer. I'll take a look at -Os.

It looks like the values of p++ are being pre-calculated and stored on
the stack. This results in a load, rather than an increment of a register.

If that is so, it might be another optimization which moves the p++
calculations higher. IRA does not do that (more precisely, a new IRA
feature implemented by Bernd Schmidt in gcc4.7 can move insns downwards
in the CFG to decrease reg pressure).

I checked all the RTL passes; these calculations are not created by any
RTL pass. So it is probably some tree-ssa optimization.

Any industrial RA uses heuristic algorithms; in some cases better
heuristics can work worse than worse heuristics. So you should probably
check whether there is any progress moving from gcc4.1 to gcc4.6 from a
performance point of view on a variety of benchmarks. Introducing IRA
improved code for x86 by 4% on SPEC2000. Subsequent improvements (like
using dynamic register classes) brought further performance gains.

My impression is that the performance is worse. Reports I've seen are
that the code is substantially larger, which means more instructions.

I'm skeptical about comparisons between x86 and RISC processors. What
works well for one may not work well for the other.

IRA improved code for many RISC processors, although a better RA has a
smaller effect on these processors as they have more registers.

Looking at the test code, I can draw some conclusions for myself:

o We need a common pass decreasing reg pressure (I already expressed
this in the past) as optimizations become more aggressive. Some progress
was made in making a few optimizations aware of RA (reg-pressure
scheduling, loop-invariant motion, and code hoisting), but there are too
many passes and it is wrong and impossible to make them all aware of RA.
Some register-pressure-decreasing heuristics are difficult to implement
in RA (like insn rearrangement or complex rematerialization) and this
pass could focus on them.

That might be useful.

o Implement RA live range splitting in regions other than loops or BBs
(now IRA does splitting only on loop bounds and LRA within a BB; the old
RA had no live range splitting at all).

Each of the blocks of code is in its own BB. I haven't checked, but I'd
guess that most of the registers are in use on entry and still live on
exit, so the block has no registers to allocate.

Splitting in BB scope is not profitable in this case.

I'd also recommend trying the following options concerning RA:
-fira-loop-pressure, -fsched-pressure, -fira-algorithm=CB|priority,
-fira-region=one|all|mixed. Actually, -fira-algorithm=priority together
with -fira-region=one is an analog of what the old RA did.
