Hello,

Below is a description of a very annoying bug we are witnessing
on ARM. I have ideas of possible ways to address it, but don't yet
have a patch.

The problem touches the synchronization of register elimination,
dataflow analysis and arm_expand_prologue. The overall logic is pretty
intricate, so I would appreciate at least insights from maintainers on
what would seem the most plausible direction.

If someone wants to take it, even better. I can certainly open a PR :)


The Ada testcase below, compiled with -Og -mapcs by an arm-eabi
compiler or simply with -Og by an arm-wrs-vxworks compiler (apcs
by default), produces code which uses a stack slot beyond the
allocated stack area for the "for_each" nested function. 

The slot at hand is used as the stack home for the static chain,
which gets clobbered on the first call to Read and then bad things
happen. The actual manifestation we observed with the real code
was "fun" but I'll save that for the time being - that message is
long enough :)

We started observing this with a gcc-8 compiler and I could reproduce
with a pristine mainline early this week.

Here's a example compilation and the output we get, with notes
describing what register points where as the code proceeds:

$ arm-wrs-vxworks-gcc -S  --RTS=kernel p.adb -Og  -gnatn -o - |& grep -A 30 
"for_each.*:"

p__compute_blocks__for_each_item$2970:
.LFB5:
        .cfi_startproc
        @ Nested: function declared inside another function.
        @ args = 0, pretend = 0, frame = 36
        @ frame_needed = 1, uses_anonymous_args = 0

        str     ip, [sp, #-4]!  -- push ip

        add     ip, sp, #4      -- ip = entry_sp

        push    {r4, r5, r6, r7, r8, r9, r10, fp, ip, lr, pc} -- push 11 regs, 
44 bytes

        sub     fp, ip, #8  -- fp = ip - 8, that is, entry_sp - 8, 
                            -- first registers save slot.

        ldr     ip, [fp, #4]

        sub     sp, sp, #32  -- allocate 32 more bytes. sp is entry_sp - 80 here
                             -- (push ip = 4 bytes, multi push = 44 bytes, then
                             --  allocate 32)
        mov     r8, r0
        str     r1, [fp, #-68]
        str     ip, [fp, #-76] -- store ip at fp - 76, that is, entry_sp - 84,
                               -- 4 bytes beyond the end of the allocated frame.


The problematic store is initially emitted as

  p.adb.235r.expand:(insn 3 2 4 2 (set (reg/f:SI 111 [ CHAIN$45 ])
  p.adb.235r.expand-        (reg:SI 12 ip [ CHAIN$45 ])) "p.adb":24:7 -1
  p.adb.235r.expand-     (nil))

by expand_function_start, then becomes

  p.adb.279r.reload:(insn 3 320 4 2 (set (mem/c:SI (plus:SI (reg/f:SI 11 fp)
  p.adb.279r.reload-                (const_int -76 [0xffffffffffffffb4])) [19 
%sfp+-36 S4 A32])
  p.adb.279r.reload-        (reg:SI 12 ip [ CHAIN$47 ])) "p.adb":29:7 179 
{*arm_movsi_insn}
  p.adb.279r.reload-     (nil))



The core issue is a disagreement between register elimination
(which turns %sfp-36 into %fp-76) and prologue expansion regarding
static_chain_stack_bytes(), used both in arm_compute_frame_layout()
and arm_expand_prologue().

arm_compute_static_chain_stack_bytes() returns 0 during the whole
of register allocation, up to the end of register elimination and
reload_completed flipped to true.

The key factor is arm_r3_live_at_start_p() returning false all along.

Then we very quickly get to:

  if (optimize)
     df_analyze ();

in do_reload(), and arm_r3_live_at_start_p () starts returning true.

arm_expand_prologue proceeds on this basis and decides to
use a slot just ahead of "the stack frame" to temporarily store
IP.

Even though this store isn't really part of the "register save"
operations per se, the code sort of pretends it is with:

          gcc_assert(arm_compute_static_chain_stack_bytes() == 4);
          saved_regs += 4;

which later on shortens the amount of allocated stack by:

  if (offsets->outgoing_args != offsets->saved_args + saved_regs)
    { ...
      amount = GEN_INT (offsets->saved_args + saved_regs
                        - offsets->outgoing_args);

All in all, expand_prologue operates from a view of the stack
layout which doesn't match what register elimination has seen,
and the sfp->fp elimination is off by 4.


The first idea that came to mind was that relying on
r3_live_at_start_p during register elimination seems
fragile, and maybe the prologue expansion should
use a different spot for the ip temp storage. 

The prologue logic is complex though, so I'm not 100% sure
this is a reasonable track.

Another idea was to detect the perception mismatch and
either skip the "use slot above frame" option in this case,
or arrange to adjust the final stack pointer update accordingly.

The first of these last two options seems cleaner as it
goes towards a more precise synchronisation between the layout
assumed by register elimination and the layout actually setup
by the prologue expander.

But I haven't yet experimented with it and I'm not certain
how viable that would be in principle.

The last option I have thought of was to perform another round
of register elimination just for offsets adjustments after
df_analyze. Not sure if that makes much sense (are we sure that
this would never invalidate dataflow analysis ?)

Thoughts ? Taker for a PR ?

Thanks much in advance,

With Kind Regards,

Olivier

-------------

package P is
  procedure Compute_Blocks (Blocks  : out Integer);   
end;

package body P is

  function Is_White_Space (Char : Integer) return Boolean;
  pragma Inline (Is_White_Space);

  function Is_White_Space (Char : Integer) return Boolean is
  begin
     return    Char = 16#20#
       or else Char = 16#09#
       or else Char = 16#0A#
       or else Char = 16#0D#;
  end Is_White_Space;

  procedure Read
    (Str   : String;
     Index : in out Positive;
     Char  : out Integer) with Import;

  function Foo return String with Import;

  procedure Compute_Blocks (Blocks  : out Integer)
  is
     procedure On_Item (Str : String) is
     begin
        Blocks := 66;
     end;

     procedure For_Each_Item (Ch : String) is
        Index : Integer := Ch'First;
        Last, Start  : Integer;
        C     : Integer;
     begin
        while Index <= Ch'Last loop
           --  Skip leading spaces
           while Index <= Ch'Last loop
              Start := Index;
              Read (Ch, Index, C);
              exit when not Is_White_Space (C);
              Start := Ch'Last + 1;
           end loop;

           exit when Start > Ch'Last;

           --  Move to first whitespace after word
           while Index <= Ch'Last loop
              Last := Index;
              Read (Ch, Index, C);
              exit when Index > Ch'Last or else Is_White_Space (C);
           end loop;

           if Index > Ch'Last and then not Is_White_Space (C) then
              On_Item (Ch (Start .. Index - 1));
              exit;
           else
              On_item (Ch (Start .. Last - 1)); --  Last points to a whitespace
           end if;
        end loop;
     end For_Each_Item;

  begin
     Blocks := 0;
     For_Each_Item (Foo);
  end Compute_Blocks;

end;

Reply via email to