Here's my attempt at adding -fsplit-stack support for s390 targets (bug 68191). Patches 1 and 2 fix s390-specific issues affecting split-stack code and can be pushed independently of the rest of the series. Patches 3 and 4 attempt to fix target-independent issues involving unconditional jumps with side effects (see below). I'm not sure I'm doing the right thing in these, and I'd really welcome feedback on them and on the general approach taken. Patch 5 is the split-stack support proper. It should be used along with the matching glibc and gold patches (I'll link them all in the bugzilla entry soon).
The generic approach is identical to x86: I add a new __private_ss field to the TCB in glibc, add a target-specific __morestack function and friends, emit a split-stack prologue, teach va_start to deal with a dedicated vararg pointer, and teach gold to recognize the split-stack prologue and to handle non-split-stack calls by bumping the requested frame size.

The differences start with the __morestack calling convention. Since pushing things on the stack is unwieldy and there's only one free register (%r0 could be used for the static chain, %r2-%r6 contain arguments, %r6-%r15 are callee-saved), I stuff the parameters in a block in the .rodata or .text section and pass the address of that parameter block in %r1. The parameter block also contains a (position-relative) address that __morestack should jump to (x86 instead mangles the return address from __morestack to compute it).

On zSeries CPUs, the parameter block is placed in .rodata, its address is loaded into %r1 by a larl instruction, and __morestack is sibling-called by a jg instruction. On older CPUs, which lack long-jump and PC-relative load-address instructions, I use the following sequence instead:

	# load .L1 to %r1
	basr	%r1, 0
.L1:
	# Load __morestack to %r1
	a	%r1, .L2-.L1(%r1)
	# Jump to __morestack and stuff return address (aka param block
	# address) to %r1.
	basr	%r1, %r1
	# param block comes here
.L3:
	.long	<frame_size>
	.long	<args_size>
	.long	.L4-.L3
	# relative __morestack address here
.L2:
	.long	__morestack-.L1
.L4:
	# __morestack jumps here

As on other targets, the call to __morestack is conditional, based on comparing the stack pointer with a field in the TCB. For zSeries, I just make the jump to __morestack conditional, while for older CPUs I emit a jump over the whole sequence.

Also, for vararg functions, I need to stuff the vararg pointer in some register. Since %r1 is again the only one guaranteed to be free, it's the one used. If __morestack is called, it'll leave the correct pointer in %r1.
Otherwise, I emit a simple load-address instruction. Since I only need that instruction on the not-called branch (as opposed to x86, which emits it on both branches), I get terser code.

Now, here come the problems. To keep optimization passes from destroying the above sequence (as well as the simpler ones with larl), I emit a pseudo-insn (split_stack_call_*) that is expanded to the above in the machine-dependent reorg phase, just like normal const pools. The instruction is considered to be an unconditional jump to the .L4 label (since __morestack will jump to an arbitrary address selected by the param block anyway, that's what it effectively is). For a zSeries CPU with a conditional call, I represent the sequence as a conditional jump instead. So overall the sequences, as emitted by s390_expand_split_stack_prologue, look as follows:

# (1) Old CPU, unconditional

	<call __morestack using basr as above, jump to .L4>
.L4:
	# Normal prologue starts here.

# (2) zSeries CPU, unconditional

	<call __morestack using larl+jg, jump to .L4>
.L4:
	# Normal prologue starts here.

# Which will expand to:

	larl	%r1, .L3
	jg	__morestack
	.section .rodata
.L3:
	# Or .long for 31-bit target.
	.quad	<frame_size>
	.quad	<args_size>
	.quad	.L4-.L3
	.text

# (3) Old CPU, conditional

	<load and compare the guard against stack pointer - nothing interesting>
	jhe	.L5
	<call __morestack using basr, jump to .L4>
.L5:
	# Compute vararg pointer (vararg functions only)
	la	%r1, 96(%r15)
.L4:
	# Normal prologue starts here.

# (4) zSeries CPU, conditional

	<load and compare the guard against stack pointer>
	<conditionally call __morestack using larl+jgl, if called jump to .L4>
	# Compute vararg pointer (vararg functions only)
	la	%r1, 160(%r15)
.L4:
	# Normal prologue starts here.

# Expands as above, except with jgl instead of jg.

Case (4) is the least problematic: conditional jumps with side effects appear to work quite well.
However, the other variants involve an unconditional jump with side effects, which causes two problems:

- If the jump is to the immediately following label (which always happens in cases (1) and (2), and for non-vararg functions in (3)), rtl_tidy_fallthru_edge mistakenly marks it as a fallthru edge, even though it correctly figures out that the jump cannot be removed due to the side effects. This causes a verification failure later.

- In case (3), since the call to __morestack is considered unlikely, the basic block with the call pseudo-insn will be moved to the end of the function if we're optimizing. Since it already ends with an unconditional jump, no new jump will be inserted (as opposed to x86). Soon afterwards, reposition_prologue_and_epilogue_notes will move NOTE_INSN_PROLOGUE_END after the last prologue instruction, which is now our pseudo-jump. Unfortunately, it doesn't consider the possibility of that being an unconditional jump, and stuffs the note right between the jump and the following barrier, again causing a verification failure.

Patches 3 and 4 of the patchset attempt to fix the above problems. For the first one, I just skip the edge if it involves an unconditional jump with side effects. For the second, I carefully extract the note from its basic block and put it after the barrier. I'm not sure either is the right approach, and would welcome any feedback.

I've also found a target-independent issue with -fsplit-stack: suppose we're compiling with -fsplit-stack and -fprofile-use or some other option that will partition the code into hot and cold sections. Further suppose that the code that ends up in .text.unlikely contains a call to a function compiled without -fsplit-stack. In that case, the linker should obviously perform the necessary transforms on the calling function's prologue to bump its frame size.
However, since the code in .text.unlikely doesn't really belong to function foo according to the symbol table, one of the following happens instead:

- x86: since foo.cold.0 is not a function (STT_NOTYPE), it's not scanned for calls to -fno-split-stack functions, which may easily result in a stack overflow at runtime.

- s390: since foo.cold.0 *is* a function (STT_FUNC), it is scanned for such calls, and the linker tries to modify foo.cold.0's split-stack prologue. This fails with a linker error, since foo.cold.0 obviously doesn't have one.

I have no idea what to do about that. Since mixing split-stack code with -fno-split-stack code is horribly broken in many ways, I'm tempted to just ignore the problem.

A few other non-obvious problems and notes:

- For old CPUs, in case (3), optimization will move the call to the end of the function... but since branches on s390 reach only 4kiB in either direction, s390_split_branches may attempt to split the branch to that block, which would fail horribly since we're before the proper prologue and cannot clobber %r14. I detect this case and move the basic block back to its original location instead.

- Likewise, s390_split_branches needed to be taught not to look at the __morestack call pseudo-insn (which is considered a jump); it'd only get confused.

- s390_chunkify_start is responsible for reloading the const pool register when branches are made between portions of a function using different const pools. In case (3), we likewise cannot do that, since %r13 cannot be clobbered yet. I just disable emitting the const pool reload in this case.

- The (ordinary) prologue needs a temp register for its own use. As per the above rationale, it also tends to pick %r1, which collides with us using it for the vararg pointer. There already was a condition that picks %r14 instead, if possible. I amended it to pick %r12 if %r1 would otherwise be picked in a vararg split-stack function, and modified s390_register_info to consider %r12 clobbered in this case.
- For leaf functions, there's a possibility that frame_size will be 0. In this case, there's no point in doing the __morestack dance. However, we need some way to tell a split-stack function apart in the linker, and perhaps at runtime as well if non-split function-pointer calls are ever implemented. We may be able to get away without that, but just in case, I emit a funny nop (nopr %r15) instead of the split-stack prologue in such functions to mark them (both x86 and ppc always emit a split-stack prologue, and I'd feel uneasy if I didn't include one).

- I use a conditional __morestack call if frame_size fits in an add-immediate instruction (16-bit signed if the CPU doesn't have extended-immediate instructions, 32-bit if it does), and an unconditional one otherwise (__morestack will check anyway, but there's not much chance of already having such a big frame).

- gold will try bumping the immediate field of the above add instruction if it's present and the new frame size still fits, and will nop out the comparison and convert the call to an unconditional one otherwise. It'll always bump the frame size in the parameter block. Thanks to that, we don't need a separate __morestack_non_split function like x86's.

- If -pg is used together with -fsplit-stack, the call to _mcount will be emitted before the split-stack prologue (as opposed to x86, which emits it after the prologue). This is not a big problem, but gold needs to account for it and recognize the _mcount call before the split-stack prologue.

I have run the testsuite on a z13 machine. In addition to running it with -fsplit-stack, I've also run it with s390_expand_split_stack_prologue modified to always emit unconditional calls (to exercise more paths in __morestack). There are a few new failures, but they can all be explained:

- the testcases for __builtin_return_address and friends hit __morestack's stack frame instead of whatever they were hoping to find.
- guality tests all break, since gdb looks at __morestack's frame instead of the one that called it. Marking guality_check with __attribute__ ((no_split_stack)) made them go away, though a better fix would be to make gdb skip __morestack frames somehow...

- some guality tests try printing function arguments after an alloca or VLA allocation with optimization. These no longer work, since the arguments are in caller-saved registers, and a call to __morestack_allocate_stack_space will destroy them.

- the .text.unlikely issue mentioned above.