Here's my attempt at adding -fsplit-stack support for s390 targets (bug 68191). Patches 1 and 2 fix s390-specific issues affecting split-stack code and can be pushed independently of the rest of the series. Patches 3 and 4 attempt to fix target-independent issues involving unconditional jumps with side effects (see below). I'm not sure I'm doing the right thing in these, and I'd really welcome feedback on them and on the general approach taken. Patch 5 is the split-stack support proper. It should be used along with the matching glibc and gold patches (I'll link them all in the bugzilla entry soon).
The generic approach is identical to x86: I add a new __private_ss field to the TCB in glibc, add a target-specific __morestack function and friends, emit a split-stack prologue, teach va_start to deal with a dedicated vararg pointer, and teach gold to recognize the split-stack prologue and to handle non-split-stack calls by bumping the requested frame size.

The differences start with the __morestack calling convention. Since pushing things on the stack is unwieldy and there's only one free register (%r0 could be used for the static chain, %r2-%r6 contain arguments, %r6-%r15 are callee-saved), I stuff the parameters in a block in the .rodata or .text section and pass the address of that parameter block in %r1. The parameter block also contains a (position-relative) address that __morestack should jump to (x86 instead mangles the return address from __morestack to compute it).

On zSeries CPUs, the parameter block is placed in .rodata, its address is loaded into %r1 by a larl instruction, and __morestack is sibling-called by a jg instruction. On older CPUs, which lack long-jump and PC-relative load-address instructions, I use the following sequence instead:

	# load .L1 to %r1
	basr	%r1, 0
.L1:
	# Load __morestack to %r1
	a	%r1, .L2-.L1(%r1)
	# Jump to __morestack and stuff return address (aka param block
	# address) to %r1.
	basr	%r1, %r1
	# param block comes here
.L3:
	.long	<frame_size>
	.long	<args_size>
	.long	.L4-.L3
	# relative __morestack address here
.L2:
	.long	__morestack-.L1
.L4:
	# __morestack jumps here

As on other targets, the call to __morestack is conditional, based on comparing the stack pointer with a field in the TCB. For zSeries, I just make the jump to __morestack conditional, while for older CPUs I emit a jump over the whole sequence.

Also, for vararg functions, I need to stuff the vararg pointer in some register. Since %r1 is again the only one guaranteed to be free, it's the one used. If __morestack is called, it'll leave the correct pointer in %r1.
Otherwise, I emit a simple load-address instruction. Since I only need that instruction on the not-called branch (as opposed to x86, which emits it on both branches), I get terser code.

Now, here come the problems. To keep optimization passes from destroying the above sequence (as well as the simpler ones with larl), I emit a pseudo-insn (split_stack_call_*) that is expanded to the above in the machine-dependent reorg phase, just like normal const pools. The instruction is considered to be an unconditional jump to the .L4 label (since __morestack will jump to an arbitrary address selected by the param block anyway, that's what it effectively is). For a zSeries CPU with a conditional call, I represent the sequence as a conditional jump instead. So overall the sequences, as emitted by s390_expand_split_stack_prologue, look as follows:

# (1) Old CPU, unconditional

	<call __morestack using basr as above, jump to .L4>
.L4:
	# Normal prologue starts here.

# (2) zSeries CPU, unconditional

	<call __morestack using larl+jg, jump to .L4>
.L4:
	# Normal prologue starts here.

# Which will expand to:

	larl	%r1, .L3
	jg	__morestack
	.section .rodata
.L3:
	# Or .long for 31-bit target.
	.quad	<frame_size>
	.quad	<args_size>
	.quad	.L4-.L3
	.text

# (3) Old CPU, conditional

	<load and compare the guard against stack pointer - nothing interesting>
	jhe	.L5
	<call __morestack using basr, jump to .L4>
.L5:
	# Compute vararg pointer (vararg functions only)
	la	%r1, 96(%r15)
.L4:
	# Normal prologue starts here.

# (4) zSeries CPU, conditional

	<load and compare the guard against stack pointer>
	<conditionally call __morestack using larl+jgl, if called jump to .L4>
	# Compute vararg pointer (vararg functions only)
	la	%r1, 160(%r15)
.L4:
	# Normal prologue starts here.

# Expands as above, except with jgl instead of jg.

Case (4) is the least problematic: conditional jumps with side effects appear to work quite well.
However, the other variants involve an unconditional jump with side effects, which causes two problems:

- If the jump is to the immediately following label (which always happens in cases (1) and (2), and for non-vararg functions in (3)), rtl_tidy_fallthru_edge mistakenly marks it as a fallthru edge, even though it correctly figures out that the jump cannot be removed due to the side effects. This causes a verification failure later.

- In case (3), since the call to __morestack is considered unlikely, the basic block with the call pseudo-insn will be moved to the end of the function if we're optimizing. Since it already ends with an unconditional jump, no new jump will be inserted (as opposed to x86). Soon afterwards, reposition_prologue_and_epilogue_notes will move NOTE_INSN_PROLOGUE_END after the last prologue instruction, which is now our pseudo-jump. Unfortunately, it doesn't consider the possibility of that being an unconditional jump, and stuffs the note right between the jump and the following barrier, again causing a verification failure.

Patches 3 and 4 of the patchset attempt to fix the above problems. For the first one, I just skip the edge if it involves an unconditional jump with side effects. For the second, I carefully extract the note from its basic block and put it after the barrier. I'm not sure either is the right approach, and would welcome any feedback.

I've also found a target-independent issue with -fsplit-stack: suppose we're compiling with -fsplit-stack and -fprofile-use or some other option that will partition the code into hot and cold sections. Further suppose that the code that ends up in .text.unlikely contains a call to a function compiled without -fsplit-stack. In that case, the linker should obviously perform the necessary transforms on the calling function's prologue to bump its frame size.
However, since the code in .text.unlikely doesn't really belong to function foo according to the symbol table, one of the following happens instead:

- x86: since foo.cold.0 is not a function (STT_NOTYPE), it's not scanned for calls to -fno-split-stack functions, which may easily result in a stack overflow at runtime.

- s390: since foo.cold.0 *is* a function (STT_FUNC), it is scanned for such calls, and the linker tries to modify foo.cold.0's split-stack prologue. This fails with a linker error, since foo.cold.0 obviously doesn't have one.

I have no idea what to do about that. Since mixing split-stack code with -fno-split-stack code is horribly broken in many ways, I'm tempted to just ignore the problem.

A few other non-obvious problems and notes:

- For old CPUs, in case (3), optimization will move the call to the end of the function... but since branches on s390 reach only 4kiB in either direction, s390_split_branches may attempt to split the branch to that block, which would fail horribly since we're before the proper prologue and cannot clobber %r14. I detect this case and move the basic block back to its original location instead.

- Likewise, s390_split_branches needed to be taught not to look at the __morestack call pseudo-insn (which is considered a jump); it'd only get confused.

- s390_chunkify_start is responsible for reloading the const pool register when branches are made between portions of a function using different const pools. In case (3), we likewise cannot do that, since %r13 cannot be clobbered yet. I just disable emitting the const pool reload in this case.

- The (ordinary) prologue needs a temp register for its own use. As per the above rationale, it also tends to pick %r1, which collides with us using it for the vararg pointer. There already was a condition that picks %r14 instead, if possible. I amended it to pick %r12 if %r1 would otherwise be picked in a vararg split-stack function, and modified s390_register_info to consider %r12 clobbered in this case.
- For leaf functions, there's a possibility that frame_size will be 0. In this case, there's no point in doing the __morestack dance. However, we need some way to tell a split-stack function apart in the linker, and perhaps at runtime as well if non-split function-pointer calls are ever implemented. We may be able to get away without that, but just in case, I emit a funny nop (nopr %r15) instead of the split-stack prologue in such functions to mark them (both x86 and ppc always emit a split-stack prologue, and I'd feel uneasy if I didn't include one).

- I use a conditional __morestack call if frame_size fits in an add-immediate instruction (16-bit signed if the CPU doesn't have extended-immediate instructions, 32-bit if it does), and an unconditional one otherwise (__morestack will check anyway, but there's not much chance of already having such a big frame).

- gold will try bumping the immediate field of the above add instruction if it's present and the new frame size still fits, and will nop out the comparison and convert the call to an unconditional one otherwise. It'll always bump the frame size in the parameter block. Thanks to that, we don't need a separate __morestack_non_split function like x86's.

- If -pg is used together with -fsplit-stack, the call to _mcount will be emitted before the split-stack prologue (as opposed to x86, which emits it after the prologue). This is not a big problem, but gold needs to account for it and recognize the _mcount call before the split-stack prologue.

I have run the testsuite on a z13 machine. In addition to running it with -fsplit-stack, I've also run it with s390_expand_split_stack_prologue modified to always emit unconditional calls (to exercise more paths in __morestack). There are a few new failures, but they can all be explained:

- the testcases for __builtin_return_address and friends hit __morestack's stack frame instead of whatever they were hoping to find.
- guality tests all break, since gdb looks at __morestack's frame instead of the one that called it. Marking guality_check with __attribute__ ((no_split_stack)) made them go away, though a better fix would be to make gdb skip __morestack frames somehow...

- some guality tests try printing function arguments after an alloca or VLA allocation with optimization. These no longer work, since the arguments are in caller-saved registers, and a call to __morestack_allocate_stack_space will destroy them.

- the .text.unlikely issue mentioned above.