On Sat, Sep 12, 2015 at 11:38 PM, Anders Oleson <and...@openpuma.org> wrote:
>
> From examining the __morestack code, I found that the sigprocmask
> system call is being called (twice?!) per __morestack, even when it
> should just need to switch to the next allocated segment. I did read
> the reason for that change: to allow signal handlers to be
> split-stack (ignoring that detail for the moment). A quick experiment
> shows that removing the calls to __morestack_block_signals and
> __morestack_unblock_signals brings the overhead of the hot split down
> to around 60 clocks, which is much more reasonable.

It's not so much allowing signal handlers to be split stack.  It's
handling the case of a signal occurring while the stack is being
split.  That is not so very unlikely, and it will crash your program,
as you get a SIGSEGV while trying to handle whatever signal just
arrived.

gccgo avoids this overhead by marking all signal handlers as
SA_ONSTACK, calling sigaltstack, and calling
__splitstack_block_signals to tell the morestack code that it does not
need to block signals while splitting the stack.
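
In outline the setup looks something like this (a minimal sketch, not
the actual libgo code; as I recall the convention, passing a pointer to
zero tells the split-stack code it may skip the blocking):

#include <signal.h>
#include <stdlib.h>

/* Provided by libgcc when compiling with -fsplit-stack; there is no
   installed header, so declare it directly.  */
extern void __splitstack_block_signals (int *, int *);

static void
handler (int signo)
{
  (void) signo;
  /* ... */
}

int
main (void)
{
  stack_t ss;
  struct sigaction sa;
  int off = 0;

  /* Give signal handlers their own stack, so a handler never runs on a
     partially split stack.  */
  ss.ss_sp = malloc (SIGSTKSZ);
  ss.ss_size = SIGSTKSZ;
  ss.ss_flags = 0;
  sigaltstack (&ss, NULL);

  /* Every handler has to be installed with SA_ONSTACK.  */
  sa.sa_handler = handler;
  sa.sa_flags = SA_ONSTACK;
  sigemptyset (&sa.sa_mask);
  sigaction (SIGUSR1, &sa, NULL);

  /* Now tell __morestack it no longer needs the sigprocmask calls;
     zero here means "do not block signals while splitting".  */
  __splitstack_block_signals (&off, NULL);

  /* ... rest of the program ... */
  return 0;
}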


> However, in concept simply switching stack segments *should* not be
> hugely expensive. I made a proof-of-concept that only does the very
> minimal work to switch from one segment to another. This is done in
> assembler (conceptually in __morestack) and eliminates the call out to
> "C" on the likely hot path where the boundary has already been crossed
> and the next segment is already big enough. If you cache the details
> of the next/prev segments (if present) in the space right below the
> bottom (limit) of each stack segment, you can shrink the time down to
> 5-6 clocks. This is probably close to the achievable floor, which was
> in part what I was trying to find out.
>
> Summary:
>   prolog overhead, no call to __morestack : < 1 clock
>   stock call to __morestack (hot): > 4000 clocks
>   without signal blocking: < 60 clocks
>   potential best case: < 6 clocks

This sounds great.
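
To make sure I follow the caching idea: I picture the per-segment
bookkeeping as something like the sketch below (guessed names and
layout, written in C for readability; the real hot path would of course
stay in assembler inside __morestack).

#include <stddef.h>

/* Hypothetical layout: these words live just below the limit of each
   stack segment, so the failing stack pointer is enough to find them
   without calling out to C.  */
struct segment_cache
{
  void *next_bottom;   /* limit of the already-allocated next segment,
			  or NULL if none has been allocated yet */
  void *next_top;      /* initial stack pointer for that segment */
  size_t next_size;    /* usable bytes in that segment */
  void *prev_top;      /* where to return to on the previous segment */
};

/* Hot path: the boundary has been crossed before and the cached next
   segment is big enough, so just switch to it; otherwise fall back to
   the existing allocation code.  */
static void *
try_cached_segment (struct segment_cache *c, size_t needed)
{
  if (c->next_bottom != NULL && c->next_size >= needed)
    return c->next_top;   /* new stack pointer */
  return NULL;            /* slow path */
}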


> I have noticed that both Go and Rust have now abandoned the split
> stack approach due to performance issues. In Go, the idea of having
> zillions of tiny (go)co-routines or green threads is closer to my
> interest area than the Rust use case. Even on x64, I think there are
> still reasons for wanting to break out of needing large linear stacks.
> Or it may be useful in other embedded applications. But in Go,
> apparently the fix is to copy the stack (all of it?), which seems
> pretty drastic, expensive and really tricky. At least it would only
> happen once. I was wondering if there has been any thought of doing
> more work to optimize -fsplit-stack? Does the Go stack-copy
> implementation have other issues?
>
> Another area I didn't explore was that certain leaf and small routines
> with known maximum stack usage could avoid needing the prolog. This
> might ameliorate much of the size issue.
>
> Bottom line is that I don't know whether this is something anyone
> still has any interest in, but in theory at least the "hot-split"
> problem could be improved significantly. At least I learned what I was
> trying to, and I put this out in case it is of use/interest to anyone.

I'm interested in this.  But I'm also interested in moving gccgo to
the stack copying approach.  Copying the stack should be cheaper than
splitting the stack for a long running program.  And, as Keith said,
it makes program performance more predictable, which is valuable in
itself.

Copying the stack requires knowing precisely where every pointer on the
stack is.  In Go this can be done, since Go has no unions.  In C/C++ it
requires moving all unions that combine pointers with non-pointers off
the stack.
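
For example (a hypothetical union, just to illustrate the problem):

#include <stdint.h>

/* If a local of this type is live across a stack copy, there is no way
   to tell whether 'bits' currently holds an integer that must be left
   alone or 'ptr' holds a pointer that must be adjusted.  */
union value
{
  uintptr_t bits;
  struct node *ptr;
};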

Implementing this in GCC will require adding stackmaps with
pointer/non-pointer bits to the stack unwind information.  We'll need
the information both for the stack frame and for callee-saved
registers.  This will mean that REG_POINTER has to be accurate; I
don't know whether it really is.
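
Concretely, I imagine something along these lines attached to each call
site or safepoint (entirely hypothetical names and layout, just to show
the shape of the information we would need):

#include <stdint.h>

/* One record per safepoint.  */
struct stack_map_entry
{
  uintptr_t pc;                   /* address of the safepoint */
  uint32_t frame_words;           /* size of the frame in words */
  const uint8_t *slot_is_pointer; /* frame_words bits: pointer or not */
  uint32_t callee_saved_ptr_regs; /* bitmask over callee-saved regs */
};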

Ian
