I'm going to put out a new bug-fix release this weekend, so please wait
until that's finished.

Otherwise, just choose a day and let us know. And, of course, tell us how
to do the changeover, since we'll each need to do it.

It sounds like this will break our rule about being able to build the next
release from the current one. Is that right? I don't object, but it will
eventually have to be explained at the time of release, unless the
changeover can be scripted so that the rule isn't broken.

On Sat, Jan 5, 2019 at 12:16 AM Taylor R Campbell <campb...@mumble.net>
wrote:

> Summary:
>
> I would like to merge some incompatible changes to the amd64 compiled
> code ABI that substantially improve system performance all around,
> especially for closure-heavy code.  It shaves about 20% off the test
> suite time (with FAST=y); some workloads run 5x or 10x faster.  The
> branch is riastradh-20181220-closentry-v2.
>
> But...this will require a flag day -- you'll have to cross-compile a
> new system from an old one, which is a pain, though less of a pain now
> that cross-fasdump-v2 is merged.
>
> When would be a good time to do this?
>
>
> Full story:
>
> Back when I wrote the amd64 back end a decade ago, I made it work, but
> I ran out of steam to make it work well; then we all started using it,
> so it became a pain to change, and some bad decisions stayed embedded
> in it.
>
> For example, the x86 representation of closures -- with embedded CALL
> instructions to jump to the real code and push the pointer to the data
> on the stack -- might have made sense in 1990, but it's about the
> worst way imaginable to abuse modern CPUs:
>
> - consing a closure entails writing instruction memory that the first
>   invocation will have to load into the instruction cache (and reload
>   after every GC),
> - the CALL is never paired with a RET so it breaks return address
>   branch target prediction,
> - and to top it off nearly every unknown procedure call shares a
>   single JMP instruction (per arity) in cmpauxmd.s (x86-64.m4) so the
>   CPU can't discriminate to refine indirect branch target prediction.
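>
> (For illustration, here is a rough C sketch of the idea; it is not the
> actual MIT Scheme object layout, and the struct and field names are
> made up.  The point is that consing an old-style closure means
> synthesizing machine code on the fly.)
>
>       #include <stdint.h>
>       #include <string.h>
>
>       /* Old style: every heap closure carries its own CALL instruction. */
>       struct old_closure {
>           uint8_t   call_insn[8];  /* CALL rel32 to the shared entry code */
>           uintptr_t free_vars[];   /* captured variables follow the code */
>       };
>
>       static void old_cons_closure(struct old_closure *c, uintptr_t entry,
>                                    const uintptr_t *vars, size_t n)
>       {
>           /* E8 rel32: CALL, relative to the end of the 5-byte instruction. */
>           int32_t rel = (int32_t)(entry - ((uintptr_t)c->call_insn + 5));
>           c->call_insn[0] = 0xE8;
>           memcpy(&c->call_insn[1], &rel, sizeof rel);    /* instruction bytes */
>           memcpy(c->free_vars, vars, n * sizeof *vars);  /* ordinary data */
>           /* The first invocation has to pull the freshly written bytes into
>              the instruction cache, and the CALL is never paired with a RET. */
>       }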
>
> Recently this was bugging me, so I took some time to fix some of the
> mistakes I'd made and confirm they give various workloads substantial
> performance improvements.
>
> Merging all these changes will require a flag day: we'll have to build
> a new system with a cross-compiler, and old coms won't work.  The
> cross-fasdump-v2 branch will make that process less painful, but it's
> still a flag day.
>
> Here are the changes I'm sitting on that require a flag day:
>
> - Split the compiled-entry type code into compiled-entry and
>   compiled-return, so that calling conventions for heap closures and
>   stack continuations don't compete on engineering constraints.
>
>   Either tag can still refer to any kind of compiled code:
>   expressions, continuations, open or global procedures.  This is
>   convenient for, e.g., interrupt checks, where we may need to restart
>   a procedure from the top or return to a continuation.
>
>   This change affects all architectures, though only amd64 takes
>   advantage of it at the moment by choosing the following
>   representations:
>
>   . Compiled-entry is a pointer to an _offset_ to instructions.
>
>     => This way a compiled closure is just an array of offsets, and
>        consing a closure entails writing no new instruction memory and
>        invoking a newly consed closure entails no instruction cache
>        reloads.
>
>     => This makes a big difference for closure-heavy code -- 5x speed
>        improvements in some of my microbenchmarks.  And execute caches
>        (`uuo-links') can optimize away the indirection for anything
>        except closures (and maybe could optimize those away too with
>        some more help from the GC).
>
>     => There is a cost: unknown calls to fixed, long-lived, often-used
>        closures or to open procedures are marginally costlier because
>        of the additional memory indirection, but I haven't measured
>        this to be over 5% even in microbenchmarks that are nothing
>        _but_ these cases.  So I think this tradeoff is well worth it.
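>
>     (Again as a rough C sketch with made-up names, not the real object
>     format: consing now writes only data words, and an unknown call
>     reaches the instructions through the stored offset.)
>
>         #include <stdint.h>
>
>         /* New style: a compiled entry is a pointer to an offset word; a
>            closure is an array of such words (one per entry point) plus
>            its free variables, so consing writes no instruction memory. */
>         struct new_closure {
>             intptr_t  entry_offset;  /* offset from this word to the code */
>             uintptr_t free_vars[];   /* captured variables: plain data */
>         };
>
>         typedef void (*code_pointer)(void);
>
>         static inline code_pointer closure_code(struct new_closure *c)
>         {
>             /* The extra indirection on an unknown call: load the offset,
>                add it to its own address, and jump there. */
>             return (code_pointer)
>                 ((uintptr_t)&c->entry_offset + (uintptr_t)c->entry_offset);
>         }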
>
>   . Compiled-return is a pointer to a _jump instruction_.
>
>     => This way we can use CALL to push a continuation, and a paired
>        RET to return to it, which means we can take advantage of the
>        CPU's return address branch target predictor.
>
>     => The jump instruction has to go to a compiled entry with zero
>        offset so we can still parse continuations just like before.
>        It'll look like this:
>
>         ;; Procedure call.
>         (CALL (@PCR pushed))
>         (JMP (@PCR continuation))          ; Return address points here.
> pushed:
>         (OR (@R ,rsp) (& ,tag))            ; Tag the continuation.
>         (PUSH Q (& 123))                   ; Push additional arguments.
>         ...
>         (JMP (@PCR uuo-link))              ; Call procedure.
>         (WORD ...)                         ; Continuation entry metadata.
>         (BLOCK-OFFSET continuation)
>         (QUAD U 0)                         ; Zero offset to instructions.
> continuation:
>         (PUSH Q (R 0))                     ; Start of continuation code.
>
>         ;; Procedure return (ignoring interrupt checks).
>         (AND Q (@R ,rsp) (R ,rbp))         ; Detag continuation.
>         (RET)                              ; Pop and return.
>
> - For INVOCATION:APPLY unknown procedure calls, call an assembly hook
>   subroutine to set up an application, but generate a JMP instruction:
>
>         (POP Q (R ,rbx))                   ; Pop procedure into rbx.
>         (CALL (@RO ,regs #x128))           ; Call assembly hook.
>         ;; The assembly hook sets rax to the address of the
>         ;; procedure's starting instruction if it can, or to another
>         ;; assembly hook that defers to the interpreter if it can't.
>         (JMP (R ,rax))
>
>   This way the CPU can train each call site with a different branch
>   target prediction, rather than merging them all in an assembly hook
>   that executes the JMP itself.  This gave about a 2x speed increase on
>   some microbenchmarks.
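>
>   (The branch-prediction point, restated as a C sketch rather than as
>   anything MIT Scheme actually contains: when every unknown call funnels
>   through one shared indirect jump, the predictor sees a single branch
>   with many targets; an indirect jump emitted at each call site gives it
>   one branch per site to train.)
>
>       typedef long (*procedure)(long);
>
>       /* Old shape: all call sites share one indirect branch. */
>       static long shared_dispatch(procedure p, long arg)
>       {
>           return p(arg);                  /* the only indirect branch */
>       }
>       long site_a(procedure p, long x) { return shared_dispatch(p, x); }
>       long site_b(procedure q, long y) { return shared_dispatch(q, y); }
>
>       /* New shape: each call site has its own indirect branch, so the
>          CPU can learn a distinct likely target for each one. */
>       long site_a2(procedure p, long x) { return p(x); }
>       long site_b2(procedure q, long y) { return q(y); }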
>
> - Tweak some other compiler utilities to use CALL/RET responsibly.
>   Interrupt routines don't because they're always going to wreck the
>   return address branch prediction stack by returning to microcode
>   anyway, and it's more convenient for them to have the entry address.
>   On architectures where compiled entries and compiled returns have
>   the same representation this makes no difference.
>
> - Use rax, not a slot in the registers block in memory, as the return
>   value register -- but still let the machine register allocator hand
>   out rax.
>
>   . Requires some changes to the RTL generator to make sure we read
>     the value register first thing in a continuation, or write it last
>     thing before a return.
>     => As it happens, there was already a bug here with interpreter
>        call results (which go in rax too) and procedures with dynamic
>        links.
>     => This is just a matter of inserting some temporaries, which
>        evaporate into aliases for the RTL register allocator.
>
>   . Requires some changes to the RTL optimizer to avoid propagating
>     available machine registers in CSE and code compression so that
>     the read and write stay at the beginning or end of a block.
>
>   . Requires some changes to the LAP generator to avoid offering rax
>     for allocation when it is the target of the RTL instruction we are
>     working on right now.
>
>   . Requires some minor changes to the assembly hooks to avoid
>     clobbering the return value, e.g. trampolines now use r9b instead
>     of al for the trampoline code.
>
> - Open-code with-interrupt-mask and with-interrupts-reduced (in the
>   LAP generator, not in the RTL generator) so that we don't have to go
>   through reflect-to-interface, which breaks the return address branch
>   target predictor.  This gave about a 2x performance improvement on
>   promise operations (after I sped them up a lot by removing one of
>   the without-interruptses already).
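>
>   (Sketching the shape of this in C with a hypothetical mask variable;
>   the real primitives also have to deal with pending interrupts and
>   non-local exits, so this is only the happy path.  The point is that
>   the save/reduce/restore is emitted inline instead of calling out
>   through reflect-to-interface, which breaks the CALL/RET pairing
>   around it.)
>
>       static volatile unsigned long interrupt_mask;  /* hypothetical */
>
>       static long with_interrupts_reduced_sketch(unsigned long mask,
>                                                  long (*body)(void))
>       {
>           unsigned long old = interrupt_mask;
>           interrupt_mask = old & mask;    /* reduce the mask inline */
>           long result = body();
>           interrupt_mask = old;           /* restore on the way out */
>           return result;
>       }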
>
> My branch seems to be pretty stable now and I'm satisfied with all the
> ABI changes (much as it might be cute -- and not too hard now with the
> portable fasdumper -- to also try migrating to a NaN-tagging scheme),
> so I'd like to find a convenient time to merge it.  Thoughts?
>
> (Review also welcome before I merge!)
>
_______________________________________________
MIT-Scheme-devel mailing list
MIT-Scheme-devel@gnu.org
https://lists.gnu.org/mailman/listinfo/mit-scheme-devel
