So I just did a test. Took the following Rust code:
pub fn test_wrap(x : u32, y : u32) -> u32 {
return x.checked_mul(&y).unwrap().checked_add(&16).unwrap();
}
And got the following blob of assembly out. What we have there, my friends,
is a complete failure of the optimizer. (N.B. it works for the simple case
of checked_add alone.)
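For comparison, that simple case is just this (a sketch; the function name
is mine, same by-reference style as above):

pub fn test_add(x : u32) -> u32 {
    // A single checked op: here the optimizer does manage to fold away
    // the Option round-trip and the call to unwrap.
    return x.checked_add(&16).unwrap();
}

Anyway, on to the blob: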
Preamble:
__ZN9test_wrap19hc4c136f599917215af4v0.0E:
.cfi_startproc
cmpl %fs:20, %esp
ja LBB0_2
pushl $12
pushl $20
calll ___morestack
ret
LBB0_2:
pushl %ebp
Ltmp2:
.cfi_def_cfa_offset 8
Ltmp3:
.cfi_offset %ebp, -8
movl %esp, %ebp
Ltmp4:
.cfi_def_cfa_register %ebp
Align stack (for what? We don't do any SSE)
andl $-8, %esp
subl $16, %esp
Multiply x * y
movl 12(%ebp), %eax
mull 16(%ebp)
jno LBB0_4
If it did overflow (jno not taken), stash a 0 (None) at top of stack
movb $0, (%esp)
jmp LBB0_5
If it didn't overflow, stash a 1 (Some) at top of stack, plus the result
(we are building an Option<u32> here)
LBB0_4:
movb $1, (%esp)
movl %eax, 4(%esp)
Take pointer to &this for __thiscall:
LBB0_5:
leal (%esp), %ecx
calll __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
Do the addition to the result
addl $16, %eax
Repeat the previous circus for the checked add (jae = no carry, so LBB0_7
is the success path)
jae LBB0_7
movb $0, 8(%esp)
jmp LBB0_8
LBB0_7:
movb $1, 8(%esp)
movl %eax, 12(%esp)
LBB0_8:
leal 8(%esp), %ecx
calll __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
movl %ebp, %esp
popl %ebp
ret
.cfi_endproc
Yeah. It's not fast because it's not inlining through option::unwrap.
I'm not sure what can be done about this, or whether it's on the LLVM side
or the Rust side of things. My first instinct: find out what happens when
fail! is moved out-of-line from unwrap() into its own function (especially
if that function can be marked noinline!), because optimizers often choke
around EH.
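For concreteness, the restructuring I have in mind is roughly this shape
(a sketch only; the helper name is mine, and whether libstd can actually be
organised this way is exactly the question):

// Hoist the failure path out of unwrap() so the hot path is just a
// compare and one branch to a never-inlined, diverging helper.
#[inline(never)]
fn unwrap_failed() -> ! {
    fail!("called unwrap() on a None value")
}

fn unwrap_outlined(o: Option<u32>) -> u32 {
    match o {
        Some(v) => v,
        None => unwrap_failed(),
    }
}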
I tried to test the "optimal" situation in a synthetic benchmark:
https://gist.github.com/oshepherd/8376705
(In C for expediency. N.B. you must set core affinity before running this
benchmark, because I hackishly just read the TSC. i386 only.)
The results are really bizarre and seem to have a multitude of confounding
factors. (For example, if you minimally unroll and have the JCs jump
straight to abort, you get vastly different performance from jumping to a
closer location and then onwards to abort. Bear in mind that the overflow
case never happens during the test.) It would be interesting to do a test
in which a "trivial" implementation of trap-on-overflow is added to rustc
(read: the overflow case just jumps straight to abort or similar, to
minimize optimizer influence and variability) to see how defaulting to
trapping ints affects real-world workloads.
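By "trivial" I mean something of this shape (a sketch; assuming the abort
intrinsic is reachable as std::intrinsics::abort):

fn mul_or_trap(x: u32, y: u32) -> u32 {
    match x.checked_mul(&y) {
        Some(v) => v,
        // Overflow goes straight to abort: no unwinding, no formatting,
        // as little optimizer interference as possible.
        None => unsafe { ::std::intrinsics::abort() },
    }
}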
I wonder what level of performance impact would be considered "acceptable"
in exchange for improved safety by default?
Mind you, what I'd propose is something like: i32 = trapping, i32w =
wrapping, i32s = saturating.
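Illustratively, spelled with methods rather than new types (the i32w/i32s
names are hypothetical, and these method spellings are the modern ones):

fn add_flavours(a: i32, b: i32) -> (i32, i32, i32) {
    let t = a.checked_add(b).unwrap(); // i32: trap on overflow
    let w = a.wrapping_add(b);         // i32w: wrap modulo 2^32
    let s = a.saturating_add(b);       // i32s: clamp at i32::MAX / i32::MIN
    (t, w, s)
}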
Owen Shepherd
http://owenshepherd.net | [email protected]
On 11 January 2014 19:33, Daniel Micay <[email protected]> wrote:
> On Sat, Jan 11, 2014 at 11:54 AM, Owen Shepherd <[email protected]>
> wrote:
> > On 11 January 2014 06:20, Daniel Micay <[email protected]> wrote:
> >>
> >> The branch on the overflow flag results in a very significant loss in
> >> performance. For example, I had to carefully write the vector `push`
> >> method for my `Vec<T>` type to only perform one overflow check. With
> >> two checks, it's over 5 times slower due to failed branch predictions.
> >
> >
> > What did the generated code look like? I suspect that LLVM wasn't
> > generating
> > optimal code, perhaps because Rust wasn't giving it appropriate hints or
> > because of optimizer bugs. For reference, on AMD64 the code should look
> > something like the following hypothetical code:
> >
> > vec_allocate:
> > MOV $SIZE, %eax
> > MUL %rsi
> > JC Lerror
> > ADD $HEADER_SIZE, %rax
> > JC Lerror
> > MOV %rax, %rsi
> > JMP malloc
> > Lerror:
> > // Code to raise error here
> >
> > Note that the ordering is EXTREMELY important! x86 doesn't give you any
> > separate branch hints (excluding two obsolete ones which only the
> > Pentium 4 ever cared about), so your only clue to the optimizer is the
> > branch direction.
> >
> > I suspect your generated code had forward branches for the no-overflow
> > case. That's absolutely no good (codegen inserting "islands" of
> > failure-case code); it will screw up the branch predictor.
> >
> > x86 defaults to predicting all (conditional) forward jumps not taken and
> > all conditional backward jumps taken (loops!). If the optimizer wasn't
> > informed correctly, it probably won't have laid the code out to match.
> >
> > Seeing as the overflow case should basically never be hit, there is no
> > reason for it ever to be loaded into the optimizer, so that is good.
> >
> > (P.S. If the Rust compiler is really good, it'll convince LLVM to put
> > the error-case branch code in a separate section so it can all be packed
> > together, far away from useful cache lines and TLB entries.)
>
> Rust directly exposes the checked overflow intrinsics, so that's what
> was used. It already considers branches calling a `noreturn` function
> to be colder, so adding an explicit branch hint (which is easy enough
> via `llvm.expect`) doesn't help. Feel free to implement it yourself if
> you think you can do better. The compiler work is already done. I
> doubt you'll get something performing in the same ballpark as plain
> integers.
>
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev