So I just did a test. Took the following Rust code:
pub fn test_wrap(x : u32, y : u32) -> u32 {
return x.checked_mul(&y).unwrap().checked_add(&16).unwrap();
}
And got the following blob of assembly out. What we have there, my friends,
is a complete failure of the optimizer. (N.B. it works for the simple case
of checked_add alone.)
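For comparison, that simple case is just this (a sketch; the function name
is mine, same by-reference style as above):

pub fn test_add(x : u32) -> u32 {
    // A single checked op: here the optimizer does manage to fold away
    // the Option round-trip and the call to unwrap.
    return x.checked_add(&16).unwrap();
}

Anyway, on to the blob: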
Preamble:
__ZN9test_wrap19hc4c136f599917215af4v0.0E:
.cfi_startproc
cmpl %fs:20, %esp
ja LBB0_2
pushl $12
pushl $20
calll ___morestack
ret
LBB0_2:
pushl %ebp
Ltmp2:
.cfi_def_cfa_offset 8
Ltmp3:
.cfi_offset %ebp, -8
movl %esp, %ebp
Ltmp4:
.cfi_def_cfa_register %ebp
Align stack (for what? We don't do any SSE)
andl $-8, %esp
subl $16, %esp
Multiply x * y
movl 12(%ebp), %eax
mull 16(%ebp)
jno LBB0_4
If it did overflow (jno not taken), stash a 0 (None) at top of stack
movb $0, (%esp)
jmp LBB0_5
If it didn't overflow, stash a 1 (Some) at top of stack, plus the result
(we are building an Option<u32> here)
LBB0_4:
movb $1, (%esp)
movl %eax, 4(%esp)
Take pointer to &this for __thiscall:
LBB0_5:
leal (%esp), %ecx
calll __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
Do the addition to the result
addl $16, %eax
Repeat the previous circus for the checked add (jae = no carry, so LBB0_7
is the success path)
jae LBB0_7
movb $0, 8(%esp)
jmp LBB0_8
LBB0_7:
movb $1, 8(%esp)
movl %eax, 12(%esp)
LBB0_8:
leal 8(%esp), %ecx
calll __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
movl %ebp, %esp
popl %ebp
ret
.cfi_endproc
Yeah. It's not fast because it's not inlining through option::unwrap.
I'm not sure what can be done about this, or whether it's on the LLVM side
or the Rust side of things. My first instinct: find out what happens when
fail! is moved out-of-line from unwrap() into its own function (especially
if that function can be marked noinline!), because optimizers often choke
around EH.
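For concreteness, the restructuring I have in mind is roughly this shape
(a sketch only; the helper name is mine, and whether libstd can actually be
organised this way is exactly the question):

// Hoist the failure path out of unwrap() so the hot path is just a
// compare and one branch to a never-inlined, diverging helper.
#[inline(never)]
fn unwrap_failed() -> ! {
    fail!("called unwrap() on a None value")
}

fn unwrap_outlined(o: Option<u32>) -> u32 {
    match o {
        Some(v) => v,
        None => unwrap_failed(),
    }
}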
I tried to test the "optimal" situation in a synthetic benchmark:
https://gist.github.com/oshepherd/8376705
(In C for expediency. N.B. you must set core affinity before running this
benchmark, because I hackishly just read the TSC. i386 only.)
The results are really bizarre and seem to have a multitude of confounding
factors. (For example, if you minimally unroll and have the JCs jump
straight to abort, you get vastly different performance from jumping to a
closer location and then onwards to abort. Bear in mind that the overflow
case never happens during the test.) It would be interesting to do a test
in which a "trivial" implementation of trap-on-overflow is added to rustc
(read: the overflow case just jumps straight to abort or similar, to
minimize optimizer influence and variability) to see how defaulting to
trapping ints affects real-world workloads.
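By "trivial" I mean something of this shape (a sketch; assuming the abort
intrinsic is reachable as std::intrinsics::abort):

fn mul_or_trap(x: u32, y: u32) -> u32 {
    match x.checked_mul(&y) {
        Some(v) => v,
        // Overflow goes straight to abort: no unwinding, no formatting,
        // as little optimizer interference as possible.
        None => unsafe { ::std::intrinsics::abort() },
    }
}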
I wonder what level of performance impact would be considered "acceptable"
in exchange for improved safety by default?
Mind you, what I'd propose is something like: i32 = trapping, i32w =
wrapping, i32s = saturating.
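Illustratively, spelled with methods rather than new types (the i32w/i32s
names are hypothetical, and these method spellings are the modern ones):

fn add_flavours(a: i32, b: i32) -> (i32, i32, i32) {
    let t = a.checked_add(b).unwrap(); // i32: trap on overflow
    let w = a.wrapping_add(b);         // i32w: wrap modulo 2^32
    let s = a.saturating_add(b);       // i32s: clamp at i32::MAX / i32::MIN
    (t, w, s)
}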
Owen Shepherd
http://owenshepherd.net | [email protected]
On 11 January 2014 19:33, Daniel Micay <[email protected]> wrote:
> On Sat, Jan 11, 2014 at 11:54 AM, Owen Shepherd <[email protected]>
> wrote:
> > On 11 January 2014 06:20, Daniel Micay <[email protected]> wrote:
> >>
> >> The branch on the overflow flag results in a very significant loss in
> >> performance. For example, I had to carefully write the vector `push`
> >> method for my `Vec<T>` type to only perform one overflow check. With
> >> two checks, it's over 5 times slower due to failed branch predictions.
> >
> >
> > What did the generated code look like? I suspect that LLVM wasn't
> > generating
> > optimal code, perhaps because Rust wasn't giving it appropriate hints or
> > because of optimizer bugs. For reference, on AMD64 the code should look
> > something like the following hypothetical code:
> >
> > vec_allocate:
> > MOV $SIZE, %eax
> > MUL %rsi
> > JC Lerror
> > ADD $HEADER_SIZE, %rax
> > JC Lerror
> > MOV %rax, %rsi
> > JMP malloc
> > Lerror:
> > // Code to raise error here
> >
> > Note that the ordering is EXTREMELY important! x86 doesn't give you any
> > separate branch hints (excluding two obsolete ones which only the
> > Pentium 4 ever cared about), so your only clue to the optimizer is the
> > branch direction.
> >
> > I suspect your generated code had forward branches for the no-overflow
> > case. That's absolutely no good (codegen inserting "islands" of
> > failure-case code); it will screw up the branch predictor.
> >
> > x86 defaults to predicting all (conditional) forward jumps not taken and
> > all conditional backward jumps taken (loops!). If the optimizer wasn't
> > informed correctly, it probably won't have laid the code out to match.
> >
> > Seeing as the overflow case should basically never be hit, there is no
> > reason for it ever to be loaded into the optimizer, so that is good.
> >
> > (P.S. If the Rust compiler is really good, it'll convince LLVM to put
> > the error-case branch code in a separate section so it can all be packed
> > together, far away from useful cache lines and TLB entries.)
>
> Rust directly exposes the checked overflow intrinsics, so that's what
> was used. It already considers branches calling a `noreturn` function
> to be colder, so adding an explicit branch hint (which is easy enough
> via `llvm.expect`) doesn't help. Feel free to implement it yourself if
> you think you can do better. The compiler work is already done. I
> doubt you'll get something performing in the same ballpark as plain
> integers.
>
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev