On Sat, Jan 11, 2014 at 4:31 PM, Owen Shepherd <[email protected]> wrote:
> So I just did a test. Took the following rust code:
>
>     pub fn test_wrap(x : u32, y : u32) -> u32 {
>         return x.checked_mul(&y).unwrap().checked_add(&16).unwrap();
>     }
>
> And got the following blob of assembly out. What we have there, my friends,
> is a complete failure of the optimizer (N.B. it works for the simple case
> of checked_add alone)
>
> Preamble:
>
>     __ZN9test_wrap19hc4c136f599917215af4v0.0E:
>         .cfi_startproc
>         cmpl    %fs:20, %esp
>         ja      LBB0_2
>         pushl   $12
>         pushl   $20
>         calll   ___morestack
>         ret
>     LBB0_2:
>         pushl   %ebp
>     Ltmp2:
>         .cfi_def_cfa_offset 8
>     Ltmp3:
>         .cfi_offset %ebp, -8
>         movl    %esp, %ebp
>     Ltmp4:
>         .cfi_def_cfa_register %ebp
>
> Align stack (for what? We don't do any SSE)
>
>         andl    $-8, %esp
>         subl    $16, %esp
The compiler aligns the stack for performance.

> Multiply x * y
>
>         movl    12(%ebp), %eax
>         mull    16(%ebp)
>         jno     LBB0_4
>
> If it didn't overflow, stash a 0 at top of stack
>
>         movb    $0, (%esp)
>         jmp     LBB0_5
>
> If it did overflow, stash a 1 at top of stack (we are building an
> Option<u32> here)
>
>     LBB0_4:
>         movb    $1, (%esp)
>         movl    %eax, 4(%esp)
>
> Take pointer to &this for __thiscall:
>
>     LBB0_5:
>         leal    (%esp), %ecx
>         calll   __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
>
> Do the addition to the result
>
>         addl    $16, %eax
>
> Repeat the previous circus
>
>         jae     LBB0_7
>         movb    $0, 8(%esp)
>         jmp     LBB0_8
>     LBB0_7:
>         movb    $1, 8(%esp)
>         movl    %eax, 12(%esp)
>     LBB0_8:
>         leal    8(%esp), %ecx
>         calll   __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
>         movl    %ebp, %esp
>         popl    %ebp
>         ret
>         .cfi_endproc
>
> Yeah. It's not fast because it's not inlining through option::unwrap.

The code to initiate failure is gigantic, and LLVM doesn't do partial
inlining by default, so unwrap is likely far above the inlining threshold.

> I'm not sure what can be done for this, and whether it's on the LLVM side
> or the Rust side of things. My first instinct: find out what happens when
> fail! is moved out-of-line from unwrap() into its own function (especially
> if that function can be marked noinline!), because optimizers often choke
> around EH.

I was testing with `rust-core` and calling `abort`, as it doesn't use
unwinding.

> I tried to test the "optimal" situation in a synthetic benchmark:
> https://gist.github.com/oshepherd/8376705
> (In C for expediency. N.B. you must set core affinity before running this
> benchmark because I hackishly just read the TSC. i386 only.)
>
> but the results are really bizarre and seem to have a multitude of
> affecting factors (For example, if you minimally unroll and have the JCs
> jump straight to abort, you get vastly different performance from jumping
> to a closer location and then onwards to abort.
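For what it's worth, chaining the two checked operations through a single
Option pipeline gives the optimizer one unwrap to see through instead of two.
A minimal sketch, written against the current Rust API (where `checked_mul`
and `checked_add` take their argument by value rather than by reference as in
the snippet above); whether this actually inlines better depends on the
compiler version:

```rust
// Sketch: same computation as test_wrap above, but with one Option
// pipeline and a single point of failure instead of two unwrap calls.
pub fn test_wrap(x: u32, y: u32) -> u32 {
    x.checked_mul(y)
        .and_then(|v| v.checked_add(16))
        .expect("arithmetic overflow")
}

fn main() {
    // 3 * 4 + 16 = 28; no overflow, so the expect succeeds.
    println!("{}", test_wrap(3, 4));
}
```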
> Bear in mind that the overflow case never happens during the test). It
> would be interesting to do a test in which a "trivial" implementation of
> trap-on-overflow is added to rustc (read: the overflow case just jumps
> straight to abort or similar, to minimize optimizer influence and
> variability) to see how defaulting to trapping ints affects real world
> workloads.
>
> I wonder what level of performance impact would be considered "acceptable"
> for improved safety by default?
>
> Mind you, I think that what I'd propose is that i32 = trapping, i32w =
> wrapping, i32s = saturating, or something similar

A purely synthetic benchmark that only executes the unchecked or checked
instruction isn't interesting. You need the loop to contain the kinds of
optimizations real code relies on, and you will often see a massive drop in
performance from the serialization of the pipeline. Register renaming is not
as clever as you'd expect.

The impact of trapping is already known, because `clang` and `gcc` expose it
as `-ftrapv`. Integer-heavy workloads like cryptography and video codecs are
several times slower with the checks.

_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
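As a footnote on the i32 / i32w / i32s proposal: Rust as later shipped chose
per-operation methods rather than distinct types, but the three behaviours
map onto them directly. A hedged sketch of the semantics being discussed,
using the current method-based API rather than the proposed type names:

```rust
fn main() {
    let x: u32 = u32::MAX;

    // Checked: returns None on overflow instead of trapping,
    // leaving the caller to decide how to fail.
    assert_eq!(x.checked_add(1), None);

    // Wrapping: two's-complement wraparound, the proposed "i32w" behaviour.
    assert_eq!(x.wrapping_add(1), 0);

    // Saturating: clamps at the type's bounds, the proposed "i32s" behaviour.
    assert_eq!(x.saturating_add(1), u32::MAX);

    // Plain `+` is the "trapping by default" option: it panics on overflow
    // in debug builds (and in release builds with overflow checks enabled).
    println!("ok");
}
```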
