Re: [racket-dev] better x86 performance

2011-04-26 Thread Vincent St-Amour
Here are the numbers for Racket, Typed Racket, Gambit and Larceny on
32 bits, and without Larceny on 64 bits.

Overall, we're competitive, but we're losing pretty hard on deriv.

Vincent

32 bits:

benchmark   fastest    gambit  larceny  racket  typed-racket
cpstack      5715 ms   1.13    1        1.46    1.44
dderiv       6200 ms   1.48    1        1.49    1.51
deriv        3096 ms   1       1.34     3.08    3.13
div          7434 ms   1       2.16     1.16    1.11
fft          5903 ms   1       2.01     1.49    1.11
graphs       7656 ms   1.47    1.20     1.02    1
lattice2     5863 ms   2.07    1        1.50    1.41
maze2        7678 ms   3.64    3.58     1.00    1
mazefun     11693 ms   1.52    1        1.08    1.06
nfa          6419 ms   1.26    1        1.23    1.03
nqueens      6835 ms   1.04    1        1.04    1.07
paraffins    6950 ms   1       1.95     1.43    1.22
tak          7669 ms   1.54    1.56     1.00    1
takl         7325 ms   2.32    1.64     1.35    1
triangle     8960 ms   1.00    1.10     1       1.09

64 bits:

benchmark   fastest    gambit  racket  typed-racket
cpstack      5332 ms   1       1.50    1.47
dderiv       5981 ms   1       1.33    1.30
deriv        3064 ms   1       2.37    2.32
div          7014 ms   1       1.58    1.56
fft          3830 ms   1       1.47    1.16
graphs       6794 ms   1.22    1.01    1
lattice2     7250 ms   1.31    1.06    1
maze2        7280 ms   2.46    1.04    1
mazefun     12295 ms   1.10    1.09    1
nfa          6794 ms   1       1.39    1.15
nqueens      5651 ms   1       1.21    1.09
paraffins    8679 ms   1       1.52    1.20
tak          7916 ms   1       1.02    1.00
takl         8252 ms   1.62    1.13    1
triangle     6862 ms   1       1.21    1.27
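
To read the numbers above as absolute times, here is a small sketch in
Python. It assumes that each entry is a running time relative to the
fastest implementation on that benchmark (a bare 1 marking the fastest),
which the message does not state explicitly. The function name
`absolute_ms' is made up for illustration.

```python
def absolute_ms(fastest_ms, ratio):
    """Convert a relative entry back to milliseconds.

    Assumption: `ratio` is (time of this implementation) / (fastest time).
    """
    return fastest_ms * ratio


# deriv on 32 bits: the fastest time is 3096 ms and Racket's entry is 3.08.
racket_deriv_ms = absolute_ms(3096, 3.08)

# Under the same assumption, the gap to the fastest implementation:
gap_ms = racket_deriv_ms - 3096

print(round(racket_deriv_ms), round(gap_ms))
```

Under that reading, Racket takes roughly three times as long as the
fastest implementation on deriv, which is the outlier mentioned above.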


At Sun, 24 Apr 2011 22:09:18 -0400,
Vincent St-Amour wrote:
 
 These are impressive speedups!
 
 Given how close we were to the fastest Scheme compilers on some of
 these, that may be enough to give us the lead.
 
 I'll run the benchmarks on different implementations tomorrow.
 
 Vincent
 
 
 At Sun, 24 Apr 2011 17:11:21 -0600,
 Matthew Flatt wrote:
  
  The `assoc' example helped focus my attention on a long-unsolved issue
  with JIT-generated code, where non-tail calls from JIT-generated code
  to other JIT-generated code seemed more expensive than they should be.
  This effect showed up in `assq' and `assoc' through a high relative
  cost for calling `assq' or `assoc' on a short list (compared to calling
  the C implementation).
  
  This time, I finally saw what I've been missing: It's crucial to pair
  `call' and `ret' instructions on x86. That won't be news to compiler
  writers; it's a basic fact that I missed along the way.
  
  When the JIT generates a non-tail call to other code that it
  generates, it sets up the called procedure's frame directly (because
  various computed values are more readily available before jumping to
  the called procedure). After setting up the frame --- including a
  return address --- the target code was reached using `jmp'. Later, the
  `ret' to return from the non-tail call would confuse the processor and
  cause stalls, because the `ret' wasn't matched with a `call'.
  It's easy enough to put the return address in place using `call' when
  setting up a frame, which exposes the right nesting to the processor.
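
The effect described in the paragraph above can be sketched with a toy
model in Python. This is not how any particular processor works: it
only assumes that the processor keeps a private stack of return
addresses that `call' pushes and `ret' pops, and counts a `ret' as
mispredicted when the popped guess differs from the real return
address. The class and function names are made up for illustration.

```python
class ReturnPredictor:
    """Toy model of a return-address predictor (an assumption, not real hardware)."""

    def __init__(self):
        self.guesses = []          # return addresses remembered from `call`
        self.mispredictions = 0

    def call(self, return_address):
        # `call` stores the return address in the frame and also lets the
        # predictor remember it.
        self.guesses.append(return_address)

    def jmp_with_manual_frame(self, return_address):
        # The return address is written into the frame by ordinary stores
        # and the target is reached with `jmp`: the predictor sees nothing.
        pass

    def ret(self, actual_return_address):
        guess = self.guesses.pop() if self.guesses else None
        if guess != actual_return_address:
            self.mispredictions += 1


def count_mispredictions(depth, use_call):
    """Enter `depth` nested non-tail calls, then return from all of them."""
    predictor = ReturnPredictor()
    addresses = list(range(100, 100 + depth))   # arbitrary distinct addresses
    for address in addresses:
        if use_call:
            predictor.call(address)
        else:
            predictor.jmp_with_manual_frame(address)
    for address in reversed(addresses):
        predictor.ret(address)
    return predictor.mispredictions


print(count_mispredictions(8, use_call=True))    # paired call/ret
print(count_mispredictions(8, use_call=False))   # jmp in, ret out
```

In this model every `ret' is predicted correctly when the frame's
return address was put in place with `call', and every `ret' is
mispredicted when the target was entered with `jmp'.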
  
  The enclosed table shows the effect on traditional Scheme
  microbenchmarks. Improvements of 20% are common, and several improve by
  50% or more. It's difficult to say which real code will benefit, but I
  think the improvement is likely to be useful.
_
  For list-related administrative tasks:
  http://lists.racket-lang.org/listinfo/dev

Re: [racket-dev] better x86 performance

2011-04-24 Thread Eli Barzilay
Two minutes ago, Robby Findler wrote:
 On Sun, Apr 24, 2011 at 7:56 PM, Eli Barzilay e...@barzilay.org wrote:
  An hour and a half ago, Matthew Flatt wrote:
 
  [...] Later, the `ret' to return from the non-tail call would
  confuse the processor and cause stalls, because the `ret' wasn't
  matched with a `call'.  It's easy enough to put the return address
  in place using `call' when setting up a frame, which exposes the
  right nesting to the processor.
 
  Does this mean that the code was correct, only it followed a pattern
  that is not commonly produced by most compilers?
 
 Yes, except that the issue here is branch (jump) prediction, not so
 much the fact that compilers commonly produce call/ret pairs. That
 is, the processor can do a much better job of keeping things running
 fast when it can predict which instruction is going to come after
 the current one [...]

Oh right -- the main advantage is in prediction.

(I know about it, just didn't see the connection to it.)

(Also, this is a much more subtle point than Matthew's post made it
sound when he said that `call's are better paired with `ret's -- that
sounded like a more real bug.)

-- 
  ((lambda (x) (x x)) (lambda (x) (x x)))  Eli Barzilay:
http://barzilay.org/   Maze is Life!


Re: [racket-dev] better x86 performance

2011-04-24 Thread Vincent St-Amour
These are impressive speedups!

Given how close we were to the fastest Scheme compilers on some of
these, that may be enough to give us the lead.

I'll run the benchmarks on different implementations tomorrow.

Vincent


At Sun, 24 Apr 2011 17:11:21 -0600,
Matthew Flatt wrote:
 
 The `assoc' example helped focus my attention on a long-unsolved issue
 with JIT-generated code, where non-tail calls from JIT-generated code
 to other JIT-generated code seemed more expensive than they should be.
 This effect showed up in `assq' and `assoc' through a high relative
 cost for calling `assq' or `assoc' on a short list (compared to calling
 the C implementation).
 
 This time, I finally saw what I've been missing: It's crucial to pair
 `call' and `ret' instructions on x86. That won't be news to compiler
 writers; it's a basic fact that I missed along the way.
 
 When the JIT generates a non-tail call to other code that it
 generates, it sets up the called procedure's frame directly (because
 various computed values are more readily available before jumping to
 the called procedure). After setting up the frame --- including a
 return address --- the target code was reached using `jmp'. Later, the
 `ret' to return from the non-tail call would confuse the processor and
 cause stalls, because the `ret' wasn't matched with a `call'.
 It's easy enough to put the return address in place using `call' when
 setting up a frame, which exposes the right nesting to the processor.
 
 The enclosed table shows the effect on traditional Scheme
 microbenchmarks. Improvements of 20% are common, and several improve by
 50% or more. It's difficult to say which real code will benefit, but I
 think the improvement is likely to be useful.