As far as I know there is no specific optimization of SwitchPoint yet, i.e. there is still a volatile read in the middle of the pattern.
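For reference, the pattern in question looks roughly like the minimal, self-contained sketch below (the constant values are stand-ins, not JRuby's actual code). SwitchPoint.guardWithTest installs a fast path that stays valid until someone calls SwitchPoint.invalidateAll, and checking whether the switch has been flipped is where, in current builds, each call pays a volatile read:

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.SwitchPoint;

    public class SwitchPointConstantSketch {
        public static void main(String[] args) throws Throwable {
            SwitchPoint constants = new SwitchPoint();

            // Fast path: hand back the cached constant value directly.
            MethodHandle fast = MethodHandles.constant(Object.class, "cached Foo::BAR");
            // Slow path: stand-in for re-resolving the constant after a change.
            MethodHandle slow = MethodHandles.constant(Object.class, "re-resolved Foo::BAR");

            // Until the SwitchPoint is invalidated, calls go to the fast path.
            MethodHandle constant = constants.guardWithTest(fast, slow);
            System.out.println(constant.invoke());  // cached Foo::BAR

            // Redefining any constant flips every dependent call site at once.
            SwitchPoint.invalidateAll(new SwitchPoint[] { constants });
            System.out.println(constant.invoke());  // re-resolved Foo::BAR
        }
    }
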
Rémi

On 05/26/2011 08:40 AM, Charles Oliver Nutter wrote:
> Now for something completely different: SwitchPoint-based "constant"
> lookup in JRuby.
>
> It's certainly possible I'm doing something wrong here, but using a
> SwitchPoint for constant invalidation in JRuby (rather than pinging a
> global serial number) is significantly slower.
>
> Using SwitchPoint:
>
> ~/projects/jruby ➔ jruby -J-d64 --server bench/language/bench_const_lookup.rb 10
>                                      user     system      total        real
> 100k * 100 nested const get      1.342000   0.000000   1.342000 (  1.286000)
> 100k * 100 nested const get      1.030000   0.000000   1.030000 (  1.030000)
> 100k * 100 nested const get      1.131000   0.000000   1.131000 (  1.131000)
> 100k * 100 nested const get      1.085000   0.000000   1.085000 (  1.085000)
> 100k * 100 nested const get      1.019000   0.000000   1.019000 (  1.019000)
> 100k * 100 inherited const get   1.230000   0.000000   1.230000 (  1.230000)
> 100k * 100 inherited const get   0.989000   0.000000   0.989000 (  0.989000)
> 100k * 100 inherited const get   0.981000   0.000000   0.981000 (  0.981000)
> 100k * 100 inherited const get   0.988000   0.000000   0.988000 (  0.988000)
> 100k * 100 inherited const get   1.025000   0.000000   1.025000 (  1.025000)
> 100k * 100 both                  1.206000   0.000000   1.206000 (  1.206000)
> 100k * 100 both                  0.992000   0.000000   0.992000 (  0.992000)
> 100k * 100 both                  0.989000   0.000000   0.989000 (  0.989000)
> 100k * 100 both                  1.000000   0.000000   1.000000 (  1.000000)
> 100k * 100 both                  1.003000   0.000000   1.003000 (  1.003000)
>
> Using a global serial number ping:
>
> 100k * 100 nested const get      0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get      0.088000   0.000000   0.088000 (  0.087000)
> 100k * 100 nested const get      0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get      0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 nested const get      0.082000   0.000000   0.082000 (  0.082000)
> 100k * 100 inherited const get   0.084000   0.000000   0.084000 (  0.084000)
> 100k * 100 inherited const get   0.085000   0.000000   0.085000 (  0.085000)
> 100k * 100 inherited const get   0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 inherited const get   0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 inherited const get   0.083000   0.000000   0.083000 (  0.083000)
> 100k * 100 both                  0.096000   0.000000   0.096000 (  0.096000)
> 100k * 100 both                  0.097000   0.000000   0.097000 (  0.097000)
> 100k * 100 both                  0.105000   0.000000   0.105000 (  0.105000)
> 100k * 100 both                  0.097000   0.000000   0.097000 (  0.097000)
> 100k * 100 both                  0.086000   0.000000   0.086000 (  0.086000)
>
> Perhaps SwitchPoint has not had optimization love yet?
>
> FWIW, SwitchPoint doesn't even work in the macosx 5/13 build (which I
> *think* is b141), so there's nothing to compare it to (i.e. I don't
> consider this a regression...just slow).
>
> I can investigate this further on demand.
>
> - Charlie
>
> On Thu, May 26, 2011 at 1:34 AM, Charles Oliver Nutter
> <[email protected]> wrote:
>> Ok, here we go with the macosx build from 5/13. Performance is
>> *substantially* better.
>>
>> First tak:
>>
>>       user     system      total        real
>>   1.401000   0.000000   1.401000 (  0.821000)
>>   0.552000   0.000000   0.552000 (  0.552000)
>>   0.561000   0.000000   0.561000 (  0.561000)
>>   0.552000   0.000000   0.552000 (  0.552000)
>>   0.553000   0.000000   0.553000 (  0.553000)
>>
>> Same JRuby logic, earlier build, 2-4x faster than current MLVM
>> invokedynamic.
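Stepping outside the quoted thread for a moment: the "global serial number ping" that the numbers in the top message compare against amounts to something like the following sketch (field and method names are hypothetical, not JRuby's real ones). Each lookup pays one volatile read of a global counter and falls back to a full resolution only when the counter has moved:

    public class SerialPingSketch {
        // Bumped whenever any constant anywhere is redefined.
        static volatile long globalSerial = 0L;

        private long cachedSerial = -1L;  // serial observed when we cached
        private Object cachedValue;

        Object getConstant() {
            long serial = globalSerial;           // the one volatile read per lookup
            if (cachedSerial != serial) {         // anything changed since we cached?
                cachedValue = resolveConstant();  // slow path: full constant lookup
                cachedSerial = serial;
            }
            return cachedValue;                   // fast path: plain field reads
        }

        private Object resolveConstant() {
            return "Foo::BAR";                    // stand-in for the real resolution
        }
    }

Note that this scheme never gets cheaper than one volatile read per lookup, whereas the SwitchPoint version is the one that was expected to optimize down to a constant; the numbers above show the opposite.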
>> Now fib:
>>
>> 9227465
>>   0.979000   0.000000   0.979000 (  0.922000)
>> 9227465
>>   0.848000   0.000000   0.848000 (  0.848000)
>> 9227465
>>   0.796000   0.000000   0.796000 (  0.796000)
>> 9227465
>>   0.792000   0.000000   0.792000 (  0.792000)
>> 9227465
>>   0.786000   0.000000   0.786000 (  0.787000)
>>
>> The margin is not as great here, but it's easily 20% faster than even
>> the reverted GWT (no idea about the new GWT logic yet).
>>
>> I can provide assembly dumps and other logs from both builds on
>> request. Where shall we start?
>>
>> Disclaimer: I know optimizing for simple cases like fib and tak is not
>> a great idea, but it seems like if we can't make them fast we're going
>> to have trouble with a lot of other stuff. I will endeavor to get
>> numbers for less synthetic benchmarks too.
>>
>> - Charlie
>>
>> On Thu, May 26, 2011 at 12:33 AM, Charles Oliver Nutter
>> <[email protected]> wrote:
>>> Ok, onward with perf exploration, folks!
>>>
>>> I'm running with mostly-current MLVM, with John's temporary reversion
>>> of GWT to the older non-ricochet logic.
>>>
>>> As reported before, "fib" has improved with the reversion, but it's
>>> only marginally faster than JRuby's inline caching logic and easily
>>> 30-40% slower than it was in builds from earlier this month.
>>>
>>> I also decided to run "tak", which is another dispatch and
>>> recursion-heavy benchmark. This still seems to have a perf
>>> degradation.
>>>
>>> Here's with standard settings, current MLVM, amd64:
>>>
>>> ~/projects/jruby ➔ jruby --server bench/bench_tak.rb 5
>>>       user     system      total        real
>>>   2.443000   0.000000   2.443000 (  2.383000)
>>>   1.985000   0.000000   1.985000 (  1.985000)
>>>   2.007000   0.000000   2.007000 (  2.007000)
>>>   1.987000   0.000000   1.987000 (  1.987000)
>>>   1.991000   0.000000   1.991000 (  1.991000)
>>>
>>> Here is with JRuby's inline caching. Given that tak is an arity-three
>>> method, it's likely that the usually megamorphic inline cache is still
>>> monomorphic, so things are inlining through it when they wouldn't
>>> normally:
>>>
>>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>>> bench/bench_tak.rb 5
>>>       user     system      total        real
>>>   1.565000   0.000000   1.565000 (  1.510000)
>>>   0.624000   0.000000   0.624000 (  0.624000)
>>>   0.624000   0.000000   0.624000 (  0.624000)
>>>   0.624000   0.000000   0.624000 (  0.624000)
>>>   0.632000   0.000000   0.632000 (  0.632000)
>>>
>>> Oddly enough, modifying the benchmark to guarantee there are at least
>>> three different method calls of arity 3 does not appear to degrade
>>> this benchmark...
>>>
>>> Moving on to dynopt (reminder: this emits two invocations at compile
>>> time, one a guarded invokevirtual or invokestatic and the other a
>>> normal CachingCallSite.call):
>>>
>>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>>> -Xcompile.dynopt=true bench/bench_tak.rb 5
>>>       user     system      total        real
>>>   0.703000   0.000000   0.703000 (  0.630000)
>>>   0.514000   0.000000   0.514000 (  0.514000)
>>>   0.511000   0.000000   0.511000 (  0.511000)
>>>   0.512000   0.000000   0.512000 (  0.512000)
>>>   0.510000   0.000000   0.510000 (  0.510000)
>>>
>>> This is the "ideal" for invokedynamic, which hopefully should inline
>>> as well as this guarded direct invocation (right?).
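To make the dynopt shape just described concrete, here is a rough sketch of a guarded direct invocation; GenericCallSite, TakObject, and the method bodies are made-up stand-ins rather than JRuby's real types:

    // Stand-ins for JRuby's real types; names here are made up.
    interface GenericCallSite {
        Object call(Object recv, Object a, Object b, Object c);
    }

    class TakObject {
        Object tak(Object a, Object b, Object c) { return a; }  // stand-in body
    }

    class DynoptCallSiteSketch {
        private final GenericCallSite fallback;

        DynoptCallSiteSketch(GenericCallSite fallback) { this.fallback = fallback; }

        Object call(Object recv, Object a, Object b, Object c) {
            if (recv instanceof TakObject) {
                // Guard passed: a direct invokevirtual the JIT can see
                // through and inline, just like statically typed Java.
                return ((TakObject) recv).tak(a, b, c);
            }
            // Guard failed: fall through to the generic cached-call path.
            return fallback.call(recv, a, b, c);
        }
    }

The point is that the hot path is an ordinary virtual call behind a cheap test, which HotSpot already knows how to inline.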
>>> Now, it gets a bit more interesting. If I turn recursive inlining
>>> down to zero and use invokedynamic:
>>>
>>> ~/projects/jruby ➔ jruby --server -J-XX:MaxRecursiveInlineLevel=0
>>> bench/bench_tak.rb 5
>>>       user     system      total        real
>>>   1.010000   0.000000   1.010000 (  0.954000)
>>>   0.869000   0.000000   0.869000 (  0.869000)
>>>   0.870000   0.000000   0.870000 (  0.870000)
>>>   0.869000   0.000000   0.869000 (  0.869000)
>>>   0.870000   0.000000   0.870000 (  0.870000)
>>>
>>> Performance is easily 2x what it is with stock inlining settings.
>>> Something about invokedynamic or the MH chain is changing the
>>> characteristics of inlining in a way different from dynopt.
>>>
>>> So what looks interesting here? For which combination would you be
>>> interested in seeing logs?
>>>
>>> FWIW, I am pulling earlier builds now to try out fib and tak and get
>>> assembly output from them.
>>>
>>> - Charlie
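For completeness, the invokedynamic/GWT method-handle chain whose inlining behavior is being probed in this thread is in the spirit of the sketch below (a hypothetical reconstruction using the public JSR-292 API, not JRuby's actual bootstrap code):

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;
    import java.lang.invoke.MutableCallSite;

    public class GwtDispatchSketch {
        // Guard: is the receiver still an instance of the cached class?
        static boolean cachedClassTest(Class<?> cached, Object receiver) {
            return receiver.getClass() == cached;
        }

        static String cachedMethod(Object recv) { return "fast path for " + recv; }
        static String slowDispatch(Object recv) { return "slow path for " + recv; }

        public static void main(String[] args) throws Throwable {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType callType = MethodType.methodType(String.class, Object.class);

            MethodHandle target   = lookup.findStatic(GwtDispatchSketch.class, "cachedMethod", callType);
            MethodHandle fallback = lookup.findStatic(GwtDispatchSketch.class, "slowDispatch", callType);
            MethodHandle test = lookup.findStatic(GwtDispatchSketch.class, "cachedClassTest",
                    MethodType.methodType(boolean.class, Class.class, Object.class))
                    .bindTo(String.class);  // monomorphic cache keyed on String

            // The guard-with-test chain installed as the call site's target.
            MutableCallSite site = new MutableCallSite(
                    MethodHandles.guardWithTest(test, target, fallback));
            MethodHandle invoker = site.dynamicInvoker();

            System.out.println(invoker.invoke("abc"));  // guard hits
            System.out.println(invoker.invoke(42));     // guard misses
        }
    }

Whether HotSpot inlines through a chain like this as readily as through the hand-written guard in the dynopt sketch is exactly the question the benchmarks in this thread raise.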
