Christian Thalinger wrote:
> On Aug 8, 2011, at 4:55 PM, Vladimir Kozlov wrote:
>
>> Christian,
>>
>> Should we put the "skip bytecode quickening" code under a flag so that we
>> do this only when invokedynamic is enabled? Or is put_code zero only in
>> the invokedynamic case?
>
> No, it doesn't buy us anything. The new checking code is only executed the
> first time, as the bytecodes are quickened right after that. And in the
> case where a putfield isn't quickened and we call resolve_get_put, it gets
> very expensive anyway.
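
Here is a minimal standalone sketch of the quickening skip under discussion
(illustrative names, not the actual HotSpot code): if the put_code cached in
the constant pool cache entry is still zero, the putfield is left
unquickened, so every execution re-enters InterpreterRuntime::resolve_get_put.

    #include <cstdint>

    // Toy stand-in for a constant pool cache entry; HotSpot packs this
    // into _indices (see the layout discussed below).
    struct CPCacheEntrySketch {
      uint8_t put_code;  // resolved put bytecode, or 0 if unresolved
    };

    // patch_bytecode() would only quicken when this returns true; a zero
    // put_code keeps the slow path (resolve_get_put) running on every
    // execution, which is what allows the deoptimization push.
    inline bool may_quicken(const CPCacheEntrySketch& e) {
      return e.put_code != 0;
    }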
You lost me here. The new code in resolve_get_put() is executed only for a
putfield to CallSite.target, but the new code in patch_bytecode() skips
quickening for all putfield bytecodes. My question is: can you narrow the
skipping of quickening to only putfields to CallSite.target? Or are you
saying that there is no performance difference between executing _aputfield
vs. _fast_aputfield?

Vladimir

>
>> On 8/8/11 6:56 AM, Christian Thalinger wrote:
>>>> Why do you use ld_ptr() on SPARC to load from the cache, but movl()
>>>> (only 32-bit) on x86 and x64?
>>>
>>> Good question. I took the code from
>>> TemplateTable::resolve_cache_and_index without thinking about it and
>>> that one uses ld_ptr.
>>>
>>> _indices in ConstantPoolCacheEntry is defined as intx:
>>>
>>>   volatile intx _indices;  // constant pool index & rewrite bytecodes
>>>
>>> and bytecodes 1 and 2 are in the upper 16 bits of the lower 32-bit word:
>>>
>>>   // bit number |31                0|
>>>   // bit length |-8--|-8--|---16----|
>>>   // --------------------------------
>>>   // _indices   [ b2 | b1 |  index  ]
>>>
>>> Loading 32 bits on little-endian gives you the right bits, but on
>>> big-endian it does not. I think that's the reason for the
>>> "optimization" on x64.
>>
>> I don't like this "optimization" but I understand why we are using it.
>> Add a comment (especially in the x64 file).
>
> I factored reading the bytecode into
> InterpreterMacroAssembler::get_cache_and_index_and_bytecode_at_bcp since
> the same code is used twice in TemplateTable, and added the comment there.
>
>>>> I am concerned about the use of a short branch in the new code in
>>>> templateTable_sparc.cpp:
>>>>
>>>>   cmp_and_br_short(..., L_patch_done);  // don't patch
>>>>
>>>> There is a __ stop() call which generates a lot of code, so the label
>>>> L_patch_done could be far away.
>>>
>>> Yeah, I thought I'd give it a try and see if it works. cmp_and_br_short
>>> should assert if the branch displacement is too far, right?
>>>
>> Yes, it will assert, but maybe only in some worst case which we do not
>> test. For example, try to run a 64-bit fastdebug VM on SPARC + compressed
>> oops + VerifyOops.
>
> That works.
>
>>>> Why did you add new #includes into ciEnv.cpp and nmethod.cpp? What code
>>>> needs them? Nothing else is changed in these files.
>>>
>>> Both files use dependencies and I got linkage errors on Linux while
>>> working on the fix (because of inline methods). It seems that the
>>> include is not required in ciEnv.cpp because ciEnv.hpp already includes
>>> it. I missed that. But nmethod.cpp needs it because nmethod.hpp only
>>> declares class Dependencies.
>>
>> OK.
>>
>>>> Why did you not leave the "volatile" call site inlining with a guard?
>>>> You did not explain why a virtual call is fine for it.
>>>
>>> The spec of MutableCallSite says:
>>>
>>> "For target values which will be frequently updated, consider using a
>>> volatile call site instead."
>>>
>>> And VolatileCallSite says:
>>>
>>> "A VolatileCallSite is a CallSite whose target acts like a volatile
>>> variable. An invokedynamic instruction linked to a VolatileCallSite sees
>>> updates to its call site target immediately, even if the update occurs
>>> in another thread. There may be a performance penalty for such tight
>>> coupling between threads.
>>>
>>> Unlike MutableCallSite, there is no syncAll operation on volatile call
>>> sites, since every write to a volatile variable is implicitly
>>> synchronized with reader threads.
>>>
>>> In other respects, a VolatileCallSite is interchangeable with
>>> MutableCallSite."
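
Going back to the _indices layout above, here is a small standalone sketch
(made-up names, not HotSpot source) of why a plain 32-bit load of the 64-bit
_indices word yields b1 and b2 on little-endian but not on big-endian:

    #include <cstdint>
    #include <cstring>

    int main() {
      // 64-bit word with layout [ b2 | b1 | index ] in the low 32 bits.
      uint64_t indices =
          (uint64_t(0xb2) << 24) | (uint64_t(0xb1) << 16) | 0x1234;

      // A 32-bit load at the address of 'indices', like movl on x86/x64.
      uint32_t low32;
      std::memcpy(&low32, &indices, sizeof(low32));

      // Little-endian: the low 32 bits live at the lowest address, so
      // low32 == 0xb2b11234 and the shift below recovers b1 (0xb1).
      // Big-endian: the same load would read the high (all-zero) half,
      // which is why the 32-bit "optimization" only works on LE.
      return ((low32 >> 16) & 0xff) == 0xb1 ? 0 : 1;
    }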
>>>
>>> Since VolatileCallSite really should only be used when you know the
>>> target changes very often, we don't do optimizations for this case.
>>> Obviously this is just a guess about how people will use
>>> VolatileCallSite, but I think for now it is a safe bet.
>>>
>> Thank you for explaining it.
>>
>>> Additionally I had to make two small changes because the build was
>>> broken on some configurations:
>>>
>>> -  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : NULL;
>>> +  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : (klassOop) NULL;
>>>
>>> and
>>>
>>> -  MutexLockerEx ccl(CodeCache_lock, thread);
>>> +  MutexLockerEx ccl(CodeCache_lock, Mutex::_no_safepoint_check_flag);
>>>
>>> I updated the webrev.
>>
>> Good.
>
> Thanks.
>
> -- Christian
>
>> Vladimir
>>
>>> -- Christian
>>>
>>>>
>>>> Vladimir
>>>>
>>>> On 8/5/11 6:32 AM, Christian Thalinger wrote:
>>>>> http://cr.openjdk.java.net/~twisti/7071653
>>>>>
>>>>> 7071653: JSR 292: call site change notification should be pushed not
>>>>> pulled
>>>>> Reviewed-by:
>>>>>
>>>>> Currently every speculatively inlined method handle call site has a
>>>>> guard that compares the current target of the CallSite object to the
>>>>> inlined one. This per-invocation overhead can be removed if the
>>>>> notification is changed from pulled to pushed (i.e. deoptimization).
>>>>>
>>>>> I had to change the logic in TemplateTable::patch_bytecode to skip
>>>>> bytecode quickening for putfield instructions when the put_code
>>>>> written to the constant pool cache is zero. This is required so that
>>>>> every execution of a putfield to CallSite.target calls out to
>>>>> InterpreterRuntime::resolve_get_put to do the deoptimization of
>>>>> dependent compiled methods.
>>>>>
>>>>> I also had to change the dependency machinery to understand
>>>>> dependencies other than class hierarchy ones. DepChange became the
>>>>> super-type of two new dependencies, KlassDepChange and
>>>>> CallSiteDepChange.
>>>>>
>>>>> Tested with JRuby tests and benchmarks, hand-written testcases, JDK
>>>>> tests and vm.mlvm tests.
>>>>>
>>>>> Here is the speedup for the JRuby fib benchmark (first is JDK 7 b147,
>>>>> second with 7071653). Since the CallSite targets don't change during
>>>>> the runtime of this benchmark we can see the performance benefit of
>>>>> eliminating the guard:
>>>>>
>>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>>   0.883000   0.000000   0.883000 (  0.854000)
>>>>>   0.715000   0.000000   0.715000 (  0.715000)
>>>>>   0.712000   0.000000   0.712000 (  0.712000)
>>>>>   0.713000   0.000000   0.713000 (  0.713000)
>>>>>   0.713000   0.000000   0.713000 (  0.712000)
>>>>>
>>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>>   0.772000   0.000000   0.772000 (  0.742000)
>>>>>   0.624000   0.000000   0.624000 (  0.624000)
>>>>>   0.621000   0.000000   0.621000 (  0.621000)
>>>>>   0.622000   0.000000   0.622000 (  0.622000)
>>>>>   0.622000   0.000000   0.622000 (  0.621000)
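
Finally, a toy model of the pushed notification described in the original
mail (plain C++, not HotSpot source): instead of every compiled call
re-checking the target ("pull"), a target write walks the registered
dependents and invalidates them once ("push"), which is what the new
CallSiteDepChange dependency enables for nmethods.

    #include <cstdio>
    #include <vector>

    struct NMethodSketch { bool valid = true; };

    struct CallSiteSketch {
      void* target = nullptr;
      std::vector<NMethodSketch*> dependents;  // methods that inlined 'target'

      void set_target(void* new_target) {
        target = new_target;
        // "Push": deoptimize every dependent at change time, so the
        // compiled calls themselves need no per-invocation guard.
        for (NMethodSketch* nm : dependents) nm->valid = false;
      }
    };

    int main() {
      NMethodSketch nm;
      CallSiteSketch cs;
      cs.dependents.push_back(&nm);
      cs.set_target(reinterpret_cast<void*>(0x1));
      std::printf("dependent still valid? %s\n", nm.valid ? "yes" : "no");
    }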