On Tue, 11 Jun 2013 10:28:14 +0100 Alan Bateman <alan.bate...@oracle.com> wrote:
> On 10/06/2013 19:06, Steven Schlansker wrote:
> > Hi core-libs-dev,
> >
> > While doing performance profiling of my application, I discovered
> > that nearly 50% of the time deserializing JSON was spent within
> > String.intern(). I understand that in general interning Strings is
> > not the best approach for things, but I think I have a decent use
> > case -- the value of a certain field is one of a very limited
> > number of valid values (that are not known at compile time, so I
> > cannot use an Enum), and is repeated many millions of times in the
> > JSON stream.
>
> Have you run with -XX:+PrintStringTableStatistics? Might be
> interesting if you can share the output (it is printed just before
> the VM terminates).
>
> There are also tuning knobs such as StringTableSize, and it would be
> interesting to know if you've experimented with those.
>
> -Alan.

Hi everyone,

Thanks again for your useful advice. I definitely misjudged the difficulty and complexity of working with methods that directly bridge the Java <-> C++ "gap"! As a result, it took me much longer to get to this than I expected. That said, I think I have very good results to report. OpenJDK 8 is already *significantly* better than OpenJDK 7 was, but it can still be improved.

Here is the patch I have at the moment:
https://gist.github.com/stevenschlansker/6153643

I used oprofile with oprofile-jit to identify the hot spots in the benchmark code as java_lang_String::equals() and java_lang_String::as_unicode_string(). Both methods contain hand-written loops that copy or compare jchar arrays. In -fastdebug and -slowdebug builds this costs one method call per character (the function is not inlined), and even in -release builds it is significantly slower than the libc memcpy() and memcmp() functions, which can use SSE4 (or other related technologies).

My patch adds two new methods, char_cmp and char_cpy, which delegate to the mem* functions instead of using hand-written loops; a rough sketch of the idea is below. The micro-benchmark results are very good for such a small change.
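To make the delegation concrete, here is a minimal sketch of what I mean. The helper names match the patch, but the exact signatures, asserts, and the jchar typedef shown here are only illustrative; the real code is in the gist above.

#include <string.h>   // memcmp, memcpy
#include <stddef.h>   // size_t

typedef unsigned short jchar;  // 16-bit UTF-16 code unit, as in the JNI headers

// Equality-only comparison of two jchar arrays. Byte-wise equality is the
// same as element-wise equality, so memcmp() == 0 is safe for equals();
// the sign of a non-zero result is not a meaningful UTF-16 ordering, though.
static inline int char_cmp(const jchar* a, const jchar* b, int len) {
  return memcmp(a, b, (size_t)len * sizeof(jchar));
}

// Bulk copy of a jchar array: one memcpy() instead of one store per
// character in a hand-written loop.
static inline void char_cpy(jchar* dst, const jchar* src, int len) {
  memcpy(dst, src, (size_t)len * sizeof(jchar));
}

The point is simply that the per-character loop (which the debug builds do not inline) is replaced by a single call into the optimized libc routines.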
On fastdebug, before:

Benchmark                                        Mode Thr    Cnt  Sec    Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample   1   2819    5   1.780       0.184  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample   1    343    5  14.571       0.310  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample   1   8712    5   0.526       0.138  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample   1   4427    5   1.133       0.121  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample   1    603    5   8.319       0.161  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample   1  17185    5   0.274       0.048  msec/op

After:

Benchmark                                        Mode Thr    Cnt  Sec    Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample   1   2898    5   1.812       0.208  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample   1   1138    5   4.397       0.136  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample   1   9035    5   0.519       0.146  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample   1   4538    5   1.094       0.107  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample   1   1363    5   3.686       0.100  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample   1  16686    5   0.316       0.081  msec/op

On release, before:

Benchmark                                        Mode Thr    Cnt  Sec    Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample   1   4030    5   1.240       0.002  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample   1   1024    5   4.894       0.042  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample   1  20000    5   0.185       0.002  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample   1   6143    5   0.814       0.005  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample   1   1852    5   2.702       0.016  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample   1  20000    5   0.102       0.001  msec/op

After:

Benchmark                                        Mode Thr    Cnt  Sec    Mean  Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern   sample   1   4040    5   1.236       0.002  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern   sample   1   2733    5   1.832       0.010  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern    sample   1  20000    5   0.181       0.002  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern  sample   1   6170    5   0.809       0.001  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern  sample   1   3577    5   1.396       0.007  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern   sample   1  20000    5   0.102       0.000  msec/op

This is almost a 3.5x improvement in the long-string JdkIntern case on fastdebug builds, and more than a 2.5x improvement on release builds (the short-string case improves somewhat less). JDK interning is now only ~50% slower than the ConcurrentHashMap-based approach, at least for the low-contention case! (I did not benchmark higher thread counts thoroughly, but I do not think my changes could make that case any worse.)

Finally, the benchmark code:
https://github.com/stevenschlansker/jvm-intern-benchmark/blob/master/src/main/java/org/sugis/benchmark/InternBenchmark.java

It is not the best benchmark ever written, but I'm hopeful that it is "good enough" to demonstrate the wins above; please let me know if you believe it invalidates my conclusions. It is run with JMH, as that seems to be the standard way of doing things around here.

Thank you again for your time and input; I am hopeful that I have not erred terribly :-)

Best,
Steven