Hi Steven,

This looks like a promising patch. If all goes well I can sponsor this patch into hotspot-rt.

I have problems building your test. Could you send a stand-alone test that can be build with plain JDK (i.e., exclude things like com.google.common. and org.openjdk.jmh)?

Also, I saw that you tested with strings that are the same length as UUID, which are 36 chars long:

   2544f85f-1845-42c4-aba1-7a431e6edb99
   eff61458-32f8-45cd-a77d-fddb14951cab
   d685f1a2-0b95-4021-b9b3-b0fb3c02a17f

We do need to handle strings with shorter length, and making calls to memcmp() may cause degradation with lots of short strings. Can you add a sub-test that tests for strings length <= 4?

I would also recommend running a few non-trivial apps (such as eclipse) and dump the interned string table to see the distribution of lengths. I know this is a hassle, but it would be a good last step to wrap up your work.

Meanwhile, I will try to run our benchmarks (aka "refworkload") to check for any performance impact.

Thanks
- Ioi

On 08/04/2013 10:21 PM, Steven Schlansker wrote:
On Tue, 11 Jun 2013 10:28:14 +0100
Alan Bateman <alan.bate...@oracle.com> wrote:

On 10/06/2013 19:06, Steven Schlansker wrote:
Hi core-libs-dev,

While doing performance profiling of my application, I discovered
that nearly 50% of the time deserializing JSON was spent within
String.intern().  I understand that in general interning Strings is
not the best approach for things, but I think I have a decent use
case -- the value of a certain field is one of a very limited
number of valid values (that are not known at compile time, so I
cannot use an Enum), and is repeated many millions of times in the
JSON stream.

Have you run with -XX:+PrintStringTableStatistics? Might be
interesting if you can share the output (it is printed just before
the VM terminates).

There are also tuning knobs such as StringTableSize and would be
interesting to know if you've experimented with.

-Alan.
Hi everyone,

Thanks again for your useful advice.  I definitely misjudged the
difficulty and complexity of working with methods that directly bridge
the Java <-> C++ "gap"!  As such, it took me way longer to get to this
than I expected...

That said, I think I have very good results to report.  OpenJDK8 is
already *significantly* better than OpenJDK 7 was, but still can be
better.

Here is the patch I have at the moment:
https://gist.github.com/stevenschlansker/6153643

I used oprofile with oprofile-jit to identify the hot spots in the
benchmark code as being java_lang_String::equals() and
java_lang_String::as_unicode_string.

Both of these methods have hand-written loops that copy or compare
jchar arrays.

The problem is that in -fastdebug or -slowdebug releases, this is
one method call per character (the function is not inlined).  Even in
-release builds, it seems that this is significantly slower than the
libc memcpy() or memcmp() functions which can use SSE4 (or other
related technologies).

My patch adds new methods, char_cmp and char_cpy, which delegate to the
mem* functions instead of using hand-written loops.

The micro-benchmark results are very good for such a small change.

On fastdebug, before:

Benchmark                                          Mode Thr    Cnt
Sec         Mean   Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern    sample   1   2819
5        1.780        0.184  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern    sample   1    343
5       14.571        0.310  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern     sample   1   8712
5        0.526        0.138  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern   sample   1   4427
5        1.133        0.121  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern   sample   1    603
5        8.319        0.161  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern    sample   1  17185
5        0.274        0.048  msec/op

After:

Benchmark                                          Mode Thr    Cnt
Sec         Mean   Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern    sample   1   2898
5        1.812        0.208  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern    sample   1   1138
5        4.397        0.136  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern     sample   1   9035
5        0.519        0.146  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern   sample   1   4538
5        1.094        0.107  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern   sample   1   1363
5        3.686        0.100  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern    sample   1  16686
5        0.316        0.081  msec/op

On release, before:

Benchmark                                          Mode Thr    Cnt
Sec         Mean   Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern    sample   1   4030
5        1.240        0.002  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern    sample   1   1024
5        4.894        0.042  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern     sample   1  20000
5        0.185        0.002  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern   sample   1   6143
5        0.814        0.005  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern   sample   1   1852
5        2.702        0.016  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern    sample   1  20000
5        0.102        0.001  msec/op

After:

Benchmark                                          Mode Thr    Cnt
Sec         Mean   Mean error    Units
o.s.b.InternBenchmark.testLongStringChmIntern    sample   1   4040
5        1.236        0.002  msec/op
o.s.b.InternBenchmark.testLongStringJdkIntern    sample   1   2733
5        1.832        0.010  msec/op
o.s.b.InternBenchmark.testLongStringNoIntern     sample   1  20000
5        0.181        0.002  msec/op
o.s.b.InternBenchmark.testShortStringChmIntern   sample   1   6170
5        0.809        0.001  msec/op
o.s.b.InternBenchmark.testShortStringJdkIntern   sample   1   3577
5        1.396        0.007  msec/op
o.s.b.InternBenchmark.testShortStringNoIntern    sample   1  20000
5        0.102        0.000  msec/op


This is almost a 3.5x improvement on fastdebug builds, and more than a
2.5x improvement on release builds.  It is now only ~50% slower than
ConcurrentHashMap, at least for the low-contention case!  (I did not
benchmark higher numbers of threads thoroughly, but I do not think that
my changes could make that worse...)

Finally, the benchmark code:
https://github.com/stevenschlansker/jvm-intern-benchmark/blob/master/src/main/java/org/sugis/benchmark/InternBenchmark.java

It is not the best ever benchmark, but I'm hopeful that it's "good
enough" to demonstrate the wins I have.  Please let me know if you
believe the benchmark invalidates my conclusions.  It is run with JMH,
as that seems to be the standard way of doing things around here.

Thank you again for your time and input; I am hopeful that I have not
erred terribly :-)

Best,
Steven

Reply via email to