Hi core-libs-dev,

While profiling my application, I discovered that nearly 50% of the time spent
deserializing JSON went to String.intern().  I understand that interning
Strings is generally not the best approach, but I think I have a decent use
case -- the value of a certain field is one of a very limited number of valid
values (that are not known at compile time, so I cannot use an Enum), and it is
repeated many millions of times in the JSON stream.

I discovered that replacing String.intern() with a ConcurrentHashMap improved 
performance by almost an order of magnitude.
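
A minimal sketch of what such a replacement could look like, for illustration
only.  The class name MapInterner and its methods are hypothetical, and this is
not the exact code that was measured.  It assumes the set of distinct values is
small, since the map holds strong references and never evicts anything.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not part of the JDK): canonicalizes equal Strings
// to one shared instance using a ConcurrentHashMap instead of String.intern().
final class MapInterner {
    private final ConcurrentHashMap<String, String> pool =
            new ConcurrentHashMap<>();

    // Returns the canonical instance for s; the first instance stored wins.
    String intern(String s) {
        // Common case: the value has been seen before, so a plain read suffices.
        String existing = pool.get(s);
        if (existing != null) {
            return existing;
        }
        // First sighting: putIfAbsent resolves races between threads.
        existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    // Number of distinct values currently held.
    int size() {
        return pool.size();
    }
}
```

Unlike String.intern(), the instances returned here are canonical only within
one MapInterner, so they are not guaranteed to be identical to String literals
or to the results of String.intern().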

I'm not the only person who has discovered this and been surprised: 
http://stackoverflow.com/questions/10624232/performance-penalty-of-string-intern

I've been excited about starting to contribute to OpenJDK, so I am thinking 
that this might be a fun project for me to take on and then contribute back.  
But I figured I should check in on the list before spending a lot of time 
tracking this down.  I have a couple of preparatory questions:

* Has this bottleneck been examined thoroughly before?  Am I wishing too hard 
for performance here?

* String.intern() is currently a native method.  My understanding is that there
is a nontrivial penalty to invoking native methods (at least via JNI; I am not
sure whether this is also true for "built ins").  I assume it is native so that
Strings interned from Java share the same table as Strings interned from C++
within the JVM itself.  Is this an actual requirement, or could String.intern()
be replaced with Java code?

* If the interning itself must be handled by a symbol table in C++ land as it 
is today, would a "second level cache" in Java land that invokes a native 
"intern0" method be acceptable, so that there is a low-penalty "fast path"?  If 
so, this would involve a nonzero memory cost, although I assume that a few 
thousand references inside of a Map is an OK price to pay for a (for example) 
5x speedup.

* I assume the String class itself is loaded at a very sensitive time during VM 
initialization.  Having String initialization trigger (for example) 
ConcurrentHashMap class initialization may cause problems or circularities.  If 
this is the case, would triggering such a load lazily on the first intern() 
call be "late enough" as to not cause problems?
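
To make the last two questions concrete, here is a rough sketch of the
"second level cache" idea combined with lazy initialization.  Everything in it
is an assumption for illustration: CachedInterner and Holder are hypothetical
names, the existing public String.intern() stands in for the proposed native
"intern0", and the code lives outside java.lang.String, so it says nothing
about whether this would be safe during VM startup.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a Java-level cache in front of the VM's intern table.
final class CachedInterner {
    private CachedInterner() {}

    // Holder idiom: Holder is initialized (and the map created) only when
    // intern() first touches Holder.CACHE, not when CachedInterner is.
    private static final class Holder {
        static final ConcurrentHashMap<String, String> CACHE =
                new ConcurrentHashMap<>();
    }

    static String intern(String s) {
        // Fast path: a plain map lookup, no native call.
        String cached = Holder.CACHE.get(s);
        if (cached != null) {
            return cached;
        }
        // Slow path: String.intern() stands in for the hypothetical intern0.
        String canonical = s.intern();
        // Remember the VM's canonical instance so later lookups stay in Java.
        Holder.CACHE.putIfAbsent(canonical, canonical);
        return canonical;
    }
}
```

Because only the VM's canonical instance is ever stored, the result is the same
object that String.intern() would return.  The cost is that the cache keeps
strong references, so cached Strings can never be collected unless the cache is
bounded or uses weak references.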

I'm sure that if I get anywhere with this I will have more questions, but this 
should get me started. Thank you for any advice / insight you may be able to 
provide!

Steven
