On 12/4/16, 1:21 PM, Scott Palmer wrote:
Excuse me if this is the wrong list for this discussion.  Please direct me to 
the right place if this isn’t it.

When doing an analysis of garbage generation in our application we discovered a 
significant number of redundant strings generated by the class loader.  In my 
case there are hundreds of jars on the classpath - everything in the 
application is a plugin.  I figured on average 10kB of useless garbage char[]s 
were generated per findResource call for plugin resources.

This is caused mostly by the ZipFile implementation.  What is the purpose of 
java.util.zip.ZipCoder’s byte[] getBytes(String s) method?  It seems to simply 
be a custom implementation of String.getBytes(Charset cs) and as such needs to 
first make a copy of the char[] to work on.

The "entry name" stored in a zip/jar file is not encoded as a UTF-16 char sequence but as bytes in some "native" encoding. UTF-8 is one of the encodings ZipFile supports, and it is the default for a jar file. So when you look up a resource in a jar file by a name given as a String object, we have to encode that name from String into the corresponding byte[] in UTF-8 and then do a hash table lookup to find the resource. Here are some implementation details:

(1) Why do we need a "custom" version in ZipFile? Because String.getBytes(cs) silently replaces unmappable/malformed chars with "?", whereas the ZipFile API needs to throw a corresponding exception in this scenario. So we have to have a "custom" version to do it.

(2) For performance reasons we don't want to convert all jar entry names in all open jar files into either String or char[] in advance. They are kept as byte[] in their original form, and we don't even have a separate byte[] copy for each entry name: all names stay in their original "cen" table form in a byte[], and we only keep an offset to each entry within that table. We are talking about hundreds of jars, and each jar has hundreds if not thousands of entries. Arguably we could do it the other way around and always convert the entry names in each open jar file to String, so that we would not have to do the String->byte[] conversion during lookup. It's a design decision. If there is enough evidence to suggest otherwise, it can be changed, given that we now have all of the implementation at the Java level in jdk9.
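
To make point (1) concrete, here is a minimal sketch of the difference between the lenient and the strict encoding behaviour. It is not the actual ZipCoder code; the class and method names (StrictNameEncoding, lenient, strict) are made up for illustration, and only public java.nio.charset APIs are used.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of the JDK.
class StrictNameEncoding {

    // Lenient: String.getBytes substitutes a replacement byte for input
    // it cannot encode and never reports the problem to the caller.
    static byte[] lenient(String name) {
        return name.getBytes(StandardCharsets.UTF_8);
    }

    // Strict: an encoder configured with REPORT throws a
    // CharacterCodingException instead of substituting anything.
    static byte[] strict(String name) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(name));
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }

    public static void main(String[] args) {
        String bad = "a\uD800b"; // a lone high surrogate cannot be encoded as UTF-8
        System.out.println("lenient length: " + lenient(bad).length);
        try {
            strict(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict rejected the name: " + e);
        }
    }
}
```

The strict variant allocates an encoder, a CharBuffer wrapper and a ByteBuffer per call, which gives a feel for where the per-lookup garbage comes from.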

That said, given the optimization we have done for String in jdk9, it might be worth considering a fast path for ASCII-only entry names (I would assume 99.9%+ of entry names are ASCII-only in the real world). It should then take a simple byte[] copy to convert/encode those entry names from String to byte[].
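
A rough sketch of what such a fast path could look like when written against the public String API. The names here (AsciiFastPath, encodeName) are hypothetical. Code inside the JDK could presumably copy the String's internal byte[] directly, which this sketch cannot do, so it copies char by char instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of the JDK.
class AsciiFastPath {

    // Encodes an entry name as UTF-8. For ASCII-only names, each char maps
    // to exactly one byte, so no encoder is needed.
    static byte[] encodeName(String name) {
        int len = name.length();
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            char c = name.charAt(i);
            if (c >= 0x80) {
                // Slow path for the rare non-ASCII name. A real implementation
                // would use a strict encoder here; getBytes keeps the sketch short.
                return name.getBytes(StandardCharsets.UTF_8);
            }
            out[i] = (byte) c;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encodeName("com/example/Plugin.class").length);
    }
}
```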

sherman

  This combined with the need to operate on byte[] path names internally in the 
ZipFile implementation means that URLClassLoader generates a lot of unnecessary 
garbage in a findResource call - proportional to the number of jars on the 
classpath.

Since JarFile forces the ZipFile to always be opened with UTF-8, if there was 
some API exposed that took a byte[] for the resource name, all of that extra 
string copying and encoding could be hoisted out of the loop in 
sun.misc.URLClassPath. Would it be worth creating an internal class, 
something like a ‘ClasspathJarFile’, and tweaking ZipFile so the byte[] based 
method is protected instead of private?
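
The hoisting idea could be sketched roughly as follows. Everything here is hypothetical: JarIndex stands in for an open jar with a byte[]-based lookup, and findAll stands in for the loop over the classpath. Neither corresponds to an existing JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of the JDK.
class HoistedLookup {

    // Stand-in for one open jar: entries are indexed by their UTF-8 name bytes.
    // ByteBuffer is used as the key because its equals/hashCode compare content.
    static final class JarIndex {
        private final Map<ByteBuffer, String> entries = new HashMap<>();

        JarIndex(String jarName, String... entryNames) {
            for (String n : entryNames) {
                byte[] key = n.getBytes(StandardCharsets.UTF_8);
                entries.put(ByteBuffer.wrap(key), jarName + "!/" + n);
            }
        }

        // byte[]-based lookup: no String encoding happens in here.
        String find(byte[] encodedName) {
            return entries.get(ByteBuffer.wrap(encodedName));
        }
    }

    // The name is encoded once, outside the loop, and the same byte[]
    // is reused for every jar on the classpath.
    static List<String> findAll(List<JarIndex> classpath, String name) {
        byte[] encoded = name.getBytes(StandardCharsets.UTF_8);
        List<String> hits = new ArrayList<>();
        for (JarIndex jar : classpath) {
            String hit = jar.find(encoded);
            if (hit != null) {
                hits.add(hit);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<JarIndex> classpath = List.of(
                new JarIndex("a.jar", "plugin.xml", "icons/a.png"),
                new JarIndex("b.jar", "plugin.xml"));
        System.out.println(findAll(classpath, "plugin.xml"));
    }
}
```

The sketch still allocates one small ByteBuffer wrapper per jar, but the encoding work and the byte[] copy no longer scale with the number of jars.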

I also noticed that sun.net.www.ParseUtil.encodePath(String, boolean) usually 
had nothing useful to do but still made three copies of the string passed in 
anyway (two char arrays to work on, and the String returned).



Cheers,

Scott

