Hi Neil,

Thanks for looking into this performance problem.

When I worked on changeset 1aaeb8fbe705 [1], I made two assumptions:

(1) code conversion is really a small portion of the overall zip
    read/write operation, and
(2) I would have time to do the same optimization for the UTF_8 charset
    as I did for the other single-byte charsets [2] within the JDK 7
    time frame.

It appears I was a little optimistic in my planning :-( I just never had
a chance to go back to my charset work after that.

But obviously, if we look only at the UTF-8 conversion performance,
using the standard UTF_8 charset is slower than the original embedded
UTF-8 conversion methods.
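
A dedicated conversion loop can skip the generic CharsetEncoder
machinery (encoder lookup, buffer wrapping, and the final copy to trim
the result array). Just to illustrate the idea, a minimal sketch of an
ASCII fast path for the encode side (not the actual code):

    static byte[] getBytesUTF8(String s) {
        int len = s.length();
        byte[] buf = new byte[len];
        int i = 0;
        for (; i < len; i++) {
            char c = s.charAt(i);
            if (c >= 0x80)
                break;              // non-ASCII char, leave the fast path
            buf[i] = (byte) c;
        }
        if (i == len)
            return buf;             // all-ASCII: one exact-sized allocation
        // general case: fall back to the charset machinery
        return s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }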

The patch definitely improves the situation. I did a simple measurement
of ZipFile.getEntry(String), as shown below, on a jar file with a single
entry named "abcdefghijklmnopqrstuvwxyz".

    import java.util.zip.ZipFile;

    public class GetEntryBench {
        public static void main(String[] args) throws Throwable {
            ZipFile zf = new ZipFile(args[0]);
            String ename = args[1];
            if (zf.getEntry(ename) == null) {
                System.out.println("no such entry: " + ename);
                System.exit(1);
            }
            // warmup
            for (int i = 0; i < 10000; i++) {
                zf.getEntry(ename);
            }
            // timed run
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < 100000; i++) {
                zf.getEntry(ename);
            }
            long t1 = System.currentTimeMillis();
            System.out.printf("time=%d%n", t1 - t0);
            zf.close();
        }
    }

It shows a WithoutPatch/WithPatch time ratio of 125/107. (This is not an
accurate measurement of the UTF-8 conversion itself, but it gives us a
rough idea.)

However, the approach

(1) is only about zc.getBytesUTF8(); we should do the same thing for
zc.toStringUTF8() (see the decode-side sketch after this list). And
ZipInputStream/ZipOutputStream also use ZipCoder.

(2) is somewhat against the direction we are planning for a future
release, in which we might want to bring most of the ZipFile access code
up to the Java level from the native C code. We already have a pure-Java
ZipFile implementation (no zlib yet) in the zip filesystem code at
src/share/demo/nio/zipfs, and we should probably do the same for ZipFile
as well. Of course we will have to keep the native code anyway for VM
access during boot.

(3) would probably have to be backed out again once we put the UTF-8
optimization work in. I do have the UTF_8 optimization code ready
(though not fully tested); running the same test above with it, I got a
number around 85.
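
For the decode side mentioned in (1), the mirror-image fast path would
look roughly like this (illustrative only, not the actual code):

    static String toStringUTF8(byte[] ba, int len) {
        char[] cs = new char[len];
        int i = 0;
        for (; i < len; i++) {
            if (ba[i] < 0)
                break;              // non-ASCII byte, leave the fast path
            cs[i] = (char) ba[i];
        }
        if (i == len)
            return new String(cs);  // all-ASCII
        // general case: fall back to the charset machinery
        return new String(ba, 0, len, java.nio.charset.StandardCharsets.UTF_8);
    }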

The webrev for the utf-8 optimization is at
http://cr.openjdk.java.net/~sherman/utf8Str/webrev
with the "improvement" shown at [3] [4].
Any code review volunteers? :-) Maybe it's just too late for 7.

-Sherman

[1] http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/1aaeb8fbe705
[2] http://blogs.sun.com/xuemingshen/entry/faster_new_string_bytes_cs
[3] http://cr.openjdk.java.net/~sherman/utf8Str/client
[4] http://cr.openjdk.java.net/~sherman/utf8Str/server


On 04/25/2011 09:45 AM, Neil Richards wrote:
Hi,
Changeset 1aaeb8fbe705 [1] implemented a good set of enhancements to
allow the zip code to properly support zip files containing names and
comments that are either encoded in non-UTF-8 codesets or contain
supplementary characters.

This support is achieved by doing the character conversion in Java code
(using the code in ZipCoder.getBytes()), then feeding the resulting
byte[] through to native code.

This approach is more costly, both in terms of memory allocation and
amount of data copying, than the preceding one, which did the conversion
directly into a native byte[] using JNI's 'GetStringUTFRegion', which
assumes a codeset of "modified UTF-8".

The conversion performed for "modified UTF-8" is the same as that for
standard UTF-8 for all codepoints except those for supplementary
characters or for the character '\u0000' [2].
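
For illustration (this snippet is not from the changeset; it relies on
the fact that DataOutputStream.writeUTF() emits modified UTF-8, after a
two-byte length prefix):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws Exception {
            String s = "\u0000\uD83D\uDE00"; // '\u0000' plus one supplementary character

            // Standard UTF-8: 0x00 for '\u0000', one 4-byte sequence
            // for the supplementary character.
            for (byte b : s.getBytes("UTF-8"))
                System.out.printf("%02X ", b);      // 00 F0 9F 98 80
            System.out.println();

            // Modified UTF-8: 0xC0 0x80 for '\u0000', and each UTF-16
            // surrogate encoded separately as a 3-byte sequence.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] mod = bos.toByteArray();
            for (int i = 2; i < mod.length; i++)    // skip the length prefix
                System.out.printf("%02X ", mod[i]); // C0 80 ED A0 BD ED B8 80
            System.out.println();
        }
    }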

As the "common-case" is likely to be dealing with zip files with UTF-8
encoded names and comments containing neither supplementary characters
nor '\u0000', the increase in cost of the conversion in this case is
unfortunate.

By introducing an extra check to determine whether a UTF-8 encoded
string may be safely converted using "modified UTF-8" (i.e. checking
that it contains neither supplementary codepoints nor '\u0000'), and
using the aforementioned JNI routine to do the conversion when the check
succeeds, I've found that a decent performance improvement can be
achieved by catering for this common-case.
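
A minimal sketch of such a check (simplified; the actual code in the
changeset below may differ in detail):

    static boolean isSafeToUseModifiedUTF8(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // '\u0000' and UTF-16 surrogates (the halves of supplementary
            // characters) are the only chars whose modified UTF-8 form
            // differs from standard UTF-8.
            if (c == '\u0000'
                || (c >= Character.MIN_SURROGATE && c <= Character.MAX_SURROGATE))
                return false;
        }
        return true;
    }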

This benefit is particularly significant for applications which do a lot
of rummaging around in zip files, or those involving lots of jar files
and/or jar files with lots of entries in them.

Please find below a changeset which implements the check I describe
above ("isSafeToUseModifiedUTF8"), and uses it to choose whether to call
a new modified-UTF8-specific variant of the native getEntry() method
("getEntryByModifiedUTF8"), which makes use of the JNI
'GetStringUTFRegion' method for the conversion.
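
The dispatch in ZipFile.getEntry() then takes roughly this shape
(illustrative; the exact native signatures in the changeset may differ):

    long jzentry = isSafeToUseModifiedUTF8(name)
                 ? getEntryByModifiedUTF8(jzfile, name, true)    // JNI GetStringUTFRegion path
                 : getEntry(jzfile, zc.getBytes(name), true);    // existing ZipCoder path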

Please let me know if you have any comments, criticisms or suggestions
on this,
Thanks,
Neil

[1] http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/1aaeb8fbe705
[2] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/#Modified_UTF-8

