Hi Neil,

Thanks for looking into this performance problem.

When I worked on changeset 1aaeb8fbe705 [1], I made two assumptions:

(1) code conversion is really a small portion of the overall zip
    read/write operation, and
(2) I would have time to do the same optimization for the UTF_8 charset
    as I did for the other single-byte charsets [2] within the JDK 7
    time frame.

It appears I was a little optimistic in my planning :-( I just never had
a chance to go back to my charset work after that.

But obviously, if we look only at the UTF-8 conversion performance,
using the standard UTF_8 charset is slower than the original embedded
UTF-8 conversion methods.
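
A dedicated conversion loop can skip the generic CharsetEncoder
machinery (encoder lookup, buffer wrapping, and the final copy to trim
the result array). Just to illustrate the idea, a minimal sketch of an
ASCII fast path for the encode side (not the actual code):

    static byte[] getBytesUTF8(String s) {
        int len = s.length();
        byte[] buf = new byte[len];
        int i = 0;
        for (; i < len; i++) {
            char c = s.charAt(i);
            if (c >= 0x80)
                break;              // non-ASCII char, leave the fast path
            buf[i] = (byte) c;
        }
        if (i == len)
            return buf;             // all-ASCII: one exact-sized allocation
        // general case: fall back to the charset machinery
        return s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }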

The patch definitely improves the situation. I did a simple measurement
of ZipFile.getEntry(String), as shown below, on a jar file with a single
entry named "abcdefghijklmnopqrstuvwxyz".

    import java.util.zip.ZipFile;

    public class GetEntryBench {
        public static void main(String[] args) throws Throwable {
            ZipFile zf = new ZipFile(args[0]);
            String ename = args[1];
            if (zf.getEntry(ename) == null) {
                System.out.println("no such entry: " + ename);
                System.exit(1);
            }
            // warmup
            for (int i = 0; i < 10000; i++) {
                zf.getEntry(ename);
            }
            // timed run
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < 100000; i++) {
                zf.getEntry(ename);
            }
            long t1 = System.currentTimeMillis();
            System.out.printf("time=%d%n", t1 - t0);
            zf.close();
        }
    }

It shows a WithoutPatch/WithPatch time ratio of 125/107. (This is not an
accurate measurement of the UTF-8 conversion itself, but it gives us a
rough idea.)

However, the approach

(1) is only about zc.getBytesUTF8(); we should do the same thing for
zc.toStringUTF8() (see the decode-side sketch after this list). And
ZipInputStream/ZipOutputStream also use ZipCoder.

(2) is somewhat against the direction we are planning for a future
release, in which we might want to bring most of the ZipFile access code
up to the Java level from the native C code. We already have a pure-Java
ZipFile implementation (no zlib yet) in the zip filesystem code at
src/share/demo/nio/zipfs, and we should probably do the same for ZipFile
as well. Of course we will have to keep the native code anyway for VM
access during boot.

(3) would probably have to be backed out again once we put the UTF-8
optimization work in. I do have the UTF_8 optimization code ready
(though not fully tested); running the same test above with it, I got a
number around 85.
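
For the decode side mentioned in (1), the mirror-image fast path would
look roughly like this (illustrative only, not the actual code):

    static String toStringUTF8(byte[] ba, int len) {
        char[] cs = new char[len];
        int i = 0;
        for (; i < len; i++) {
            if (ba[i] < 0)
                break;              // non-ASCII byte, leave the fast path
            cs[i] = (char) ba[i];
        }
        if (i == len)
            return new String(cs);  // all-ASCII
        // general case: fall back to the charset machinery
        return new String(ba, 0, len, java.nio.charset.StandardCharsets.UTF_8);
    }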

The webrev for the utf-8 optimization is at
http://cr.openjdk.java.net/~sherman/utf8Str/webrev
with the "improvement" shown at [3] [4].
Any code review volunteers? :-) Maybe it's just too late for 7.

-Sherman

[1] http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/1aaeb8fbe705
[2] http://blogs.sun.com/xuemingshen/entry/faster_new_string_bytes_cs
[3] http://cr.openjdk.java.net/~sherman/utf8Str/client
[4] http://cr.openjdk.java.net/~sherman/utf8Str/server


On 04/25/2011 09:45 AM, Neil Richards wrote:
Hi,
Changeset 1aaeb8fbe705 [1] implemented a good set of enhancements to
allow the zip code to properly support zip files containing names and
comments that are either encoded in non-UTF-8 codesets or contain
supplementary characters.

This support is achieved by doing the character conversion in Java code
(using the code in ZipCoder.getBytes()), then feeding the resulting
byte[] through to native code.

This approach is more costly, both in terms of memory allocation and
amount of data copying, than the preceding one, which did the conversion
directly into a native byte[] using JNI's 'GetStringUTFRegion', which
assumes a codeset of "modified UTF-8".

The conversion performed for "modified UTF-8" is the same as that for
standard UTF-8 for all codepoints except those for supplementary
characters or for the character '\u0000' [2].
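
For illustration (this snippet is not from the changeset; it relies on
the fact that DataOutputStream.writeUTF() emits modified UTF-8, after a
two-byte length prefix):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws Exception {
            String s = "\u0000\uD83D\uDE00"; // '\u0000' plus one supplementary character

            // Standard UTF-8: 0x00 for '\u0000', one 4-byte sequence
            // for the supplementary character.
            for (byte b : s.getBytes("UTF-8"))
                System.out.printf("%02X ", b);      // 00 F0 9F 98 80
            System.out.println();

            // Modified UTF-8: 0xC0 0x80 for '\u0000', and each UTF-16
            // surrogate encoded separately as a 3-byte sequence.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] mod = bos.toByteArray();
            for (int i = 2; i < mod.length; i++)    // skip the length prefix
                System.out.printf("%02X ", mod[i]); // C0 80 ED A0 BD ED B8 80
            System.out.println();
        }
    }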

As the "common-case" is likely to be dealing with zip files with UTF-8
encoded names and comments containing neither supplementary characters
nor '\u0000', the increase in cost of the conversion in this case is
unfortunate.

By introducing an extra check to determine whether a UTF-8 encoded
string may be safely converted using "modified UTF-8" (i.e. checking
that it contains neither supplementary codepoints nor '\u0000'), and
using the aforementioned JNI routine to do the conversion when the check
succeeds, I've found that a decent performance improvement can be
achieved by catering for this common-case.
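
A minimal sketch of such a check (simplified; the actual code in the
changeset below may differ in detail):

    static boolean isSafeToUseModifiedUTF8(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // '\u0000' and UTF-16 surrogates (the halves of supplementary
            // characters) are the only chars whose modified UTF-8 form
            // differs from standard UTF-8.
            if (c == '\u0000'
                || (c >= Character.MIN_SURROGATE && c <= Character.MAX_SURROGATE))
                return false;
        }
        return true;
    }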

This benefit is particularly significant for applications which do a lot
of rummaging around in zip files, or those involving lots of jar files
and/or jar files with lots of entries in them.

Please find below a changeset which implements the check I describe
above ("isSafeToUseModifiedUTF8"), and uses it to choose whether to call
a new modified-UTF8-specific variant of the native getEntry() method
("getEntryByModifiedUTF8"), which makes use of the JNI
'GetStringUTFRegion' method for the conversion.
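
The dispatch in ZipFile.getEntry() then takes roughly this shape
(illustrative; the exact native signatures in the changeset may differ):

    long jzentry = isSafeToUseModifiedUTF8(name)
                 ? getEntryByModifiedUTF8(jzfile, name, true)    // JNI GetStringUTFRegion path
                 : getEntry(jzfile, zc.getBytes(name), true);    // existing ZipCoder path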

Please let me know if you have any comments, criticisms or suggestions
on this,
Thanks,
Neil

[1] http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/1aaeb8fbe705
[2] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/#Modified_UTF-8

