RFR: 8243254: Examine ZipFile slash optimization for non-ASCII compatible charsets

Claes Redestad Tue, 21 Apr 2020 06:52:28 -0700

Hi,

ZipFile.getEntry does optimizations to check for directory entries by
adding a '/' to the encoded byte array. JDK-8242959 improved on this
optimization, but a question was raised Jason Zaugg on whether the
optimization is always valid[1]. Turns out it isn't, but only for
non-ASCII compatible charsets.


While JarFiles are always assumed to be UTF-8-encoded or compatible,
ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as
either 2F 00 or 00 2F, which means the hash code would differ and a
directory "foo/" potentially not be found when looking up "foo". Further
complications arise when/if the directory name ends with a code point
that might be encoded so that the final byte is 2F, e.g. \u012F.

This patch only enables this low-level optimization when the charset
encoding used is known to be ASCII compatible in the sense that 2F will
be encoded as single-byte 2F. UTF-8 is compatible in this sense - and
since this is the charset exclusively used by JarFile these changes have
little to no effect on startup performance in the cases we've been
looking at.

Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/
Bug:    https://bugs.openjdk.java.net/browse/JDK-8243254

I've partially repurposed ZipFile/TestZipFile to test some such corner
cases. Mainly expanded it to test ZipFiles created with various non-
standard encodings, but also slimmed it down so that it can actually run
quickly and reliably - as well as enabled it in regular testing. The
updated test fails both before and after JDK-8242959, but passes with
the proposed changes.

Testing: tier1+2

Thanks!

/Claes

[1] https://twitter.com/retronym/status/1252134723431747584?s=20

RFR: 8243254: Examine ZipFile slash optimization for non-ASCII compatible charsets

Reply via email to