RFR: 8253952: Work around wrong usage of ZipOutputStream.putNextEntry() in user code

Volker Simonis Tue, 06 Oct 2020 06:52:09 -0700

### Summary

Work around wrong usage of `ZipOutputStream.putNextEntry()` in user code which 
can lead to the `ZipException "invalid
entry compressed size"`.


### Motivation

In general it is not safe to directly write a ZipEntry obtained from 
`ZipInputStream.getNextEntry()`,
`ZipFile.entries()`, `ZipFile.getEntry()` or `ZipFile.stream()` with 
`ZipOutputStream.putNextEntry()` to a
`ZipOutputStream` and then read the entries data from the `ZipInputStream` and 
write it to the `ZipOutputStream` as
follows:
 ZipEntry entry;
 ZipInputStream zis = new ZipInputStream(...);
 ZipOutputStream zos = new ZipOutputStream(...);
 while((entry = zis.getNextEntry()) != null) {
     zos.putNextEntry(entry);
     zis.transferTo(zos);
 }
The problem with this code is that the zip file format does not record the 
compression level used for deflation in its
entries. In general, it doesn't even mandate a predefined compression ratio per 
compression level. Therefore the
compressed size recorded in a `ZipEntry` read from a zip file might differ from 
the new compressed size produced by the
receiving `ZipOutputStream`. Such a difference will result in a `ZipException` 
with the following message:
 java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 
7 bytes)
 
The correct way of copying all entries from one zip file into another requires 
the creation of a new `ZipEntry` or at
least resetting of the compressed size field. E.g.:
 while((entry = zis.getNextEntry()) != null) {
     ZipEntry newEntry = new ZipEntry(entry.getName());
     zos.putNextEntry(newEntry);
     zis.transferTo(zos);
 }
or:
 while((entry = zis.getNextEntry()) != null) {
     entry.setCompressedSize(-1);
     zos.putNextEntry(entry);
     zis.transferTo(zos);
 }
Unfortunately, there's a lot of user code out there which gets this wrong and 
uses the bad coding pattern described
before. Searching for `"java.util.zip.ZipException: invalid entry compressed 
size (expected 12 but got 7 bytes)"` gives
~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of 
instances of this anti-pattern on GitHub when
doing a code search for `ZipEntry` and `putNextEntry()`. E.g. [Gradle 4.x 
wrapper task][1] is affected as well as the
latest version of the [mockableAndroidJar task][2]. I've recently fixed two 
occurrences of this pattern in OpenJDK (see
[JDK-8240333][3] and [JDK-8240235][4]) but there still exist more of them (e.g.
[`test/jdk/java/util/zip/ZipFile/CopyJar.java`][5] which is there since 1999 :).

### Description

So while this has clearly been a problem before, it apparently wasn't painful 
enough to trigger any action from the
side of the JDK. However, recently quite some zlib forks with [superior 
deflate/inflate performance have evolved][6].
Using them with OpenJDK is quite straight-forward: one just has to configure 
the alternative implementations by setting
`LD_LIBRARY_PATH` or `LD_PRELOAD` correspondingly. We've seen big saving by 
using these new zlib implementations for
selected services in production and the only reason why we haven't enabled them 
by default until now is the problem
I've just described. The reason why these new libraries uncover the described 
anti-pattern much more often is because
their compression ratio is slightly different from that of the default zlib 
library. This can easily trigger a
`ZipException` even if an application is not using a different compression 
levels but just a zip file created with
another zlib version.

I'd therefore like to propose the following workaround for the wrong 
`ZipOutputStream.putNextEntry()` usage in user
code:

-  ignore the compressed size if it was implicitly determined from the zip file 
and not explicitly set by calling
   `ZipEntry.setCompressedSize()`.

- Change the API-documentation of `ZipOutputStream.putNextEntry()` and 
`JarOutputStream.putNextEntry()` to explain the
  problem and why `putNextEntry()` will ignore the compressed size of a 
`ZipEntry` if that was set implicitely when
  reading that entry from a `ZipFile` or `ZipInputStream`.


### Technical Details

A zip file consists of a stream of File Entries followed by a Central Directory 
(see [here for a more detailed
specification][7]). Each File Entry is composed of a Local File Header (LFH) 
followed by the compressed Data and an
optional Data Descriptor. The LFH contains the File Name and among other 
attributes the Compressed and Uncompressed
size and CRC of the Data. In the case where the latter three attributes are not 
available at the time when the LFH is
created, this fact will be recorded in a flag of the LFH and will trigger the 
creation of a Data Descriptor with the
corresponding information right after the Data section. Finally, the Central 
Directory contains one Central Directory
File Header (CDFH) for each entry of the  zip archive. The CDFH is an extended 
version of the LFH and the ultimate
reference for the contents of the zip archive. The redundancy between LFH and 
CDFH is a tribute to zip's long history
when it was used to store archives on multiple floppy discs and the CDFH 
allowed to update the archive by only writing
to the last disc which contained the Central Directory.

`ZipEntries` read with `ZipInputStream.getNextEntry()` will initially only 
contain the information from the LFH. Only
after the next entry was read (or after `ZipInputStream.closeEntry()` was 
called explicitly), will the previously read
entry be updated with the data from the Data Descriptor. `ZipInputStream` 
doesn't inspect the Central Directory at all.

On the other hand, `ZipFile` only queries the Central Directory for `ZipEntry` 
information so all `ZipEntries` returned
by `ZipFile.entries()`, `ZipFile.getEntry()` and `ZipFile.stream()` will always 
instantly contain the full Compressed
and Uncompressed Size and CRC information for each entry independently of the 
LFH contents.

### Risks and Assumptions

If we choose to ignore the implicitly recorded compressed size in a `ZipEntry` 
read from a zip file when writing it to
a `ZipOutputStream`, this will lead to zip files with incomplete information in 
the LFH and an additional Data
Descriptor as described before. However, the result is still fully compatible 
to the zip file specification. It's also
not unusual, because by default all new zip files created with 
`ZipOutputStream` will contain LFHs without Compressed
and Uncompressed Size and CRC information and an additional Data Descriptor. 
Theoretically it is possible to create new
zip files with `ZipOutputStream` class and Compressed and Uncompressed Size and 
CRC information in the LFH but that's
complex and inefficient because it requires two steps. A first step to 
determine the crc and compressed size of the
data and a second step to actually write the data to the `ZipOutputStream` 
(which will compress it a second time). This
is because the current API offers no possibility to write already compressed 
data to a `ZipOutputStream`.

Consequently, the only straight-forward way of creating zip files from Java 
which have all the data in the LFH and no
Data Descriptor is by copying `ZipEntries` from an existing zip file with the 
buggy method described before. This
incidentally worked more or less reliable for a long time but breaks miserably 
when using different zlib
implementations. Ignoring the implicitly set compressed size of `ZipEntries` 
can easily fix this problem.

I'm not aware of any tool which can not handle such files and if it exists it 
would have problems with the majority of
Java created zip files anyway (e.g. all jar-files created with the jar tool 
have zip entries with incomplete LFH data
and additional Data Descriptor).

Ignoring the implicitly set compressed size of `ZipEntries` has no measurable 
performance impact and will increase the
size of zip archives which used to have the complete file information in the 
LFH before by 16 bytes per entry. On the
other hand it will give us the freedom to use whatever zip implementation we 
like :)


[1]:
https://github.com/gradle/gradle/blob/e76905e3a/subprojects/build-init/src/main/java/org/gradle/api/tasks/wrapper/Wrapper.java#L152-L158
[2]:
https://android.googlesource.com/platform/tools/base/+/refs/heads/master/build-system/builder/src/main/java/com/android/builder/testing/MockableJarGenerator.java#86
[3]: https://bugs.openjdk.java.net/browse/JDK-8240333 [4]: 
https://bugs.openjdk.java.net/browse/JDK-8240235 [5]:
https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/ZipFile/CopyJar.java
 [6]:
https://github.com/simonis/zlib-bench/blob/master/Results.md [7]: 
https://en.wikipedia.org/wiki/Zip_(file_format)

-------------

Commit messages:
 - 8253952: Work around wrong usage of ZipOutputStream.putNextEntry() in user 
code

Changes: https://git.openjdk.java.net/jdk/pull/520/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=520&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8253952
  Stats: 244 lines in 4 files changed: 242 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/520.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/520/head:pull/520

PR: https://git.openjdk.java.net/jdk/pull/520

RFR: 8253952: Work around wrong usage of ZipOutputStream.putNextEntry() in user code

Reply via email to