On Tue, 6 Oct 2020 10:02:09 GMT, Volker Simonis <simo...@openjdk.org> wrote:
> ### Summary > > Work around wrong usage of `ZipOutputStream.putNextEntry()` in user code > which can lead to the `ZipException "invalid > entry compressed size"`. > ### Motivation > > In general it is not safe to directly write a ZipEntry obtained from > `ZipInputStream.getNextEntry()`, > `ZipFile.entries()`, `ZipFile.getEntry()` or `ZipFile.stream()` with > `ZipOutputStream.putNextEntry()` to a > `ZipOutputStream` and then read the entries data from the `ZipInputStream` > and write it to the `ZipOutputStream` as > follows: > ZipEntry entry; > ZipInputStream zis = new ZipInputStream(...); > ZipOutputStream zos = new ZipOutputStream(...); > while((entry = zis.getNextEntry()) != null) { > zos.putNextEntry(entry); > zis.transferTo(zos); > } > The problem with this code is that the zip file format does not record the > compression level used for deflation in its > entries. In general, it doesn't even mandate a predefined compression ratio > per compression level. Therefore the > compressed size recorded in a `ZipEntry` read from a zip file might differ > from the new compressed size produced by the > receiving `ZipOutputStream`. Such a difference will result in a > `ZipException` with the following message: > java.util.zip.ZipException: invalid entry compressed size (expected 12 but > got 7 bytes) > > The correct way of copying all entries from one zip file into another > requires the creation of a new `ZipEntry` or at > least resetting of the compressed size field. E.g.: > while((entry = zis.getNextEntry()) != null) { > ZipEntry newEntry = new ZipEntry(entry.getName()); > zos.putNextEntry(newEntry); > zis.transferTo(zos); > } > or: > while((entry = zis.getNextEntry()) != null) { > entry.setCompressedSize(-1); > zos.putNextEntry(entry); > zis.transferTo(zos); > } > Unfortunately, there's a lot of user code out there which gets this wrong and > uses the bad coding pattern described > before. Searching for `"java.util.zip.ZipException: invalid entry compressed > size (expected 12 but got 7 bytes)"` gives > ~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of > instances of this anti-pattern on GitHub when > doing a code search for `ZipEntry` and `putNextEntry()`. E.g. [Gradle 4.x > wrapper task][1] is affected as well as the > latest version of the [mockableAndroidJar task][2]. I've recently fixed two > occurrences of this pattern in OpenJDK (see > [JDK-8240333][3] and [JDK-8240235][4]) but there still exist more of them > (e.g. > [`test/jdk/java/util/zip/ZipFile/CopyJar.java`][5] which is there since 1999 > :). ### Description So while this has > clearly been a problem before, it apparently wasn't painful enough to trigger > any action from the side of the JDK. > However, recently quite some zlib forks with [superior deflate/inflate > performance have evolved][6]. Using them with > OpenJDK is quite straight-forward: one just has to configure the alternative > implementations by setting > `LD_LIBRARY_PATH` or `LD_PRELOAD` correspondingly. We've seen big saving by > using these new zlib implementations for > selected services in production and the only reason why we haven't enabled > them by default until now is the problem > I've just described. The reason why these new libraries uncover the described > anti-pattern much more often is because > their compression ratio is slightly different from that of the default zlib > library. This can easily trigger a > `ZipException` even if an application is not using a different compression > levels but just a zip file created with > another zlib version. I'd therefore like to propose the following workaround > for the wrong > `ZipOutputStream.putNextEntry()` usage in user code: > - ignore the compressed size if it was implicitly determined from the zip > file and not explicitly set by calling > `ZipEntry.setCompressedSize()`. > > - Change the API-documentation of `ZipOutputStream.putNextEntry()` and > `JarOutputStream.putNextEntry()` to explain the > problem and why `putNextEntry()` will ignore the compressed size of a > `ZipEntry` if that was set implicitely when > reading that entry from a `ZipFile` or `ZipInputStream`. > > > ### Technical Details > > A zip file consists of a stream of File Entries followed by a Central > Directory (see [here for a more detailed > specification][7]). Each File Entry is composed of a Local File Header (LFH) > followed by the compressed Data and an > optional Data Descriptor. The LFH contains the File Name and among other > attributes the Compressed and Uncompressed > size and CRC of the Data. In the case where the latter three attributes are > not available at the time when the LFH is > created, this fact will be recorded in a flag of the LFH and will trigger the > creation of a Data Descriptor with the > corresponding information right after the Data section. Finally, the Central > Directory contains one Central Directory > File Header (CDFH) for each entry of the zip archive. The CDFH is an > extended version of the LFH and the ultimate > reference for the contents of the zip archive. The redundancy between LFH and > CDFH is a tribute to zip's long history > when it was used to store archives on multiple floppy discs and the CDFH > allowed to update the archive by only writing > to the last disc which contained the Central Directory. `ZipEntries` read > with `ZipInputStream.getNextEntry()` will > initially only contain the information from the LFH. Only after the next > entry was read (or after > `ZipInputStream.closeEntry()` was called explicitly), will the previously > read entry be updated with the data from the > Data Descriptor. `ZipInputStream` doesn't inspect the Central Directory at > all. On the other hand, `ZipFile` only > queries the Central Directory for `ZipEntry` information so all `ZipEntries` > returned by `ZipFile.entries()`, > `ZipFile.getEntry()` and `ZipFile.stream()` will always instantly contain the > full Compressed and Uncompressed Size and > CRC information for each entry independently of the LFH contents. ### Risks > and Assumptions If we choose to ignore > the implicitly recorded compressed size in a `ZipEntry` read from a zip file > when writing it to a `ZipOutputStream`, > this will lead to zip files with incomplete information in the LFH and an > additional Data Descriptor as described > before. However, the result is still fully compatible to the zip file > specification. It's also not unusual, because by > default all new zip files created with `ZipOutputStream` will contain LFHs > without Compressed and Uncompressed Size and > CRC information and an additional Data Descriptor. Theoretically it is > possible to create new zip files with > `ZipOutputStream` class and Compressed and Uncompressed Size and CRC > information in the LFH but that's complex and > inefficient because it requires two steps. A first step to determine the crc > and compressed size of the data and a > second step to actually write the data to the `ZipOutputStream` (which will > compress it a second time). This is because > the current API offers no possibility to write already compressed data to a > `ZipOutputStream`. Consequently, the only > straight-forward way of creating zip files from Java which have all the data > in the LFH and no Data Descriptor is by > copying `ZipEntries` from an existing zip file with the buggy method > described before. This incidentally worked more or > less reliable for a long time but breaks miserably when using different zlib > implementations. Ignoring the implicitly > set compressed size of `ZipEntries` can easily fix this problem. I'm not > aware of any tool which can not handle such > files and if it exists it would have problems with the majority of Java > created zip files anyway (e.g. all jar-files > created with the jar tool have zip entries with incomplete LFH data and > additional Data Descriptor). Ignoring the > implicitly set compressed size of `ZipEntries` has no measurable performance > impact and will increase the size of zip > archives which used to have the complete file information in the LFH before > by 16 bytes per entry. On the other hand it > will give us the freedom to use whatever zip implementation we like :) [1]: > https://github.com/gradle/gradle/blob/e76905e3a/subprojects/build-init/src/main/java/org/gradle/api/tasks/wrapper/Wrapper.java#L152-L158 > [2]: > https://android.googlesource.com/platform/tools/base/+/refs/heads/master/build-system/builder/src/main/java/com/android/builder/testing/MockableJarGenerator.java#86 > [3]: https://bugs.openjdk.java.net/browse/JDK-8240333 [4]: > https://bugs.openjdk.java.net/browse/JDK-8240235 [5]: > https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/ZipFile/CopyJar.java > [6]: > https://github.com/simonis/zlib-bench/blob/master/Results.md [7]: > https://en.wikipedia.org/wiki/Zip_(file_format) I think as we start to move forward with the review and CSR, we should update the bug and PR description as the change is not a workaround but a change in behavior of the implementation ------------- PR: https://git.openjdk.java.net/jdk/pull/520