On Tue, 13 Oct 2020 19:07:33 GMT, Volker Simonis <simo...@openjdk.org> wrote:
>> ### Summary >> >> Work around wrong usage of `ZipOutputStream.putNextEntry()` in user code >> which can lead to the `ZipException "invalid >> entry compressed size"`. >> ### Motivation >> >> In general it is not safe to directly write a ZipEntry obtained from >> `ZipInputStream.getNextEntry()`, >> `ZipFile.entries()`, `ZipFile.getEntry()` or `ZipFile.stream()` with >> `ZipOutputStream.putNextEntry()` to a >> `ZipOutputStream` and then read the entries data from the `ZipInputStream` >> and write it to the `ZipOutputStream` as >> follows: >> ZipEntry entry; >> ZipInputStream zis = new ZipInputStream(...); >> ZipOutputStream zos = new ZipOutputStream(...); >> while((entry = zis.getNextEntry()) != null) { >> zos.putNextEntry(entry); >> zis.transferTo(zos); >> } >> The problem with this code is that the zip file format does not record the >> compression level used for deflation in its >> entries. In general, it doesn't even mandate a predefined compression ratio >> per compression level. Therefore the >> compressed size recorded in a `ZipEntry` read from a zip file might differ >> from the new compressed size produced by the >> receiving `ZipOutputStream`. Such a difference will result in a >> `ZipException` with the following message: >> java.util.zip.ZipException: invalid entry compressed size (expected 12 but >> got 7 bytes) >> >> The correct way of copying all entries from one zip file into another >> requires the creation of a new `ZipEntry` or at >> least resetting of the compressed size field. E.g.: >> while((entry = zis.getNextEntry()) != null) { >> ZipEntry newEntry = new ZipEntry(entry.getName()); >> zos.putNextEntry(newEntry); >> zis.transferTo(zos); >> } >> or: >> while((entry = zis.getNextEntry()) != null) { >> entry.setCompressedSize(-1); >> zos.putNextEntry(entry); >> zis.transferTo(zos); >> } >> Unfortunately, there's a lot of user code out there which gets this wrong >> and uses the bad coding pattern described >> before. Searching for `"java.util.zip.ZipException: invalid entry compressed >> size (expected 12 but got 7 bytes)"` gives >> ~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of >> instances of this anti-pattern on GitHub when >> doing a code search for `ZipEntry` and `putNextEntry()`. E.g. [Gradle 4.x >> wrapper task][1] is affected as well as the >> latest version of the [mockableAndroidJar task][2]. I've recently fixed two >> occurrences of this pattern in OpenJDK (see >> [JDK-8240333][3] and [JDK-8240235][4]) but there still exist more of them >> (e.g. >> [`test/jdk/java/util/zip/ZipFile/CopyJar.java`][5] which is there since 1999 >> :). ### Description So while this has >> clearly been a problem before, it apparently wasn't painful enough to >> trigger any action from the side of the JDK. >> However, recently quite some zlib forks with [superior deflate/inflate >> performance have evolved][6]. Using them with >> OpenJDK is quite straight-forward: one just has to configure the alternative >> implementations by setting >> `LD_LIBRARY_PATH` or `LD_PRELOAD` correspondingly. We've seen big saving by >> using these new zlib implementations for >> selected services in production and the only reason why we haven't enabled >> them by default until now is the problem >> I've just described. The reason why these new libraries uncover the >> described anti-pattern much more often is because >> their compression ratio is slightly different from that of the default zlib >> library. This can easily trigger a >> `ZipException` even if an application is not using a different compression >> levels but just a zip file created with >> another zlib version. I'd therefore like to propose the following >> workaround for the wrong >> `ZipOutputStream.putNextEntry()` usage in user code: >> - ignore the compressed size if it was implicitly determined from the zip >> file and not explicitly set by calling >> `ZipEntry.setCompressedSize()`. >> >> - Change the API-documentation of `ZipOutputStream.putNextEntry()` and >> `JarOutputStream.putNextEntry()` to explain the >> problem and why `putNextEntry()` will ignore the compressed size of a >> `ZipEntry` if that was set implicitely when >> reading that entry from a `ZipFile` or `ZipInputStream`. >> >> >> ### Technical Details >> >> A zip file consists of a stream of File Entries followed by a Central >> Directory (see [here for a more detailed >> specification][7]). Each File Entry is composed of a Local File Header (LFH) >> followed by the compressed Data and an >> optional Data Descriptor. The LFH contains the File Name and among other >> attributes the Compressed and Uncompressed >> size and CRC of the Data. In the case where the latter three attributes are >> not available at the time when the LFH is >> created, this fact will be recorded in a flag of the LFH and will trigger >> the creation of a Data Descriptor with the >> corresponding information right after the Data section. Finally, the Central >> Directory contains one Central Directory >> File Header (CDFH) for each entry of the zip archive. The CDFH is an >> extended version of the LFH and the ultimate >> reference for the contents of the zip archive. The redundancy between LFH >> and CDFH is a tribute to zip's long history >> when it was used to store archives on multiple floppy discs and the CDFH >> allowed to update the archive by only writing >> to the last disc which contained the Central Directory. `ZipEntries` read >> with `ZipInputStream.getNextEntry()` will >> initially only contain the information from the LFH. Only after the next >> entry was read (or after >> `ZipInputStream.closeEntry()` was called explicitly), will the previously >> read entry be updated with the data from the >> Data Descriptor. `ZipInputStream` doesn't inspect the Central Directory at >> all. On the other hand, `ZipFile` only >> queries the Central Directory for `ZipEntry` information so all `ZipEntries` >> returned by `ZipFile.entries()`, >> `ZipFile.getEntry()` and `ZipFile.stream()` will always instantly contain >> the full Compressed and Uncompressed Size and >> CRC information for each entry independently of the LFH contents. ### Risks >> and Assumptions If we choose to ignore >> the implicitly recorded compressed size in a `ZipEntry` read from a zip file >> when writing it to a `ZipOutputStream`, >> this will lead to zip files with incomplete information in the LFH and an >> additional Data Descriptor as described >> before. However, the result is still fully compatible to the zip file >> specification. It's also not unusual, because by >> default all new zip files created with `ZipOutputStream` will contain LFHs >> without Compressed and Uncompressed Size and >> CRC information and an additional Data Descriptor. Theoretically it is >> possible to create new zip files with >> `ZipOutputStream` class and Compressed and Uncompressed Size and CRC >> information in the LFH but that's complex and >> inefficient because it requires two steps. A first step to determine the crc >> and compressed size of the data and a >> second step to actually write the data to the `ZipOutputStream` (which will >> compress it a second time). This is because >> the current API offers no possibility to write already compressed data to a >> `ZipOutputStream`. Consequently, the only >> straight-forward way of creating zip files from Java which have all the data >> in the LFH and no Data Descriptor is by >> copying `ZipEntries` from an existing zip file with the buggy method >> described before. This incidentally worked more or >> less reliable for a long time but breaks miserably when using different zlib >> implementations. Ignoring the implicitly >> set compressed size of `ZipEntries` can easily fix this problem. I'm not >> aware of any tool which can not handle such >> files and if it exists it would have problems with the majority of Java >> created zip files anyway (e.g. all jar-files >> created with the jar tool have zip entries with incomplete LFH data and >> additional Data Descriptor). Ignoring the >> implicitly set compressed size of `ZipEntries` has no measurable performance >> impact and will increase the size of zip >> archives which used to have the complete file information in the LFH before >> by 16 bytes per entry. On the other hand it >> will give us the freedom to use whatever zip implementation we like :) [1]: >> https://github.com/gradle/gradle/blob/e76905e3a/subprojects/build-init/src/main/java/org/gradle/api/tasks/wrapper/Wrapper.java#L152-L158 >> [2]: >> https://android.googlesource.com/platform/tools/base/+/refs/heads/master/build-system/builder/src/main/java/com/android/builder/testing/MockableJarGenerator.java#86 >> [3]: https://bugs.openjdk.java.net/browse/JDK-8240333 [4]: >> https://bugs.openjdk.java.net/browse/JDK-8240235 [5]: >> https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/ZipFile/CopyJar.java >> [6]: >> https://github.com/simonis/zlib-bench/blob/master/Results.md [7]: >> https://en.wikipedia.org/wiki/Zip_(file_format) > > Volker Simonis has refreshed the contents of this pull request, and previous > commits have been removed. The incremental > views will show differences compared to the previous content of the PR. test/jdk/java/util/zip/CopyZipFile.java line 140: > 138: // This means that all ZipEntry objects returned from ZipFile > will have correct > 139: // settings for these fields. > 140: // I the compression level was different in the initial zip file > (which we can't find typo I -> If ------------- PR: https://git.openjdk.java.net/jdk/pull/520