On Tue, 6 Oct 2020 10:02:09 GMT, Volker Simonis <[email protected]> wrote:
> ### Summary
>
> Work around wrong usage of `ZipOutputStream.putNextEntry()` in user code
> which can lead to the `ZipException "invalid
> entry compressed size"`.
> ### Motivation
>
> In general it is not safe to directly write a ZipEntry obtained from
> `ZipInputStream.getNextEntry()`,
> `ZipFile.entries()`, `ZipFile.getEntry()` or `ZipFile.stream()` with
> `ZipOutputStream.putNextEntry()` to a
> `ZipOutputStream` and then read the entries data from the `ZipInputStream`
> and write it to the `ZipOutputStream` as
> follows:
> ZipEntry entry;
> ZipInputStream zis = new ZipInputStream(...);
> ZipOutputStream zos = new ZipOutputStream(...);
> while((entry = zis.getNextEntry()) != null) {
> zos.putNextEntry(entry);
> zis.transferTo(zos);
> }
> The problem with this code is that the zip file format does not record the
> compression level used for deflation in its
> entries. In general, it doesn't even mandate a predefined compression ratio
> per compression level. Therefore the
> compressed size recorded in a `ZipEntry` read from a zip file might differ
> from the new compressed size produced by the
> receiving `ZipOutputStream`. Such a difference will result in a
> `ZipException` with the following message:
> java.util.zip.ZipException: invalid entry compressed size (expected 12 but
> got 7 bytes)
>
> The correct way of copying all entries from one zip file into another
> requires the creation of a new `ZipEntry` or at
> least resetting of the compressed size field. E.g.:
> while((entry = zis.getNextEntry()) != null) {
> ZipEntry newEntry = new ZipEntry(entry.getName());
> zos.putNextEntry(newEntry);
> zis.transferTo(zos);
> }
> or:
> while((entry = zis.getNextEntry()) != null) {
> entry.setCompressedSize(-1);
> zos.putNextEntry(entry);
> zis.transferTo(zos);
> }
> Unfortunately, there's a lot of user code out there which gets this wrong and
> uses the bad coding pattern described
> before. Searching for `"java.util.zip.ZipException: invalid entry compressed
> size (expected 12 but got 7 bytes)"` gives
> ~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of
> instances of this anti-pattern on GitHub when
> doing a code search for `ZipEntry` and `putNextEntry()`. E.g. [Gradle 4.x
> wrapper task][1] is affected as well as the
> latest version of the [mockableAndroidJar task][2]. I've recently fixed two
> occurrences of this pattern in OpenJDK (see
> [JDK-8240333][3] and [JDK-8240235][4]) but there still exist more of them
> (e.g.
> [`test/jdk/java/util/zip/ZipFile/CopyJar.java`][5] which is there since 1999
> :). ### Description So while this has
> clearly been a problem before, it apparently wasn't painful enough to trigger
> any action from the side of the JDK.
> However, recently quite some zlib forks with [superior deflate/inflate
> performance have evolved][6]. Using them with
> OpenJDK is quite straight-forward: one just has to configure the alternative
> implementations by setting
> `LD_LIBRARY_PATH` or `LD_PRELOAD` correspondingly. We've seen big saving by
> using these new zlib implementations for
> selected services in production and the only reason why we haven't enabled
> them by default until now is the problem
> I've just described. The reason why these new libraries uncover the described
> anti-pattern much more often is because
> their compression ratio is slightly different from that of the default zlib
> library. This can easily trigger a
> `ZipException` even if an application is not using a different compression
> levels but just a zip file created with
> another zlib version. I'd therefore like to propose the following workaround
> for the wrong
> `ZipOutputStream.putNextEntry()` usage in user code:
> - ignore the compressed size if it was implicitly determined from the zip
> file and not explicitly set by calling
> `ZipEntry.setCompressedSize()`.
>
> - Change the API-documentation of `ZipOutputStream.putNextEntry()` and
> `JarOutputStream.putNextEntry()` to explain the
> problem and why `putNextEntry()` will ignore the compressed size of a
> `ZipEntry` if that was set implicitely when
> reading that entry from a `ZipFile` or `ZipInputStream`.
>
>
> ### Technical Details
>
> A zip file consists of a stream of File Entries followed by a Central
> Directory (see [here for a more detailed
> specification][7]). Each File Entry is composed of a Local File Header (LFH)
> followed by the compressed Data and an
> optional Data Descriptor. The LFH contains the File Name and among other
> attributes the Compressed and Uncompressed
> size and CRC of the Data. In the case where the latter three attributes are
> not available at the time when the LFH is
> created, this fact will be recorded in a flag of the LFH and will trigger the
> creation of a Data Descriptor with the
> corresponding information right after the Data section. Finally, the Central
> Directory contains one Central Directory
> File Header (CDFH) for each entry of the zip archive. The CDFH is an
> extended version of the LFH and the ultimate
> reference for the contents of the zip archive. The redundancy between LFH and
> CDFH is a tribute to zip's long history
> when it was used to store archives on multiple floppy discs and the CDFH
> allowed to update the archive by only writing
> to the last disc which contained the Central Directory. `ZipEntries` read
> with `ZipInputStream.getNextEntry()` will
> initially only contain the information from the LFH. Only after the next
> entry was read (or after
> `ZipInputStream.closeEntry()` was called explicitly), will the previously
> read entry be updated with the data from the
> Data Descriptor. `ZipInputStream` doesn't inspect the Central Directory at
> all. On the other hand, `ZipFile` only
> queries the Central Directory for `ZipEntry` information so all `ZipEntries`
> returned by `ZipFile.entries()`,
> `ZipFile.getEntry()` and `ZipFile.stream()` will always instantly contain the
> full Compressed and Uncompressed Size and
> CRC information for each entry independently of the LFH contents. ### Risks
> and Assumptions If we choose to ignore
> the implicitly recorded compressed size in a `ZipEntry` read from a zip file
> when writing it to a `ZipOutputStream`,
> this will lead to zip files with incomplete information in the LFH and an
> additional Data Descriptor as described
> before. However, the result is still fully compatible to the zip file
> specification. It's also not unusual, because by
> default all new zip files created with `ZipOutputStream` will contain LFHs
> without Compressed and Uncompressed Size and
> CRC information and an additional Data Descriptor. Theoretically it is
> possible to create new zip files with
> `ZipOutputStream` class and Compressed and Uncompressed Size and CRC
> information in the LFH but that's complex and
> inefficient because it requires two steps. A first step to determine the crc
> and compressed size of the data and a
> second step to actually write the data to the `ZipOutputStream` (which will
> compress it a second time). This is because
> the current API offers no possibility to write already compressed data to a
> `ZipOutputStream`. Consequently, the only
> straight-forward way of creating zip files from Java which have all the data
> in the LFH and no Data Descriptor is by
> copying `ZipEntries` from an existing zip file with the buggy method
> described before. This incidentally worked more or
> less reliable for a long time but breaks miserably when using different zlib
> implementations. Ignoring the implicitly
> set compressed size of `ZipEntries` can easily fix this problem. I'm not
> aware of any tool which can not handle such
> files and if it exists it would have problems with the majority of Java
> created zip files anyway (e.g. all jar-files
> created with the jar tool have zip entries with incomplete LFH data and
> additional Data Descriptor). Ignoring the
> implicitly set compressed size of `ZipEntries` has no measurable performance
> impact and will increase the size of zip
> archives which used to have the complete file information in the LFH before
> by 16 bytes per entry. On the other hand it
> will give us the freedom to use whatever zip implementation we like :) [1]:
> https://github.com/gradle/gradle/blob/e76905e3a/subprojects/build-init/src/main/java/org/gradle/api/tasks/wrapper/Wrapper.java#L152-L158
> [2]:
> https://android.googlesource.com/platform/tools/base/+/refs/heads/master/build-system/builder/src/main/java/com/android/builder/testing/MockableJarGenerator.java#86
> [3]: https://bugs.openjdk.java.net/browse/JDK-8240333 [4]:
> https://bugs.openjdk.java.net/browse/JDK-8240235 [5]:
> https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/ZipFile/CopyJar.java
> [6]:
> https://github.com/simonis/zlib-bench/blob/master/Results.md [7]:
> https://en.wikipedia.org/wiki/Zip_(file_format)
This pull request has now been integrated.
Changeset: 60159cff
Author: Volker Simonis <[email protected]>
URL: https://git.openjdk.java.net/jdk/commit/60159cff
Stats: 231 lines in 4 files changed: 223 ins; 0 del; 8 mod
8253952: Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's
compressed size
Reviewed-by: lancea, alanb
-------------
PR: https://git.openjdk.java.net/jdk/pull/520