On Mon, 29 Jul 2024 14:27:28 GMT, Lance Andersen <lan...@openjdk.org> wrote:

> ... we really need more datapoint to better understand the risks/benefits in 
> order to make an informed decision.

Agreed.

Here's what I've come up with after a bit of research:

First, we shouldn't confuse GZIP with ZIP files. They are only indirectly 
related:
* The [DEFLATE format](https://en.wikipedia.org/wiki/Deflate) is a single file 
compression format
* The [GZIP format](https://en.wikipedia.org/wiki/Gzip) is a wrapper around the 
DEFLATE format.
* Individual entries in a [ZIP 
archive](https://en.wikipedia.org/wiki/ZIP_(file_format)) are typically 
DEFLATE'd, but never GZIP'd

The "overlap" that is relevant here is simply when the file you are storing in 
a ZIP file happens to itself be a GZIP file, e.g. `foo.gz` (so yes, you are 
likely compressing `foo` twice).

Specifically, the failing test `GZIPInZip.java` is doing this:
1. Create a normal GZIP data file `b.gz`
1. Create a normal ZIP file
1. Write `b.gz` _plus one extra byte_ into the ZIP file
1. Close the ZIP file
1. Open the ZIP file
1. Find the `InputStream` corresponding to the entry for `b.gz`
1. Read and decompress it with a `GZIPInputStream`
1. Verify the extra byte was ignored
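The steps above can be sketched in plain JDK code (in memory rather than on disk; the payload and the extra byte are made up for illustration):

```java
import java.io.*;
import java.util.zip.*;

public class GzipInZipDemo {
    public static void main(String[] args) throws IOException {
        byte[] payload = "some payload".getBytes();

        // Step 1: create a normal GZIP data file (in memory here)
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(gz)) {
            out.write(payload);
        }

        // Steps 2-4: write b.gz plus one extra byte as a ZIP entry
        ByteArrayOutputStream zip = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(zip)) {
            out.putNextEntry(new ZipEntry("b.gz"));
            out.write(gz.toByteArray());
            out.write(new byte[] { 0x55 });   // the extra trailing byte
            out.closeEntry();
        }

        // Steps 5-8: reopen, locate the entry, decompress, check the byte was ignored
        try (ZipInputStream zis = new ZipInputStream(
                new ByteArrayInputStream(zip.toByteArray()))) {
            zis.getNextEntry();               // positions the stream at "b.gz"
            byte[] decoded = new GZIPInputStream(zis).readAllBytes();
            System.out.println(new String(decoded));
        }
    }
}
```

With the current (lenient) `GZIPInputStream` behavior, the decompressed payload comes back intact and the trailing byte is silently swallowed — which is exactly what the test asserts.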

So this has nothing to do with the ZIP file format itself; rather, it relates 
to how other software might build ZIP files.

Nor does it directly relate to the GZIP format or how any GZIP decoder (taken 
in isolation) should behave.

So what I don't understand is the motivation for the behavior this test is 
verifying. It looks like, at best, a bug workaround for some unknown other 
piece of broken software.

Here's the commit in which it was added:

commit 71f33254816a17ff741e0119e16db28181d7b43b
Author: Ivan Gerasimov <igera...@openjdk.org>
Date:   Tue Oct 15 21:15:17 2013 +0400

    8023431: Test java/util/zip/GZIP/GZIPInZip.java failed
    
    Properly close PipedStreams. Additional testing for malformed input
    
    Reviewed-by: darcy, sherman


Did you double-check the unviewable issues? What does JDK-8023431 say?

I also took a look at how Apache's commons-compress `GzipCompressorInputStream` 
behaves.

`GzipCompressorInputStream` is the commons-compress version of 
`GZIPInputStream`. From what I can infer, part of its original motivation was 
that pre-JDK7 `GZIPInputStream` didn't support concatenated streams. This class 
provides an optional `decompressConcatenated` boolean constructor parameter 
(default `false`).
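For contrast, the JDK side of that history is easy to demonstrate: since JDK 7, `GZIPInputStream` does decode concatenated members. A minimal in-memory sketch (payloads made up):

```java
import java.io.*;
import java.util.zip.*;

public class ConcatDemo {
    public static void main(String[] args) throws IOException {
        // Build two independent GZIP streams and concatenate their bytes
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (String part : new String[] { "first;", "second" }) {
            try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
                out.write(part.getBytes());
            }
        }

        // Since JDK 7, GZIPInputStream decodes every member of the concatenation
        byte[] decoded = new GZIPInputStream(
                new ByteArrayInputStream(buf.toByteArray())).readAllBytes();
        System.out.println(new String(decoded));
    }
}
```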

So how does `GzipCompressorInputStream` behave...?

* `IOException`s thrown by the underlying input stream are never suppressed; 
this includes `IOException`s thrown while reading "extra garbage" that may 
follow the first (if `decompressConcatenated = false`) or any (if 
`decompressConcatenated = true`) GZIP trailer frame.
* When `decompressConcatenated = false`, after reading the first GZIP trailer 
frame, the underlying input is `reset()` if possible to the stream's boundary. 
If `reset()` is not supported, some indeterminate number of extra bytes will 
have been read (note, this exception is missing from the Javadoc).
* When `decompressConcatenated = true`, the entire input stream is read, and it 
must contain only whole, valid GZIP streams with no trailing garbage.

Summary: `GzipCompressorInputStream` is never "lenient"; it either reads the 
whole stream (if `decompressConcatenated = true`) or else just the first GZIP 
stream and stops as soon as it can thereafter (exactly if `markSupported()`, 
otherwise inexactly).

Side note: the reliance on `mark()`/`reset()` is a nice trick, so to speak: if 
you can provide an input stream that supports it, you get precision in return, 
and this is accomplished without requiring any new API surface.
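A minimal illustration of that trick with plain JDK streams (the data here is a made-up stand-in for a GZIP member followed by trailing bytes): wrapping in `BufferedInputStream` supplies `mark()`/`reset()`, so a decoder can peek past a boundary and then rewind exactly to it.

```java
import java.io.*;

public class MarkResetDemo {
    public static void main(String[] args) throws IOException {
        // A decoder can peek ahead (e.g., checking for another GZIP header),
        // then reset() to leave the stream positioned exactly at the boundary.
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream("GZIPDATA|rest".getBytes()));

        in.mark(16);                     // remember the current position
        byte[] peek = in.readNBytes(8);  // look ahead past the "boundary"
        in.reset();                      // rewind: no bytes are lost

        System.out.println(new String(in.readAllBytes()));
    }
}
```

If the caller's stream doesn't support `mark()`, the decoder simply can't offer this precision — hence the "best effort" fallback described above.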

This brings up the question, how would commons-compress behave in the scenario 
being tested by the `GZIPInZip.java` test?

Answer: Let's assume you replace `new GZIPInputStream(zis)` with `new 
GzipCompressorInputStream(zis)` in that test. Then the test would succeed, 
because:
* By default, `GzipCompressorInputStream` sets `decompressConcatenated = false`.
* The extra byte might get read, but because `decompressConcatenated = false`, 
decompression stops after the last byte of `b.gz`.

Overall the API and behavior design of `GzipCompressorInputStream` seems like a 
good one. I would even recommend that we adopt it, except that it's 
incompatible with the current behavior (we assume `decompressConcatenated = 
true` by default).

So here's a stab at what we might do (to open the debate): basically, we 
replicate the `GzipCompressorInputStream` behavior, with the exception that our 
`allowConcatenated` defaults to `true` to preserve existing behavior. In 
summary:
* We utilize `mark()`/`reset()` to guarantee precise stopping behavior when 
`allowConcatenated = false` - but only if `markSupported()` returns true; 
otherwise, best effort/imprecise stop that may read past the GZIP end.
* We eliminate all "lenient" behavior - e.g., underlying `IOException`s are 
never suppressed, and if concatenation is enabled, non-empty GZIP header frames 
must always be valid and the entire stream is consumed.
* Update `GZIPInZip.java` to replace `new GZIPInputStream(zis)` with `new 
GZIPInputStream(zis, false)` to unbreak it

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2256461082
