Greetings,

## Problem
I have been working with Common Crawl <http://commoncrawl.org/the-data/get-started/> archives, which are files consisting of 50,000 to 55,000 small GZIP'd files concatenated together and made available through Amazon S3. When these files are decompressed on the fly using GZIPInputStream, decompression consistently ends prematurely, as described in bug JDK-8081450 <https://bugs.openjdk.java.net/browse/JDK-8081450>. Details on what I was trying to do, along with some demonstration code, can be found on Stack Overflow <http://stackoverflow.com/questions/41400810/gzipinputstream-closes-prematurely-when-decompressing-httpinputstream>.

## Root Cause

The root cause seems to be line 231 of GZIPInputStream.java <http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/0e4fc29a5ce4/src/share/classes/java/util/zip/GZIPInputStream.java>, in the method readTrailer():

    if (this.in.available() > 0 || n > 26) {

where this.in is the input stream passed to the constructor. readTrailer() calls available() to determine whether there are bytes left in the input stream that should be checked for an additional GZIP file. Some input streams, like ByteArrayInputStream, which is used in the jtreg test GZIPInputStreamRead.java <http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/0e4fc29a5ce4/test/java/util/zip/GZIP/GZIPInputStreamRead.java>, return the number of bytes available to read. However, input streams like SequenceInputStream or HTTPInputStream will return 0 if a blocking IO call is needed to refill the buffer. For these types of input streams, when the last thing in the read buffer is the end of a GZIP file, available() returns 0 and GZIPInputStream ends prematurely instead of continuing, even if the stream is not closed.

## Proposed Solution

After a brief review of InputStream subclasses, it seems impossible in general to check whether more bytes are available in an input stream without trying to read the stream.
So instead of checking available(), just assume that there is another GZIP file in the stream and call readHeader(). If the input stream is in fact empty, readHeader() properly handles the end of the stream.

## Patch and jtreg test

The attached file *GZIPInputStream.java.patch.txt* implements the proposed solution. The patch was applied against the jdk8u jdk repository, changeset 12039:e5b1823a897e (Dec 20).

The attached file *GZIPHttpStreamRead.java* is a jtreg test that is a modified copy of the existing GZIPInputStreamRead.java test. In this test the ByteArrayInputStream is wrapped in a very poor mock of an HTTPInputStream. The mock always returns 0 when available() is called, which eventually causes the test to fail when reading the randomly generated concatenated GZIP files.

The attached file *GZIPBufferEndsAtBoundry.java* is a jtreg test using an input stream that intentionally returns only the first complete GZIP file in a concatenated GZIP stream, after which available() returns zero. It will not return any content from the next GZIP file until read() is called again. This is my attempt to explicitly reproduce the root cause.

Before the patch is applied, the two new tests fail. After the patch is applied, the two new tests succeed and the other four tests still succeed.

The patch also addresses the demonstration code submitted in the bug report for JDK-8081450 <https://bugs.openjdk.java.net/browse/JDK-8081450>. After the patch is applied, the demonstration code returns "hello world" as expected.

## Work Around

This problem is not too difficult to work around. I created a subclass of InputStream that wrapped the HTTPInputStream but overrode available() so that it always returns a number > 0. With that wrapper in place, GZIPInputStream always read to the end of the stream correctly. Example work around code is in my answer on Stack Overflow <http://stackoverflow.com/questions/41400810/gzipinputstream-closes-prematurely-when-decompressing-httpinputstream>.

Thank you,
Jeff Harwell
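
For readers who want to try the work around without following the link, here is a minimal sketch of the idea. It is not the code from the Stack Overflow answer. The class names AvailableWorkaround, AlwaysAvailableInputStream and ChunkedInputStream are hypothetical, and ChunkedInputStream is only an assumed stand-in for a network stream: it never returns bytes past the end of the current GZIP member in one read, and its available() always returns 0.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class AvailableWorkaround {

    /** Hypothetical wrapper: always reports at least one available byte. */
    static final class AlwaysAvailableInputStream extends FilterInputStream {
        AlwaysAvailableInputStream(InputStream in) { super(in); }

        @Override
        public int available() throws IOException {
            return Math.max(1, super.available());
        }
    }

    /** Assumed stand-in for a network stream: one chunk per read, available() is 0. */
    static final class ChunkedInputStream extends InputStream {
        private final byte[][] chunks;
        private int chunk = 0;
        private int pos = 0;

        ChunkedInputStream(byte[]... chunks) { this.chunks = chunks; }

        // Skip exhausted chunks; returns false once all data has been consumed.
        private boolean advance() {
            while (chunk < chunks.length && pos == chunks[chunk].length) {
                chunk++;
                pos = 0;
            }
            return chunk < chunks.length;
        }

        @Override
        public int read() {
            return advance() ? (chunks[chunk][pos++] & 0xff) : -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            if (len == 0) return 0;
            if (!advance()) return -1;
            // Never hand out bytes beyond the current chunk in a single call.
            int n = Math.min(len, chunks[chunk].length - pos);
            System.arraycopy(chunks[chunk], pos, b, off, n);
            pos += n;
            return n;
        }

        @Override
        public int available() { return 0; }
    }

    /** Compresses text into one complete GZIP member. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    /** Decompresses everything GZIPInputStream is willing to return. */
    static String readAll(InputStream raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(raw)) {
            byte[] buf = new byte[256];
            for (int n; (n = gz.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] first = gzip("hello ");
        byte[] second = gzip("world");
        // Unwrapped: an affected JDK is expected to stop after the first member.
        System.out.println("[" + readAll(new ChunkedInputStream(first, second)) + "]");
        // Wrapped: both members are read.
        System.out.println("[" + readAll(
                new AlwaysAvailableInputStream(new ChunkedInputStream(first, second))) + "]");
    }
}
```

The wrapper is safe at the true end of the stream because the header read that follows simply hits end of file, which readTrailer() already treats as the end of the data.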
# HG changeset patch
# User Jeff Harwell <[email protected]>
# Date 1483565634 28800
#      Wed Jan 04 13:33:54 2017 -0800
# Node ID 97437f51e996766c89de2e0a8662f58e3c0d45cd
# Parent  a0c6f393b603a858ea970e3cd1d256ffd3be6789
Patched GZIPInputStream to pass new Test Suite

diff --git a/src/share/classes/java/util/zip/GZIPInputStream.java b/src/share/classes/java/util/zip/GZIPInputStream.java
--- a/src/share/classes/java/util/zip/GZIPInputStream.java
+++ b/src/share/classes/java/util/zip/GZIPInputStream.java
@@ -224,23 +224,28 @@
             (readUInt(in) != (inf.getBytesWritten() & 0xffffffffL)))
             throw new ZipException("Corrupt GZIP trailer");
 
-        // If there are more bytes available in "in" or
-        // the leftover in the "inf" is > 26 bytes:
-        // this.trailer(8) + next.header.min(10) + next.trailer(8)
-        // try concatenated case
-        if (this.in.available() > 0 || n > 26) {
-            int m = 8;                  // this.trailer
-            try {
-                m += readHeader(in);    // next.header
-            } catch (IOException ze) {
-                return true;  // ignore any malformed, do nothing
-            }
-            inf.reset();
-            if (n > m)
-                inf.setInput(buf, len - n + m, n - m);
-            return false;
+        // We cannot know if the input stream has remaining bytes so
+        // assume that it does and try to read the next header. If there isn't
+        // one we will get a EOFException (see this.readUByte). Catch
+        // this exception and return true (eos reached)
+        int m = 8;                  // this.trailer
+        try {
+            m += readHeader(in);    // next.header
+        } catch (EOFException ze) {
+            // We are at the end of the stream
+            return true;
+        } catch (IOException ze) {
+            // We were not at the end of the stream but what was
+            // left in the stream didn't make any sense, ignore it
+            // we already have what we came for
+            return true;
         }
-        return true;
+        // We found a valid header for an additional GZIP file
+        inf.reset();
+        if (n > m) {
+            inf.setInput(buf, len - n + m, n - m);
+        }
+        return false; // there is more GZIP data, keep going (concatenated gzip data set)
     }
 
     /*
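
The patch replaces the available() check with an attempted read. As a small illustration of that principle outside GZIPInputStream, the sketch below probes for end of stream by reading one byte and pushing it back. The class ReadProbe and the helper hasMore() are hypothetical names used only for this example, and PushbackInputStream is just one way to undo the probe; the patch itself does not use it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class ReadProbe {

    /**
     * Returns true if at least one more byte can be read. Unlike available(),
     * this may block until data arrives or the stream ends.
     */
    static boolean hasMore(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return false;   // genuine end of stream
        }
        in.unread(b);       // put the probed byte back for the next reader
        return true;
    }

    public static void main(String[] args) throws IOException {
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(new byte[] {0x1f}));
        System.out.println(hasMore(in)); // true: one byte is pending
        in.read();                       // consume it
        System.out.println(hasMore(in)); // false: stream is exhausted
    }
}
```

The trade-off is that the probe can block, which is the price of getting a definite answer from streams whose available() returns 0 whenever their buffer is empty.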
