Bug#647522: non-deterministic compression results with gzip -n9
Cyril, thanks for the test case. When I used 'valgrind' on it I found where gzip is accessing uninitialized data. I pushed into gzip master the patch at the end of this message; it fixed things for me. The Debian patch, which zeros out a lot of buffers, should work if gzip is compressing regular files, but may have problems in unusual cases if gzip compresses data from pipes, devices, or other non-regular files, because in that case short reads may later cause garbage to be put into the dictionary. So I suggest using the following patch instead. http://git.savannah.gnu.org/cgit/gzip.git/commit/?id=0a284baeaedca68017f46d2646e4c921aa98a90d From b9de47462b1b487cf4024b4c157ee5ac6c5849c3 Mon Sep 17 00:00:00 2001 From: Paul Eggert egg...@cs.ucla.edu Date: Sun, 18 Mar 2012 11:07:02 -0700 Subject: [PATCH] gzip: fix nondeterministic compression results Reported by Jakub Wilk in http://bugs.debian.org/647522. * deflate.c (fill_window): Don't let garbage pollute the dictionary. --- deflate.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/deflate.c b/deflate.c index 6c19552..5405f10 100644 --- a/deflate.c +++ b/deflate.c @@ -571,6 +571,8 @@ local void fill_window() n = read_buf((char*)window+strstart+lookahead, more); if (n == 0 || n == (unsigned)EOF) { eofile = 1; +/* Don't let garbage pollute the dictionary. */ +memzero (window + strstart + lookahead, MIN_MATCH - 1); } else { lookahead += n; } -- 1.7.6.5 -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#647522: non-deterministic compression results with gzip -n9
Paul Eggert egg...@cs.ucla.edu (08/02/2012): Thanks very much for the patch. But can someone who's looked into it please explain why 'window' needs to be zeroed out? This will save me time in reviewing the patch. Thanks. Welcome. Here's a slightly more detailed analysis. You can find attached the sample files I used (still from the ppl source package). I managed to track it down to deflate() (what a surprise!), so I decided to check the matches. Printing match_length only led me to a single difference, while printing both match_length and match_start (which according to a comment is set by longest_match()), I got two differences. To produce the below-quoted diff, using the attached patch, I did: | tar xfz ppl.tar.gz | gzip -9f README CREDITS match-2 | tar xfz ppl.tar.gz | gzip -9f CREDITS match-1 | diff -u match-2 match-1 Of course there's an extra file showing up in the diff, but skipping it, here's the diff: | match: 46 @ 19459 | match: 8 @ 15659 | match: 8 @ 15659 | -match: 7 @ 17193 | -match: 6 @ 17193 | +match: 6 @ 19510 | +match: 6 @ 19510 | [ no more matches ] Interestingly, the size of the CREDITS file is 19579. Mraw, KiBi. From cbe288044df0e149e6da3744f89eb49ce32e6c2a Mon Sep 17 00:00:00 2001 From: Cyril Brulebois k...@debian.org Date: Mon, 13 Feb 2012 01:31:36 +0100 Subject: [PATCH] Pin-point dirty window bug. --- deflate.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/deflate.c b/deflate.c index 1b3ac52..606a2e5 100644 --- a/deflate.c +++ b/deflate.c @@ -757,6 +757,7 @@ off_t deflate() */ match_length = longest_match (hash_head); /* longest_match() sets match_start */ +printf(match: %u @ %u\n, match_length, match_start); if (match_length lookahead) match_length = lookahead; /* Ignore a length 3 match if it is too distant: */ -- 1.7.9 ppl.tar.gz Description: Binary data signature.asc Description: Digital signature
Bug#647522: non-deterministic compression results with gzip -n9
Zack Weinberg za...@panix.com (07/02/2012): I've seen inexplicable nondeterminism like this before, and quite often it's turned out to be controlled by the total size of the command line argument area (that is, argv + environ + ELF auxv). FWIW, a quick look on kfreebsd-amd64 with ppl's CREDITS file led me to: gzip -9nf CREDITS → 6343 bytes running dh_installdocs dh_compress with DH_VERBOSE, I noticed the following command line: gzip -9nf README CREDITS and the result → 6344 bytes. Playing on amd64: cbrulebois@Cygnus:/tmp/ppl-0.11.2$ cp ../ppl-pristine/{CREDITS,README} . cbrulebois@Cygnus:/tmp/ppl-0.11.2$ gzip -9nf CREDITS README cbrulebois@Cygnus:/tmp/ppl-0.11.2$ ls -l *gz -rw-r--r-- 1 cbrulebois cbrulebois 6343 Feb 8 12:34 CREDITS.gz -rw-r--r-- 1 cbrulebois cbrulebois 8745 Feb 8 12:34 README.gz cbrulebois@Cygnus:/tmp/ppl-0.11.2$ cp ../ppl-pristine/{CREDITS,README} . cbrulebois@Cygnus:/tmp/ppl-0.11.2$ gzip -9nf README CREDITS cbrulebois@Cygnus:/tmp/ppl-0.11.2$ ls -l *gz -rw-r--r-- 1 cbrulebois cbrulebois 6344 Feb 8 12:34 CREDITS.gz -rw-r--r-- 1 cbrulebois cbrulebois 8745 Feb 8 12:34 README.gz It looks to me like it shouldn't be hard to figure out what happens here given the few tests I performed with the above command lines. On a few iterations, reproducibility (with a given input command line) doesn't seem to be an issue. Mraw, KiBi. signature.asc Description: Digital signature
Bug#647522: non-deterministic compression results with gzip -n9
Cyril Brulebois k...@debian.org (08/02/2012): Playing on amd64: cbrulebois@Cygnus:/tmp/ppl-0.11.2$ cp ../ppl-pristine/{CREDITS,README} . cbrulebois@Cygnus:/tmp/ppl-0.11.2$ gzip -9nf CREDITS README cbrulebois@Cygnus:/tmp/ppl-0.11.2$ ls -l *gz -rw-r--r-- 1 cbrulebois cbrulebois 6343 Feb 8 12:34 CREDITS.gz -rw-r--r-- 1 cbrulebois cbrulebois 8745 Feb 8 12:34 README.gz cbrulebois@Cygnus:/tmp/ppl-0.11.2$ cp ../ppl-pristine/{CREDITS,README} . cbrulebois@Cygnus:/tmp/ppl-0.11.2$ gzip -9nf README CREDITS cbrulebois@Cygnus:/tmp/ppl-0.11.2$ ls -l *gz -rw-r--r-- 1 cbrulebois cbrulebois 6344 Feb 8 12:34 CREDITS.gz -rw-r--r-- 1 cbrulebois cbrulebois 8745 Feb 8 12:34 README.gz It looks to me like it shouldn't be hard to figure out what happens here given the few tests I performed with the above command lines. On a few iterations, reproducibility (with a given input command line) doesn't seem to be an issue. I think at least the attached patch won't hurt (when the DYN_ALLOC part is fixed; and possibly turning that into a MEMSET-like macro). And given dh_compress is passing files in an arbitrary order (it's using “find” to detect files which needs to be compressed), I think we have an explanation about the apparently-hard-to-reproduce issues. Mraw, KiBi. diff --git a/gzip.c b/gzip.c index b867350..1153bde 100644 --- a/gzip.c +++ b/gzip.c @@ -561,6 +561,19 @@ int main (int argc, char **argv) SET_BINARY_MODE(fileno(stdout)); } while (optind argc) { + + /* Make sure buffers are reset to 0 to ensure reproducibility when handling several files */ + ZEROIFY(uch, inbuf, INBUFSIZ +INBUF_EXTRA); + ZEROIFY(uch, outbuf, OUTBUFSIZ+OUTBUF_EXTRA); + ZEROIFY(ush, d_buf, DIST_BUFSIZE); + ZEROIFY(uch, window, 2L*WSIZE); +#ifndef MAXSEG_64K + ZEROIFY(ush, tab_prefix, 1LBITS); +#else + ZEROIFY(ush, tab_prefix0, 1L(BITS-1)); + ZEROIFY(ush, tab_prefix1, 1L(BITS-1)); +#endif + treat_file(argv[optind++]); } } else { /* Standard input */ diff --git a/gzip.h b/gzip.h index 5270c56..7a1e84b 100644 --- a/gzip.h +++ b/gzip.h @@ -119,11 +119,13 @@ extern int method; /* compression method */ array = (type*)fcalloc((size_t)(((size)+1L)/2), 2*sizeof(type)); \ if (!array) xalloc_die (); \ } +# error ZEROIFY needs an implementation, KiBi is lazy # define FREE(array) {if (array != NULL) fcfree(array), array=NULL;} #else # define EXTERN(type, array) extern type array[] # define DECLARE(type, array, size) type array[size] # define ALLOC(type, array, size) +# define ZEROIFY(type, array, size) { for (int i=0; isize; i++) { array[i] = 0; } } # define FREE(array) #endif signature.asc Description: Digital signature
Bug#647522: non-deterministic compression results with gzip -n9
I think at least the attached patch won't hurt (when the DYN_ALLOC part is fixed; and possibly turning that into a MEMSET-like macro). Just an idea, but couldn't ZEROIFY in the DYN_ALLOC part be defined as free() and subsequent calloc() of the arrays preserving their size? - Fabian -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#647522: non-deterministic compression results with gzip -n9
Thanks Cyril for tracking this down. On Wed, Feb 08, 2012 at 01:15:00PM +0100, Cyril Brulebois wrote: I think at least the attached patch won't hurt (when the DYN_ALLOC part is fixed; and possibly turning that into a MEMSET-like macro). And given dh_compress is passing files in an arbitrary order (it's using “find” to detect files which needs to be compressed), I think we have an explanation about the apparently-hard-to-reproduce issues. After some quick testing, ZEROIFY on th window is enough. However, clearing all buffers is a good defensive strategy I think. Mraw, KiBi. diff --git a/gzip.c b/gzip.c index b867350..1153bde 100644 --- a/gzip.c +++ b/gzip.c @@ -561,6 +561,19 @@ int main (int argc, char **argv) SET_BINARY_MODE(fileno(stdout)); } while (optind argc) { + + /* Make sure buffers are reset to 0 to ensure reproducibility when handling several files */ + ZEROIFY(uch, inbuf, INBUFSIZ +INBUF_EXTRA); + ZEROIFY(uch, outbuf, OUTBUFSIZ+OUTBUF_EXTRA); + ZEROIFY(ush, d_buf, DIST_BUFSIZE); + ZEROIFY(uch, window, 2L*WSIZE); +#ifndef MAXSEG_64K + ZEROIFY(ush, tab_prefix, 1LBITS); +#else + ZEROIFY(ush, tab_prefix0, 1L(BITS-1)); + ZEROIFY(ush, tab_prefix1, 1L(BITS-1)); +#endif + treat_file(argv[optind++]); } } else { /* Standard input */ diff --git a/gzip.h b/gzip.h index 5270c56..7a1e84b 100644 --- a/gzip.h +++ b/gzip.h @@ -119,11 +119,13 @@ extern int method; /* compression method */ array = (type*)fcalloc((size_t)(((size)+1L)/2), 2*sizeof(type)); \ if (!array) xalloc_die (); \ } +# error ZEROIFY needs an implementation, KiBi is lazy # define FREE(array) {if (array != NULL) fcfree(array), array=NULL;} #else # define EXTERN(type, array) extern type array[] # define DECLARE(type, array, size) type array[size] # define ALLOC(type, array, size) +# define ZEROIFY(type, array, size) { for (int i=0; isize; i++) { array[i] = 0; } } # define FREE(array) #endif -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#647522: non-deterministic compression results with gzip -n9
Thanks very much for the patch. But can someone who's looked into it please explain why 'window' needs to be zeroed out? This will save me time in reviewing the patch. Thanks. -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#647522: non-deterministic compression results with gzip -n9
Hello there, using a small tool I wrote a while back, it's possible to see that the actuall differences are in the encoding of the last few bytes of cleartext. I've copied the decoding tool to http://youam.net/devel/rfc1952-dec Attached is the difference of the decoded files from 20111208102738.ga7...@afflict.kos.to. As to why this happens: no idea (yet) Uli --- NEWS.amd64.dec 2012-02-07 12:19:19.649431283 +0100 +++ NEWS.armel.dec 2012-02-07 12:19:26.853554306 +0100 @@ -3689,13 +3689,12 @@ - lookback len 3 distance 3386 - literal code 65 ('e') - literal code 72 ('r') - - lookback len 5 distance 6517 + - lookback len 5 distance 10888 - eos skip: - 0 - 0 - 0 -- 0 crc: f68d6c5a isize: 13696 --- '' --- ChangeLog.pre-2-2.amd64.dec 2012-02-07 12:17:54.787982067 +0100 +++ ChangeLog.pre-2-2.armel.dec 2012-02-07 12:18:02.960121626 +0100 @@ -11052,9 +11052,7 @@ - literal code 63 ('c') - lookback len 12 distance 1661 - lookback len 5 distance 3600 - - literal code 2e ('.') - - literal code 0a - - literal code 0a + - lookback len 3 distance 162 - eos skip: - 0
Bug#647522: non-deterministic compression results with gzip -n9
I've seen inexplicable nondeterminism like this before, and quite often it's turned out to be controlled by the total size of the command line argument area (that is, argv + environ + ELF auxv). Changes in how big that is change the initial stack pointer address, and while that *shouldn't* matter to anything, sometimes it does. A shell loop of the form export X= i=0 while [ i -lt 8192 ]; do perform test X=${X}x i=$((i+1)) done should catch it if this is the cause. -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#647522: non-deterministic compression results with gzip -n9
On 12/08/11 02:27, Riku Voipio wrote: According to gzip RFC, the last 4 bytes are ISIZE, which should be uncompressed input size. Which leaves me rather baffled how that can differ on same input files - and how gunzip is completly happy with both versions of compressed file, producing the same output. I looked only at NEWS.amd64.gz and NEWS.armel.gz. For those two files your diagnosis does not seem to be right, as these two files do not differ in the last 4 bytes, but in bytes before then: $ od -tx1 NEWS.amd64.gz NEWS.amd64.gz.od $ od -tx1 NEWS.armel.gz NEWS.armel.gz.od $ diff -u NEWS.*.gz.od --- NEWS.amd64.gz.od2011-12-09 12:03:41.090594754 -0800 +++ NEWS.armel.gz.od2011-12-09 12:03:57.298663371 -0800 @@ -317,6 +317,6 @@ 0011700 fa 9f da 9b 92 ad 57 44 19 45 c5 42 e5 b6 d9 c2 0011720 7e 80 02 bd 58 94 33 74 ba 0a 62 24 52 7b 35 33 0011740 b2 87 51 76 b7 af cc 7f 09 b0 2d 14 d6 8d f9 4d -0011760 94 51 39 49 cd 87 2e ff 0b 5a 6c 8d f6 80 35 00 +0011760 94 51 39 49 cd f7 50 ff 17 5a 6c 8d f6 80 35 00 0012000 00 0012001 So the differences are not in ISIZE. Can you track down what's actually differing and why? (That would save some time for me when debugging) Thanks. -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#647522: non-deterministic compression results with gzip -n9
I should add that it's OK (from the point of view of the RFCs) if gzip produces different outputs given the same inputs when compressing. The RFCs allow that and presumably other gzip implementations do that. All that's required is that compress+decompress result in a copy of the original. That being said, it's nicer if gzip is deterministic and it'd be helpful to to get to the bottom of this and make the nondeterminism go away in future versions of gzip. -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org