Bug#647522: non-deterministic compression results with gzip -n9

2012-03-18 Thread Paul Eggert
Cyril, thanks for the test case.  When I used 'valgrind' on it
I found where gzip is accessing uninitialized data.  I pushed
into gzip master the patch at the end of this message; it fixed
things for me.

The Debian patch, which zeros out a lot of buffers, should
work if gzip is compressing regular files, but may have
problems in unusual cases if gzip compresses data from
pipes, devices, or other non-regular files, because in that
case short reads may later cause garbage to be put into the
dictionary.  So I suggest using the following patch instead.

http://git.savannah.gnu.org/cgit/gzip.git/commit/?id=0a284baeaedca68017f46d2646e4c921aa98a90d

From b9de47462b1b487cf4024b4c157ee5ac6c5849c3 Mon Sep 17 00:00:00 2001
From: Paul Eggert egg...@cs.ucla.edu
Date: Sun, 18 Mar 2012 11:07:02 -0700
Subject: [PATCH] gzip: fix nondeterministic compression results

Reported by Jakub Wilk in http://bugs.debian.org/647522.
* deflate.c (fill_window): Don't let garbage pollute the dictionary.
---
 deflate.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/deflate.c b/deflate.c
index 6c19552..5405f10 100644
--- a/deflate.c
+++ b/deflate.c
@@ -571,6 +571,8 @@ local void fill_window()
 n = read_buf((char*)window+strstart+lookahead, more);
 if (n == 0 || n == (unsigned)EOF) {
 eofile = 1;
+/* Don't let garbage pollute the dictionary.  */
+memzero (window + strstart + lookahead, MIN_MATCH - 1);
 } else {
 lookahead += n;
 }
-- 
1.7.6.5





-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#647522: non-deterministic compression results with gzip -n9

2012-02-12 Thread Cyril Brulebois
Paul Eggert egg...@cs.ucla.edu (08/02/2012):
 Thanks very much for the patch.  But can someone who's looked into it
 please explain why 'window' needs to be zeroed out?  This will save me
 time in reviewing the patch.  Thanks.

Welcome. Here's a slightly more detailed analysis. You can find attached
the sample files I used (still from the ppl source package). I managed
to track it down to deflate() (what a surprise!), so I decided to check
the matches. Printing match_length only led me to a single difference,
while printing both match_length and match_start (which according to a
comment is set by longest_match()), I got two differences.

To produce the below-quoted diff, using the attached patch, I did:
| tar xfz ppl.tar.gz
| gzip -9f README CREDITS  match-2
| tar xfz ppl.tar.gz
| gzip -9f CREDITS  match-1
| diff -u match-2 match-1

Of course there's an extra file showing up in the diff, but skipping it,
here's the diff:
|  match: 46 @ 19459
|  match: 8 @ 15659
|  match: 8 @ 15659
| -match: 7 @ 17193
| -match: 6 @ 17193
| +match: 6 @ 19510
| +match: 6 @ 19510
| [ no more matches ]

Interestingly, the size of the CREDITS file is 19579.

Mraw,
KiBi.
From cbe288044df0e149e6da3744f89eb49ce32e6c2a Mon Sep 17 00:00:00 2001
From: Cyril Brulebois k...@debian.org
Date: Mon, 13 Feb 2012 01:31:36 +0100
Subject: [PATCH] Pin-point dirty window bug.

---
 deflate.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/deflate.c b/deflate.c
index 1b3ac52..606a2e5 100644
--- a/deflate.c
+++ b/deflate.c
@@ -757,6 +757,7 @@ off_t deflate()
  */
 match_length = longest_match (hash_head);
 /* longest_match() sets match_start */
+printf(match: %u @ %u\n, match_length, match_start);
 if (match_length  lookahead) match_length = lookahead;
 
 /* Ignore a length 3 match if it is too distant: */
-- 
1.7.9



ppl.tar.gz
Description: Binary data


signature.asc
Description: Digital signature


Bug#647522: non-deterministic compression results with gzip -n9

2012-02-08 Thread Cyril Brulebois
Zack Weinberg za...@panix.com (07/02/2012):
 I've seen inexplicable nondeterminism like this before, and quite
 often it's turned out to be controlled by the total size of the
 command line argument area (that is, argv + environ + ELF auxv).

FWIW, a quick look on kfreebsd-amd64 with ppl's CREDITS file led me to:
  gzip -9nf CREDITS → 6343 bytes

running dh_installdocs  dh_compress with DH_VERBOSE, I noticed the
following command line:
  gzip -9nf README CREDITS
and the result → 6344 bytes.

Playing on amd64:
cbrulebois@Cygnus:/tmp/ppl-0.11.2$ cp ../ppl-pristine/{CREDITS,README} .
cbrulebois@Cygnus:/tmp/ppl-0.11.2$ gzip -9nf CREDITS README
cbrulebois@Cygnus:/tmp/ppl-0.11.2$ ls -l *gz
-rw-r--r-- 1 cbrulebois cbrulebois 6343 Feb  8 12:34 CREDITS.gz
-rw-r--r-- 1 cbrulebois cbrulebois 8745 Feb  8 12:34 README.gz
cbrulebois@Cygnus:/tmp/ppl-0.11.2$ cp ../ppl-pristine/{CREDITS,README} .
cbrulebois@Cygnus:/tmp/ppl-0.11.2$ gzip -9nf README CREDITS
cbrulebois@Cygnus:/tmp/ppl-0.11.2$ ls -l *gz
-rw-r--r-- 1 cbrulebois cbrulebois 6344 Feb  8 12:34 CREDITS.gz
-rw-r--r-- 1 cbrulebois cbrulebois 8745 Feb  8 12:34 README.gz

It looks to me like it shouldn't be hard to figure out what happens here
given the few tests I performed with the above command lines. On a few
iterations, reproducibility (with a given input command line) doesn't
seem to be an issue.

Mraw,
KiBi.


signature.asc
Description: Digital signature


Bug#647522: non-deterministic compression results with gzip -n9

2012-02-08 Thread Cyril Brulebois
Cyril Brulebois k...@debian.org (08/02/2012):
 Playing on amd64:
 cbrulebois@Cygnus:/tmp/ppl-0.11.2$ cp ../ppl-pristine/{CREDITS,README} .
 cbrulebois@Cygnus:/tmp/ppl-0.11.2$ gzip -9nf CREDITS README
 cbrulebois@Cygnus:/tmp/ppl-0.11.2$ ls -l *gz
 -rw-r--r-- 1 cbrulebois cbrulebois 6343 Feb  8 12:34 CREDITS.gz
 -rw-r--r-- 1 cbrulebois cbrulebois 8745 Feb  8 12:34 README.gz
 cbrulebois@Cygnus:/tmp/ppl-0.11.2$ cp ../ppl-pristine/{CREDITS,README} .
 cbrulebois@Cygnus:/tmp/ppl-0.11.2$ gzip -9nf README CREDITS
 cbrulebois@Cygnus:/tmp/ppl-0.11.2$ ls -l *gz
 -rw-r--r-- 1 cbrulebois cbrulebois 6344 Feb  8 12:34 CREDITS.gz
 -rw-r--r-- 1 cbrulebois cbrulebois 8745 Feb  8 12:34 README.gz
 
 It looks to me like it shouldn't be hard to figure out what happens here
 given the few tests I performed with the above command lines. On a few
 iterations, reproducibility (with a given input command line) doesn't
 seem to be an issue.

I think at least the attached patch won't hurt (when the DYN_ALLOC part
is fixed; and possibly turning that into a MEMSET-like macro).

And given dh_compress is passing files in an arbitrary order (it's using
“find” to detect files which needs to be compressed), I think we have
an explanation about the apparently-hard-to-reproduce issues.

Mraw,
KiBi.
diff --git a/gzip.c b/gzip.c
index b867350..1153bde 100644
--- a/gzip.c
+++ b/gzip.c
@@ -561,6 +561,19 @@ int main (int argc, char **argv)
 	SET_BINARY_MODE(fileno(stdout));
 	}
 while (optind  argc) {
+
+	/* Make sure buffers are reset to 0 to ensure reproducibility when handling several files */
+	ZEROIFY(uch, inbuf,  INBUFSIZ +INBUF_EXTRA);
+	ZEROIFY(uch, outbuf, OUTBUFSIZ+OUTBUF_EXTRA);
+	ZEROIFY(ush, d_buf,  DIST_BUFSIZE);
+	ZEROIFY(uch, window, 2L*WSIZE);
+#ifndef MAXSEG_64K
+	ZEROIFY(ush, tab_prefix, 1LBITS);
+#else
+	ZEROIFY(ush, tab_prefix0, 1L(BITS-1));
+	ZEROIFY(ush, tab_prefix1, 1L(BITS-1));
+#endif
+
 	treat_file(argv[optind++]);
 	}
 } else {  /* Standard input */
diff --git a/gzip.h b/gzip.h
index 5270c56..7a1e84b 100644
--- a/gzip.h
+++ b/gzip.h
@@ -119,11 +119,13 @@ extern int method; /* compression method */
   array = (type*)fcalloc((size_t)(((size)+1L)/2), 2*sizeof(type)); \
   if (!array) xalloc_die (); \
}
+#  error ZEROIFY needs an implementation, KiBi is lazy
 #  define FREE(array) {if (array != NULL) fcfree(array), array=NULL;}
 #else
 #  define EXTERN(type, array)  extern type array[]
 #  define DECLARE(type, array, size)  type array[size]
 #  define ALLOC(type, array, size)
+#  define ZEROIFY(type, array, size) { for (int i=0; isize; i++) { array[i] = 0; } }
 #  define FREE(array)
 #endif
 


signature.asc
Description: Digital signature


Bug#647522: non-deterministic compression results with gzip -n9

2012-02-08 Thread Fabian Greffrath

I think at least the attached patch won't hurt (when the DYN_ALLOC part
is fixed; and possibly turning that into a MEMSET-like macro).


Just an idea, but couldn't ZEROIFY in the DYN_ALLOC part be defined as 
free() and subsequent calloc() of the arrays preserving their size?


 - Fabian




--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#647522: non-deterministic compression results with gzip -n9

2012-02-08 Thread Riku Voipio
Thanks Cyril for tracking this down.

On Wed, Feb 08, 2012 at 01:15:00PM +0100, Cyril Brulebois wrote:
 I think at least the attached patch won't hurt (when the DYN_ALLOC part
 is fixed; and possibly turning that into a MEMSET-like macro).

 And given dh_compress is passing files in an arbitrary order (it's using
 “find” to detect files which needs to be compressed), I think we have
 an explanation about the apparently-hard-to-reproduce issues.

After some quick testing, ZEROIFY on th window is enough. However, clearing
all buffers is a good defensive strategy I think.

 Mraw,
 KiBi.

 diff --git a/gzip.c b/gzip.c
 index b867350..1153bde 100644
 --- a/gzip.c
 +++ b/gzip.c
 @@ -561,6 +561,19 @@ int main (int argc, char **argv)
   SET_BINARY_MODE(fileno(stdout));
   }
  while (optind  argc) {
 +
 + /* Make sure buffers are reset to 0 to ensure reproducibility when 
 handling several files */
 + ZEROIFY(uch, inbuf,  INBUFSIZ +INBUF_EXTRA);
 + ZEROIFY(uch, outbuf, OUTBUFSIZ+OUTBUF_EXTRA);
 + ZEROIFY(ush, d_buf,  DIST_BUFSIZE);
 + ZEROIFY(uch, window, 2L*WSIZE);
 +#ifndef MAXSEG_64K
 + ZEROIFY(ush, tab_prefix, 1LBITS);
 +#else
 + ZEROIFY(ush, tab_prefix0, 1L(BITS-1));
 + ZEROIFY(ush, tab_prefix1, 1L(BITS-1));
 +#endif
 +
   treat_file(argv[optind++]);
   }
  } else {  /* Standard input */
 diff --git a/gzip.h b/gzip.h
 index 5270c56..7a1e84b 100644
 --- a/gzip.h
 +++ b/gzip.h
 @@ -119,11 +119,13 @@ extern int method; /* compression method */
array = (type*)fcalloc((size_t)(((size)+1L)/2), 2*sizeof(type)); \
if (!array) xalloc_die (); \
 }
 +#  error ZEROIFY needs an implementation, KiBi is lazy
  #  define FREE(array) {if (array != NULL) fcfree(array), array=NULL;}
  #else
  #  define EXTERN(type, array)  extern type array[]
  #  define DECLARE(type, array, size)  type array[size]
  #  define ALLOC(type, array, size)
 +#  define ZEROIFY(type, array, size) { for (int i=0; isize; i++) { array[i] 
 = 0; } }
  #  define FREE(array)
  #endif
  






--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#647522: non-deterministic compression results with gzip -n9

2012-02-08 Thread Paul Eggert
Thanks very much for the patch.  But can someone who's looked into it
please explain why 'window' needs to be zeroed out?  This will save
me time in reviewing the patch.  Thanks.



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#647522: non-deterministic compression results with gzip -n9

2012-02-07 Thread Uli Martens
Hello there,

using a small tool I wrote a while back, it's possible to see that the
actuall differences are in the encoding of the last few bytes of
cleartext.

I've copied the decoding tool to http://youam.net/devel/rfc1952-dec

Attached is the difference of the decoded files from
20111208102738.ga7...@afflict.kos.to.

As to why this happens: no idea (yet)

Uli
--- NEWS.amd64.dec	2012-02-07 12:19:19.649431283 +0100
+++ NEWS.armel.dec	2012-02-07 12:19:26.853554306 +0100
@@ -3689,13 +3689,12 @@
  - lookback len 3 distance 3386
  - literal code 65 ('e')
  - literal code 72 ('r')
- - lookback len 5 distance 6517
+ - lookback len 5 distance 10888
  - eos
   skip:
 - 0
 - 0
 - 0
-- 0
   crc: f68d6c5a
   isize: 13696
 --- ''
--- ChangeLog.pre-2-2.amd64.dec	2012-02-07 12:17:54.787982067 +0100
+++ ChangeLog.pre-2-2.armel.dec	2012-02-07 12:18:02.960121626 +0100
@@ -11052,9 +11052,7 @@
  - literal code 63 ('c')
  - lookback len 12 distance 1661
  - lookback len 5 distance 3600
- - literal code 2e ('.')
- - literal code 0a
- - literal code 0a
+ - lookback len 3 distance 162
  - eos
   skip:
 - 0


Bug#647522: non-deterministic compression results with gzip -n9

2012-02-07 Thread Zack Weinberg
I've seen inexplicable nondeterminism like this before, and quite
often it's turned out to be controlled by the total size of the
command line argument area (that is, argv + environ + ELF auxv).
Changes in how big that is change the initial stack pointer address,
and while that *shouldn't* matter to anything, sometimes it does.

A shell loop of the form

export X=
i=0
while [ i -lt 8192 ]; do
perform test
X=${X}x
i=$((i+1))
done

should catch it if this is the cause.



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#647522: non-deterministic compression results with gzip -n9

2011-12-09 Thread Paul Eggert
On 12/08/11 02:27, Riku Voipio wrote:
 According to gzip RFC, the last 4 bytes are ISIZE, which should be
 uncompressed input size. Which leaves me rather baffled how that can
 differ on same input files - and how gunzip is completly happy with
 both versions of compressed file, producing the same output.

I looked only at NEWS.amd64.gz and NEWS.armel.gz.  For those two files
your diagnosis does not seem to be right, as these two files do not
differ in the last 4 bytes, but in bytes before then:

$ od -tx1 NEWS.amd64.gz NEWS.amd64.gz.od
$ od -tx1 NEWS.armel.gz NEWS.armel.gz.od
$ diff -u NEWS.*.gz.od
--- NEWS.amd64.gz.od2011-12-09 12:03:41.090594754 -0800
+++ NEWS.armel.gz.od2011-12-09 12:03:57.298663371 -0800
@@ -317,6 +317,6 @@
 0011700 fa 9f da 9b 92 ad 57 44 19 45 c5 42 e5 b6 d9 c2
 0011720 7e 80 02 bd 58 94 33 74 ba 0a 62 24 52 7b 35 33
 0011740 b2 87 51 76 b7 af cc 7f 09 b0 2d 14 d6 8d f9 4d
-0011760 94 51 39 49 cd 87 2e ff 0b 5a 6c 8d f6 80 35 00
+0011760 94 51 39 49 cd f7 50 ff 17 5a 6c 8d f6 80 35 00
 0012000 00
 0012001

So the differences are not in ISIZE.  Can you track down
what's actually differing and why?  (That would save some
time for me when debugging)  Thanks.



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#647522: non-deterministic compression results with gzip -n9

2011-12-09 Thread Paul Eggert
I should add that it's OK (from the point of view of
the RFCs) if gzip produces different outputs given the same
inputs when compressing.  The RFCs allow that and presumably
other gzip implementations do that.  All that's required is
that compress+decompress result in a copy of the original.

That being said, it's nicer if gzip is deterministic and it'd
be helpful to to get to the bottom of this and make
the nondeterminism go away in future versions of gzip.



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org