The timestamp stored in the gzip file header makes the compressed output
non-deterministic even when the input is identical. Many files (e.g. under
indices/) are frequently regenerated with identical contents, but their
compressed versions end up slightly different. This unnecessarily inflates
the number of unique file hashes that snapshot.debian.org has to deal with,
for example. It may also make mirror updates less efficient.
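
To illustrate the effect, here is a minimal Python sketch (not the archive's
actual tooling; the sample data and timestamps are made up). It compresses
the same bytes with two different header timestamps and then with a fixed
one, assuming Python 3.8+ for the mtime argument of gzip.compress:

```python
import gzip

data = b"identical contents\n"

# Same input, different header timestamps. The MTIME field occupies
# bytes 4-7 of the gzip header (RFC 1952).
a = gzip.compress(data, mtime=1000000000)
b = gzip.compress(data, mtime=1000000060)

# Fixed timestamp: the output no longer depends on when it was produced.
c = gzip.compress(data, mtime=0)
d = gzip.compress(data, mtime=0)

header_differs = a[4:8] != b[4:8]   # only the timestamp bytes change
rest_is_same = a[8:] == b[8:]       # everything after MTIME is identical
deterministic = c == d              # fixed timestamp gives equal output
```

The two outputs with different timestamps differ only in those four header
bytes, which is enough to change the checksum of the whole file.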

I encountered this while trying to find out when certain changes were made:
I compared checksums of indices/files/components/suite-stable.list.gz
across updates and found that the checksum changes on every update. The
same applies to many other compressed files.

The quick fix is to add "--no-name" to the gzip command (or GZIP=-n to the
environment). A better fix would be to generate a temporary file, compare
it to the current file and replace the current file only if the two differ.
That would preserve the timestamp of a file whose contents did not change
and should help some mirroring protocols.
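
The compare-and-replace idea could look roughly like the following Python
sketch. The helper name update_gz is hypothetical, and the sketch assumes
the compressed output is made deterministic first (mtime=0 here, matching
the effect of -n on the timestamp), since otherwise a byte-for-byte
comparison of the compressed files would never match:

```python
import filecmp
import gzip
import os
import tempfile


def update_gz(path, data):
    """Compress data to path, replacing the file only if it would change.

    Returns True if path was (re)written, False if it was left untouched.
    update_gz is a hypothetical helper, not part of any existing tool.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one
    # filesystem. Note: mkstemp creates the file with mode 0600.
    fd, tmp = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            # mtime=0 keeps the gzip header free of the current time.
            f.write(gzip.compress(data, mtime=0))
        # shallow=False forces a comparison of the file contents.
        if os.path.exists(path) and filecmp.cmp(tmp, path, shallow=False):
            os.unlink(tmp)  # identical: keep the old file and its mtime
            return False
        os.replace(tmp, path)  # atomic rename over the old file
        return True
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Calling update_gz twice with the same data rewrites the file only the first
time, so the modification time of an unchanged file stays as it was.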

Oren
