Tags: patch
I've also observed this problem: Jigdo ignores its cache and re-scans
all the files every time. I did some debugging and tracked it down to a
one-line mistake that's easy to fix:
diff --git a/src/scan.cc b/src/scan.cc
index 9ce598e..b031680 100644
--- a/src/scan.cc
+++ b/src/scan.cc
@@ -109,7 +109,7 @@ size_t FilePart::unserializeCacheEntry(const Ubyte* data, size_t dataSize,
   Paranoid(serialSizeOf(md5Sum) == 16);
   Paranoid(serialSizeOf(sha256Sum) == 32);
   // All blocks of file present?
-  if (blocks == MD5sums.size() + SHA256sums.size()) {
+  if (blocks == MD5sums.size() && blocks == SHA256sums.size()) {
     setFlag(MD_VALID);
     data = unserialize(md5Sum, data);
     data = unserialize(sha256Sum, data);
A cache entry contains hashes of individual 1k blocks of the file, and
this code is checking that the entry contains the expected number of
them. The number of blocks is simply the file's size divided by the
block size (1k), rounded up, and the cache entry should contain that
many MD5 block hashes, followed by the same number of SHA256 block
hashes. The "blocks" variable is the length of *each* of the two block
lists in the entry (since they're always equal-length), not their sum.
So, the bug is that the deserialization code thought the cached hash
data was invalid, because it expected the wrong number of blocks.
Using this patch, I upgraded my set of Debian 11.3 DVD images to 11.4,
and it behaved as expected: scanned all the input files (the union of
all the 11.3 DVD contents) in the first run, to produce the first
output file, and then re-used the cached hashes in later runs to
produce the other files.
(The rest of the function's code is a little confusing, btw: if the
number of blocks doesn't match, it clears the MD_VALID flag and ignores
the whole-file hashes, but then proceeds to deserialize the block
hashes anyway, in a way that'll go off the end of the MD5sums and
SHA256sums vectors if they're not long enough. But there's a debug
assertion earlier that MD5sums and SHA256sums are sized correctly based
on the file size, and a cache entry *should* always have the correct
number of block hashes based on the file size, so I don't think it's a
problem in practice. Brittle code, though.)