On Sun, Jan 30, 2022 at 05:35:37PM +0100, Johannes Schauer Marin Rodrigues wrote:
> Quoting Colin Watson (2022-01-30 15:03:30)
> > I'm a bit confused, because this seems to work at the wrong layer.
> > Debian packages are supposed to preserve timestamps from the source
> > package wherever possible, and failing that it would be possible to
> > ensure that the timestamp of generated manual pages in binary packages
> > is set to SOURCE_DATE_EPOCH. Flattening timestamps to an epoch at mandb
> > time seems like the wrong place for this at first inspection, and I'd
> > like some clearer rationale for why you ended up with this approach.
> >
> > I would suggest instead ensuring that mtimes of manual pages are
> > reproducible, after which mandb should produce reproducible databases
> > (and if it doesn't I'd consider that a bug).
> >
> > Deliberately setting database timestamps that don't match the filesystem
> > will confuse mandb into doing unnecessary work in later runs, so I don't
> > like this approach.
>
> My reasoning was, that tools that care about reproducible index.db will
> "flatten" the mtimes to SOURCE_DATE_EPOCH in the tarball or image they
> produce, so setting the timestamp in index.db to SOURCE_DATE_EPOCH for
> those timestamps larger than SOURCE_DATE_EPOCH seemed like the approach
> that would result in a consistent overall state.

It might not be entirely impossible, but I'd really prefer not to break
the link between filesystem timestamps and database timestamps if we can
avoid it. I know your implementation is more or less like "tar
--mtime=DATE --clamp-mtime", but the difference is that tar doesn't also
compare filesystem timestamps against the archive later.

> But if that's the wrong approach, lets think of the alternative: making sure
> that the mtimes of manual pages is reproducible. If I use gdbm_dump on the
> index.db of two different chroots, then it looks like the following manual
> pages have differing timestamps:
>
> bash-builtins, which, dash, mawk, pager, awk, sh, more, nawk, builtins
>
> Most of those seem to be symlinks into /etc/alternatives and those symlinks
> get created by maintainer scripts using update-alternatives. Are you
> suggesting that update-alternatives should gain support for setting the
> mtime of the files it creates to SOURCE_DATE_EPOCH?

I think that would at least be worth considering. It doesn't seem any
less obvious a thing to do for reproducible installs than hacking mandb
would be, and it would deal with the problem closer to its source: for
instance, it would get you closer to being able to produce
bitwise-identical reproducible images by e.g. tarring up the filesystem,
which would preserve filesystem mtimes in the image. (Though I guess
--clamp-mtime deals with that, but maybe not all image archiving tools
have something like that?)

Another approach might be to modify filesystem timestamps after
postinsts have finished running but before mandb runs to clamp
timestamps to SOURCE_DATE_EPOCH; a bit like your proposed patch, but
actually modifying the filesystem timestamps as well. I'm not sure
where that could go, though. It can't be in mandb because the postinst
deliberately doesn't run mandb as root; and of course mandb is itself
run from a postinst.
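
As a rough illustration of the clamping idea above (not part of man-db or
dpkg, and not a tested proposal), a post-processing step could lower any
filesystem mtime newer than SOURCE_DATE_EPOCH to that value. The function
name clamp_mtimes and its arguments are hypothetical, and the sketch
assumes a platform such as Linux where os.utime can act on a symlink
itself rather than its target.

```python
import os


def clamp_mtimes(root, epoch):
    """Hypothetical helper: lower every mtime under `root` that is newer
    than `epoch` (seconds since the Unix epoch) to `epoch`.

    Symlinks are clamped themselves rather than followed, since the
    differing entries discussed here are mostly symlinks created by
    update-alternatives. Returns the list of paths that were changed.
    """
    epoch_ns = int(epoch) * 10**9
    changed = []

    # Collect first, so that the walk is finished before anything is touched.
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))

    for path in paths:
        st = os.lstat(path)
        if st.st_mtime_ns > epoch_ns:
            # Keep the access time, only lower the modification time.
            os.utime(path, ns=(st.st_atime_ns, epoch_ns),
                     follow_symlinks=False)
            changed.append(path)
    return changed
```

Something like clamp_mtimes("/usr/share/man", int(os.environ["SOURCE_DATE_EPOCH"]))
would have to run as root after all postinsts have finished, with the
database then rebuilt separately as the man user. Whether
/etc/alternatives or other trees would need the same treatment is left
open here.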
Maybe some kind of dpkg hook, or maybe it would be simplest to just run
a post-processing step that clamps all the filesystem timestamps and
then runs the equivalent of "sudo -u man mandb -cq"? (This might be
more palatable with man-db 2.10.0, where this will take more like 10
seconds rather than several minutes; see #1003089.)

> I'm puzzled by bash-builtins though because that one is not a symlink. So I
> don't understand why the timestamp differs there.

This puzzled me for a while too, but it's because
/usr/share/man/man7/builtins.7.gz is a symlink created by
update-alternatives and references bash-builtins in its NAME, which
provoked https://bugs.debian.org/691643. I've now fixed that upstream:

  https://gitlab.com/cjwatson/man-db/-/commit/37ab864354c1d0ac09e27d2346a1221bf4628509

This may cause your comparisons to show more differences, but it should
mean that they're more reliably the *same* differences. Previously, the
behaviour depended on directory iteration order (actually usually the
location of the first physical extent of each file on disk, since mandb
sorts by that for improved performance on rotational disk drives).

-- 
Colin Watson (he/him)                              [cjwat...@debian.org]