On 24/11/22 04:51, Andrew Gregory wrote:
On 11/23/22 at 10:27pm, Allan McRae wrote:
The idea of package deltas just won't go away...  However, binary diffs
really are not ideal with pacman verifying the compressed package - that
means we need to reconstruct the package on the user's system to verify. Also
our old approach using xdelta3 somewhat died when moving packages away from
gz (or xz?) compression.  Other binary diff approaches really suffered the
same issue.  In general, I find the approach of reconstructing the full
package to be suboptimal.  I also don't particularly want to verify
uncompressed packages.


I wondered if this was a case of perfect being the enemy of good, so I have
investigated a different, very lazy approach.  Instead of taking a binary
diff, we could just provide the files that have changed between package
versions.  This is super easy to do as we have checksums for all files in
the mtree file.  We could then extract this "diff" package directly, and use
the mtree file to adjust timestamps/permissions/etc(?) on kept files, and it
would be just like the full package had been installed.
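A minimal sketch of the file selection described above, assuming the two mtree files have already been parsed into plain `{path: sha256}` mappings. The function name, the mappings and the example digests are hypothetical illustrations, not pacman or makepkg code:

```python
def pkgdiff_members(old_mtree, new_mtree):
    """Split the files of two package versions for a file-level diff.

    Both arguments are assumed to be {path: sha256 hex digest} mappings
    built from each version's mtree file (parsing is not shown here).
    Returns (to_ship, kept, removed) as sorted lists of paths.
    """
    # New or changed files must be shipped in the diff package.
    to_ship = sorted(p for p, d in new_mtree.items() if old_mtree.get(p) != d)
    # Unchanged files stay on disk; only their metadata would be adjusted.
    kept = sorted(p for p, d in new_mtree.items() if old_mtree.get(p) == d)
    # Files that exist only in the old version have to be deleted.
    removed = sorted(p for p in old_mtree if p not in new_mtree)
    return to_ship, kept, removed


# Placeholder digests, purely for illustration.
old = {"usr/bin/foo": "aaa", "usr/lib/libfoo.so": "bbb", "usr/share/doc/foo/OLD": "ccc"}
new = {"usr/bin/foo": "aab", "usr/lib/libfoo.so": "bbb", "usr/share/man/man1/foo.1": "ddd"}

to_ship, kept, removed = pkgdiff_members(old, new)
```

Only `to_ship` would go into the "diff" package. The `kept` list is what the new mtree file would be used for when adjusting timestamps and permissions.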

As I understand your intended approach, operations using a diff package would
be fundamentally different than those involving a full package.  Files changed
on the system but unchanged in the package would not be restored.  Once
upgraded, the cached diff package would be useless for
reinstallation/downgrading without downgrading to the previous version first
then upgrading again using the diff. `pacman -S foo` to reinstall would no
longer work without downloading the full package.

It's not ideal, but I think those are reasonable caveats.  People generally
shouldn't be messing with non-backup files anyway and as long as they manage
their cache properly, reinstallation and downgrading using the cache are still
possible.

That is correct. I thought about special-casing "pacman -S pkg" so that a reinstall always uses the full package. But maybe a flag is better.

I ran some numbers to see if this was worthwhile.  The results for the last
bunch of updates for bash, coreutils, qt5-base and systemd are given here:
https://wiki.archlinux.org/title/User:Allan/Pkgdiff

On major version updates, this approach is a waste of time.  But for
minor updates bash download would average 25% of the size, coreutils about
36% (though was ~1% for simple rebuilds!), qt5-base about 40% and systemd
60%.  Not shown but worth noting that when Arch changes gcc/binutils
versions or updates CFLAGS etc, this can stop any binary diff being as
useful.


If we implemented using these diffs but only allowed it for updates from the
previous package version (i.e. no diffs to package (current - 2) or earlier,
or diff chaining), then this would be rather simple to implement (at least
from the pacman side...).

I agree with no diff chaining; keeping them as separate partial packages
instead of reconstructing a full package would make chaining a little
complicated.  I'm not sure about the previous-version-only rule though.  The db
is going to have to know the base version for the partial package either way,
so the cost of supporting multiple bases seems low as far as we're concerned;
just a simple search through the available partial files for one based on the
currently installed version.
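The "simple search" could look roughly like the sketch below. It assumes the db lists each partial package with its base and target versions; the field names, version strings and filenames are invented for illustration and do not reflect any existing pacman db format:

```python
def find_pkgdiff(pkgdiffs, installed_version, target_version):
    """Return the partial package built against the installed version, if any.

    `pkgdiffs` is assumed to be a list of dicts with hypothetical keys
    "base", "target" and "filename". A result of None means the caller
    should fall back to downloading the full package.
    """
    for entry in pkgdiffs:
        if entry["base"] == installed_version and entry["target"] == target_version:
            return entry
    return None


# Hypothetical db entries for one package with two supported bases.
available = [
    {"base": "1.0-1", "target": "1.0-3", "filename": "foo-1.0-1_to_1.0-3.pkgdiff"},
    {"base": "1.0-2", "target": "1.0-3", "filename": "foo-1.0-2_to_1.0-3.pkgdiff"},
]

match = find_pkgdiff(available, "1.0-1", "1.0-3")
fallback = find_pkgdiff(available, "0.9-1", "1.0-3")
```

With no chaining, a missing exact base is never bridged through intermediate diffs; the lookup simply fails and the full package is used.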

This was a thought to keep the database from ballooning in size. I guess the minimum per pkgdiff is a filename and sha256sum. Probably a signature too if --include-sigs is used. Also, hosting the pkgdiffs would get big if many past versions were used - I guess that is a distro problem.

In practice, I guess distros only providing pkgdiffs up to a threshold of a package's size, and for a given update window, would remove this as an issue.
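As a rough sketch, such a policy on the distro side might be expressed as below. The 50% size ratio and the 90-day window are arbitrary placeholder values, and the function is hypothetical rather than part of any existing repo tooling:

```python
from datetime import date, timedelta


def should_publish_pkgdiff(diff_size, full_size, base_released, today,
                           max_ratio=0.5, window=timedelta(days=90)):
    """Decide whether a pkgdiff is worth hosting and listing in the db.

    max_ratio and window are assumed, distro-chosen limits: the diff
    must be at most max_ratio of the full package size, and the base
    version must have been released within the update window.
    """
    if full_size <= 0:
        return False
    small_enough = diff_size / full_size <= max_ratio
    recent_enough = today - base_released <= window
    return small_enough and recent_enough


today = date(2022, 11, 24)
minor_update = should_publish_pkgdiff(25, 100, date(2022, 11, 1), today)
large_diff = should_publish_pkgdiff(60, 100, date(2022, 11, 1), today)
stale_base = should_publish_pkgdiff(25, 100, date(2022, 1, 1), today)
```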

Allan
