Public bug reported:

Binary package hint: apt

Hello! This is a feature request for Ubuntu Gutsy, though it can apply
to every dist that uses apt-get. I marked it as affecting apt-get, but I
think it would affect lots of other things if implemented.

AFAIK, every "apt-get update" checks each repository in turn and
downloads the package list from each one of those. For instance,
http://archive.ubuntu.com/ubuntu/dists/gutsy/main/binary-i386/Packages.bz2
is one that is used on my system. If the last file downloaded is the
unchanged on the server, this step is skipped, but otherwise the file is
re-downloaded completely.

This is inefficient for most repositories. The uncompressed Packages
file's contents is just a lot of text (6.3 MB last time I looked), and
only a small part of that changes for the vast majority of "apt-get
update" operations. It would be a very good candidate for diff-based
compression. Even if in its compressed form it takes only around a MB,
which is relatively small, it's still very inefficient to download it
each time.

I suggest an optional extension to the apt protocol should be added to
allow incremental updates: the server should keep in the same folder
with the Packages.* files a set of diffs calculated from the last few
versions of the Packages file. I suggest they be named "Packages-[hash]"
(with the gz/bz2 compressions), where "[hash]" is the hash of the
contents of the base file against which the diff is calculated. (This
would eliminate any version-number complexity.) For instance, if the
last three Packages files contained "1",  "2" and "3", and the current
file contains something else, the contents of the directory would be:

Packages.gz
Packages-c4ca4238a0b923820dcc509a6f75849b.diff.gz
Packages-c81e728d9d4c2f636f067f89cc14862c.diff.gz
Packages-eccbc87e4b5ce2fe28308fd9f2a7baf3.diff.gz

(I used MD5 here, we could use SHA256 or a better hash.) If the last
package list my apt downloaded was the one containing "2", apt would
hash it (it can do that when it's downloaded, anyway, for checking), see
that its contains hash to "c4ca4238a0b923820dcc509a6f75849b", and try to
download the corresponding file, if it finds it. The file would contain
the diff between the "2" package list and the latest version on the
server.

Because the diffs would be very small (and compressed), the server could
remember lots of them (for a couple of days or so) without a lot of
storage overhead.

The problem is updating the diffs each time the Packages file is changed. One 
way would be to simply remember the old versions of the Packages file, and 
re-calculate the diffs completely. That would be wasteful, however, both in 
space and in the amount of space it would take.
It could also keep only diffs between successive versions, but that's more 
complicated (apt would have to download the series, and apply them in turns).

There are also diff formats/algorithms that can be combined. (I think
that's what it's called.) This means that for versions 1, 2 and three of
a file, if you know diff1->2 and diff2->3 you can calculate diff1->3
directly, without looking at the complete files. If we used one of those
algorithms, for each update the server would need to (1) calculate the
diff between the last two versions and (2) combine that diff with the
other diffs it knows. (And probably delete the oldest.)

A nice thing is that it doesn't have to do this at once. It can move
everything to an "old" directory, put the new Package file in its place,
and then start adding the updated diffs. If anyone tries to do an update
in the mean-time, apt will look for the diff, won't find it, and it will
fall back to the downloading the complete package file. (Alternatively,
the server could put the diff calculation on hold, compute on-the-fly
the diff needed---each pair of diffs is easy to combine, it's just that
there's lots of them---send it, and resume with the others.) This also
means that the system will be perfectly backwards-compatible, because
old apt versions (or other programs) can just pick up the complete file.

Since many users have the Update Manager check for updates daily (it's a
built-in option), this would reduce part of the bandwidth used
significantly. It would also make manual updates speedier.

** Affects: apt (Ubuntu)
     Importance: Undecided
         Status: New

-- 
"package lists" should be downloaded as diffs (Contents-i386.gz)
https://bugs.launchpad.net/bugs/144083
You received this bug notification because you are a member of Ubuntu
Bugs, which is the bug contact for Ubuntu.

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to