Public bug reported: Binary package hint: apt
Hello! This is a feature request for Ubuntu Gutsy, though it can apply to every dist that uses apt-get. I marked it as affecting apt-get, but I think it would affect lots of other things if implemented. AFAIK, every "apt-get update" checks each repository in turn and downloads the package list from each one of those. For instance, http://archive.ubuntu.com/ubuntu/dists/gutsy/main/binary-i386/Packages.bz2 is one that is used on my system. If the last file downloaded is the unchanged on the server, this step is skipped, but otherwise the file is re-downloaded completely. This is inefficient for most repositories. The uncompressed Packages file's contents is just a lot of text (6.3 MB last time I looked), and only a small part of that changes for the vast majority of "apt-get update" operations. It would be a very good candidate for diff-based compression. Even if in its compressed form it takes only around a MB, which is relatively small, it's still very inefficient to download it each time. I suggest an optional extension to the apt protocol should be added to allow incremental updates: the server should keep in the same folder with the Packages.* files a set of diffs calculated from the last few versions of the Packages file. I suggest they be named "Packages-[hash]" (with the gz/bz2 compressions), where "[hash]" is the hash of the contents of the base file against which the diff is calculated. (This would eliminate any version-number complexity.) For instance, if the last three Packages files contained "1", "2" and "3", and the current file contains something else, the contents of the directory would be: Packages.gz Packages-c4ca4238a0b923820dcc509a6f75849b.diff.gz Packages-c81e728d9d4c2f636f067f89cc14862c.diff.gz Packages-eccbc87e4b5ce2fe28308fd9f2a7baf3.diff.gz (I used MD5 here, we could use SHA256 or a better hash.) If the last package list my apt downloaded was the one containing "2", apt would hash it (it can do that when it's downloaded, anyway, for checking), see that its contains hash to "c4ca4238a0b923820dcc509a6f75849b", and try to download the corresponding file, if it finds it. The file would contain the diff between the "2" package list and the latest version on the server. Because the diffs would be very small (and compressed), the server could remember lots of them (for a couple of days or so) without a lot of storage overhead. The problem is updating the diffs each time the Packages file is changed. One way would be to simply remember the old versions of the Packages file, and re-calculate the diffs completely. That would be wasteful, however, both in space and in the amount of space it would take. It could also keep only diffs between successive versions, but that's more complicated (apt would have to download the series, and apply them in turns). There are also diff formats/algorithms that can be combined. (I think that's what it's called.) This means that for versions 1, 2 and three of a file, if you know diff1->2 and diff2->3 you can calculate diff1->3 directly, without looking at the complete files. If we used one of those algorithms, for each update the server would need to (1) calculate the diff between the last two versions and (2) combine that diff with the other diffs it knows. (And probably delete the oldest.) A nice thing is that it doesn't have to do this at once. It can move everything to an "old" directory, put the new Package file in its place, and then start adding the updated diffs. If anyone tries to do an update in the mean-time, apt will look for the diff, won't find it, and it will fall back to the downloading the complete package file. (Alternatively, the server could put the diff calculation on hold, compute on-the-fly the diff needed---each pair of diffs is easy to combine, it's just that there's lots of them---send it, and resume with the others.) This also means that the system will be perfectly backwards-compatible, because old apt versions (or other programs) can just pick up the complete file. Since many users have the Update Manager check for updates daily (it's a built-in option), this would reduce part of the bandwidth used significantly. It would also make manual updates speedier. ** Affects: apt (Ubuntu) Importance: Undecided Status: New -- "package lists" should be downloaded as diffs (Contents-i386.gz) https://bugs.launchpad.net/bugs/144083 You received this bug notification because you are a member of Ubuntu Bugs, which is the bug contact for Ubuntu. -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs