Hi,

From time to time the question arises on different forums whether it is possible to use rsync efficiently with apt-get. Recently there has been a thread here on debian-devel, and it was also mentioned in Debian Weekly News, June 24th, 2003. However, I have only seen small parts of a huge and complex problem set discussed in different places; I haven't found an overview of the whole situation anywhere.

Being one of the developers of the Hungarian distribution ``UHU-Linux'', I spent some time in the last few days collecting as much information as possible, putting the patches together and coding a little bit to fill some minor gaps. Here I'd like to summarize and share all my experiences.

Our distribution uses dpkg/apt for package management. We are not Debian based though; even our build procedure which leads to deb packages is completely different from Debian's, except for the last step (which is obviously a ``dpkg-deb --build'').

Some of our resources are quite tight. This is especially true for the bandwidth of the home machines of the developers and testers. Most of us live behind a 384kbit/s ADSL line.

From time to time we rebuild all our packages to see if they still compile with our current packages. Such a full rebuild produces 1.5GB of new packages, all with a new filename since the release numbers are automatically bumped. Our goal was that an upgrade after such a full rebuild should require only a reasonable amount of network traffic instead of one and a half gigabytes.

Before telling how we succeeded, I'd like to demonstrate the result.

One of my favorite games is quadra. The size of the package is nearly 3MB. I purged it from my system and then performed an ``apt-get install quadra''. Apt-get printed this, amongst other things:

  Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.8 [2931kB]
  Fetched 2931kB in 59s (49,0kB/s)

The download speed and time correspond to the 384kbit/s bandwidth (384kbit/s is 48kB/s).

I recompiled the package on the server. Then I typed ``apt-get update'' followed by ``apt-get install quadra'' again. This time apt-get printed this:

  Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.9 [2931kB]
  Fetched 2931kB in 3s (788kB/s)

Yes, downloading only took three seconds instead of one minute.
Obviously these two files differ in more than their filename: they contain their release number, timestamps of files and perhaps other pieces of data which make them different. Needless to say, a small change in the package would only slightly increase the download time. The speedup is usually approx. 2x--3x for packages containing lots of small files, but can be extremely high for packages containing bigger files.

The rest of my mail describes the implementation details.


rsyncable gzip files
--------------------

A small change in a file causes its gzipped version to get out of sync, and hence rsync doesn't see any common parts in them. There's a patch by Rusty Russell floating around on the net which adds an --rsyncable option to gzip. It is already included in Debian. This way gzipped files have synchronization points, making rsync's job much easier. The patch is available (amongst others) at [1a] and [1b].

The documentation in the original patch says ``This reduces compression by about 1 percent most cases''. Debian's version says ``This increases size by less than 1 percent most cases''. The size increase was 0.7% for all our packages, but 1.2% for our most important packages (the core distrib in about 300--400MB). This 1% is very low if you think of it as 1%. If you think of it as losing 6MB on every CD, then, well, it could have been smaller. But if you think of what you gain with it, then it is definitely worth it.

The same patch also exists for zlib (see [2a] or [2b]). However, while with gzip you can control this behaviour with a command line option, it is not so trivial to do so with a library. The official patch disables rsyncable support by default. You can enable it by changing "zlib_rsync = 0" to "zlib_rsync = 1" within zlib's source, or you can control it from your running application. As I didn't like these approaches, I added a small patch so that setting the ZLIB_RSYNC environment variable turns on the rsyncable support. This patch is at [3].
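
To illustrate why synchronization points help, here is a toy Python sketch. It is not the gzip or zlib patch, and it does not reproduce their algorithm or parameters: the window size, the modulus and all function names are assumptions made up for this example. It restarts the compressor whenever a rolling sum over the last few input bytes hits a fixed value, so the restart positions depend only on nearby content. After a small edit, everything past the next such point compresses to the same bytes as before.

```python
import random
import zlib

WINDOW = 64    # toy rolling-window size (assumption, not the patch's value)
MODULUS = 256  # sync point when the window sum is divisible by this (assumption)


def sync_points(data):
    """Yield offsets just past each byte where the rolling sum is 0 mod MODULUS."""
    total = 0
    for i, byte in enumerate(data):
        total += byte
        if i >= WINDOW:
            total -= data[i - WINDOW]  # keep the sum over the last WINDOW bytes
            if total % MODULUS == 0:
                yield i + 1


def compress_with_sync_points(data):
    """Compress each content-defined chunk as an independent zlib stream."""
    chunks, start = [], 0
    for cut in sync_points(data):
        chunks.append(zlib.compress(data[start:cut]))
        start = cut
    if start < len(data):
        chunks.append(zlib.compress(data[start:]))
    return b"".join(chunks)


def decompress_chunks(blob):
    """Inverse of compress_with_sync_points: decode the concatenated streams."""
    out = []
    while blob:
        stream = zlib.decompressobj()
        out.append(stream.decompress(blob))
        blob = stream.unused_data  # bytes belonging to the following streams
    return b"".join(out)


def common_suffix_len(a, b):
    """Number of identical trailing bytes shared by a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


# Compressible pseudo-random test data, and a copy with one byte inserted.
rng = random.Random(0)
old = bytes(rng.choice(b"abcdefgh ") for _ in range(100_000))
new = old[:100] + b"X" + old[100:]

plain_old, plain_new = zlib.compress(old), zlib.compress(new)
sync_old, sync_new = compress_with_sync_points(old), compress_with_sync_points(new)

print("plain:  shared tail", common_suffix_len(plain_old, plain_new), "of", len(plain_new))
print("synced: shared tail", common_suffix_len(sync_old, sync_new), "of", len(sync_new))
```

With such tiny chunks the toy pays a much higher size penalty than the roughly 1% quoted above; the point is only that the shared tail of the two compressed outputs is something rsync can reuse.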

As dpkg seems to link statically against zlib, we had to recompile dpkg after installing this patched zlib. After this we changed our build script so that it invokes ``dpkg-deb --build'' with the ZLIB_RSYNC environment variable set to some value.


order of files
--------------

dpkg-deb puts the files in the .deb package in random order. I hate this misfeature since it's hard to eye-grep anything from ``dpkg -L'' or F3 in mc. We run ``dpkg-deb --build'' using the sortdir library ([4a], [4b]) which makes the files appear in the package in alphabetical order.

I don't know how efficient rsync is if you split a file into some dozens or even hundreds of parts and shuffle them, and then synchronize the result with the original version. Anyway, I'm sure that sorting the files cannot hurt rsync, it can only help. I can only guess that it really does help a lot.


similar filenames in rsync
--------------------------

Whenever we rebuild a package, it gets a different filename, as the release number is increased. If a file has a different name, it is a completely different file in rsync's eyes. There's a patch for rsync (yet again by Rusty) which adds support for fuzzy filenames: when downloading a file, the local file with the most similar filename is used as the starting point. This patch is available inside the official rsync 2.5.6 tarball or at [5]; however, it only applies to rsync 2.5.4. Unfortunately I was unable to port this patch to 2.5.6 in a reasonable time, so we have an rsync 2.5.6 package without fuzzy support, and an rsync-fuzzy 2.5.4 package.


rsync method in apt
-------------------

Sviatoslav Sviridoff created a patch for apt which adds rsync support (rsync needs to be patched, too). See it at [6]. It applies cleanly to apt 0.5.5.1. Decoded versions of these base64 patches are also available at [7] (for apt), [8a] (for rsync, ported to 2.5.4) or [8b] (for rsync, ported to 2.5.6) and [9] (for rsync versions up to 2.5.5; it's already included in rsync 2.5.6).
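
The fuzzy filename idea from the ``similar filenames in rsync'' section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses difflib's similarity ratio from the Python standard library, which is not necessarily how the rsync fuzzy patch scores names, and the function names, the 0.5 threshold and the example package filenames are all hypothetical.

```python
import difflib
import os


def most_similar(wanted, candidates, threshold=0.5):
    """Return the candidate name most similar to `wanted`, or None.

    Similarity is difflib's ratio (0.0 to 1.0); the threshold is an
    arbitrary value chosen for this sketch.
    """
    best_name, best_score = None, threshold
    for name in sorted(candidates):
        score = difflib.SequenceMatcher(None, wanted, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def find_basis(wanted, directory):
    """Pick a basis file for `wanted` among the files in `directory`."""
    name = most_similar(wanted, os.listdir(directory))
    return os.path.join(directory, name) if name else None


# Hypothetical cache contents; the filename layout is made up.
cached = [
    "bash_2.05b-3_i386.deb",
    "quadra_1.1.8-2.8_i386.deb",
    "quake_1.0-1_i386.deb",
]
print(most_similar("quadra_1.1.8-2.9_i386.deb", cached))
```

Which directory gets passed to a helper like find_basis matters as much as the scoring does: the old packages live in one directory while new downloads arrive in another.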


the gap
-------

Sviatoslav's patch makes apt use rsync, but it has nothing to do with similar filenames: it downloads the files from scratch. Hence it is useful for replacing brain-damaged FTP with a sane protocol; however, it cannot save network traffic on its own.

Apt asks its method helper binary (http, ftp, rsync...) to download the files into a temporary directory (/var/cache/apt/archives/partial) and later moves the files to their final place (/var/cache/apt/archives). However, rsync --fuzzy only looks for similar filenames in the directory where the new file is downloaded to. The solution would be to use the --compare-dest option of rsync if it worked the way I expect it to work. However, it works differently, see [10] for details.

To fill this gap I created a quick&ugly patch for rsync 2.5.4 which introduces a --compare-fuzzy-dest option which does what we need for apt. Get it from [11].

Furthermore, apt also needs a minor patch to call rsync with the new options [12]. (This patch is ugly since it contains a hard-coded path (/var/cache/apt/archives). It also renames the default executable to rsync-fuzzy, which might not be what you want.)


conclusion
----------

The good news is that it is working perfectly. The bad news is that you can't hack it on your home computer as long as your distribution doesn't provide rsync-friendly packages. Maybe one could set up a public rsync server with high bandwidth that keeps syncing the official packages and repacks them with rsync-friendly gzip/zlib and sorted files.


cheers,
Egmont

Ps. Please CC me if you reply, I'm not subscribed.

[1a] http://ozlabs.org/~rusty/gzip.rsync.patch2
[1b] https://svn.uhulinux.hu/packages/dev/gzip/patches/01-rsync.patch
[2a] http://moin.conectiva.com.br/files/CompressedRsync/attachments/zlib-1.1.4-rsync.patch
[2b] https://svn.uhulinux.hu/packages/dev/zlib/patches/02-rsync.patch
[3] https://svn.uhulinux.hu/packages/dev/zlib/patches/03-rsync-from-env.patch
[4a] http://freshmeat.net/projects/sortdir/
[4b] ftp://ftp.uhulinux.hu/pub/sources/sortdir/sortdir-0.3.1.tar.gz
[5] https://svn.uhulinux.hu/packages/dev/rsync-fuzzy/patches/02-fuzzy.patch
[6] http://distro2.conectiva.com.br/pipermail/apt-rpm/2003-January/001085.html
[7] https://svn.uhulinux.hu/packages/dev/apt/patches/03-rsync-method.patch
[8a] https://svn.uhulinux.hu/packages/dev/rsync-fuzzy/patches/04-apt-support.patch
[8b] https://svn.uhulinux.hu/packages/dev/rsync/patches/02-apt-support.patch
[9] https://svn.uhulinux.hu/packages/dev/rsync-fuzzy/patches/03-cleanup.patch
[10] http://lists.samba.org/pipermail/rsync/2003-July/011209.html
[11] https://svn.uhulinux.hu/packages/dev/rsync-fuzzy/patches/05-compare-fuzzy-dest.patch
[12] https://svn.uhulinux.hu/packages/dev/apt/patches/04-rsync-method-fuzzy.patch

If you can't find a file under https://svn.uhulinux.hu/ then try to list directories and take a look at other files. If the directory ``rsync-fuzzy'' doesn't exist then it means I've managed to port the fuzzy patch to 2.5.6 and hence look for them under the directory ``rsync''.