On Sun, 2002-04-07 at 11:16, Otto Wyss wrote:
> What amazes me is that nobody is able or willing to provide any figures.
> So I guess no provider of an rsync server is interested in this subject
> and therefore it can't be a big problem.
Here are some experiments, and a mathematical analysis of different
approaches. The only missing piece of data that I had to fudge is: how
often do people run apt-get update? If anyone can give me server logs
containing If-Modified-Since fields, that would be great.

Quick Summary:
--------------
Diffs compressed with bzip2 generated the smallest difference files, and
hence the smallest downloads. Using the scheme described below, I
estimate that mirrors retaining 20 days' worth of diffs will need about
159K of disk space per Packages file, and the average 'apt-get update'
will transfer about 24.2K per changed Packages file.

xdelta would have slightly higher bandwidth and disk space requirements,
but would be applicable to binary files (such as debs).

rsync has no disk space requirements, but uses 10 times as much
bandwidth and requires more memory, CPU power, etc., on the server.
rsync also has the advantage of already being implemented, but managing
a series of diffs seems like a trivial shell script.

So in my opinion, diff/bzip2 or xdelta looks like the way to go.

Example: Diffs between the unstable main Packages files from Apr 6 and 7
-------------------------------------------------------------------------
  diff+bzip2:  12987 bytes
  diff+gzip:   13890 bytes
  xdelta:      15176 bytes
  rsync:      163989 bytes (*)

(*) rsyncing uncompressed Packages files with
    rsync -a --no-whole-file --stats -z -e ssh

The Scheme (proposed earlier by others)
---------------------------------------
Assuming debian admins tend to update relatively frequently, the
following diff-based scheme seems to offer the best compromise between
mirror disk space and download bandwidth:

For the 20 most recent Packages files, the server stores the diff
between each pair of consecutive Packages files. apt-get then simply
does:

    do {
        m = md5sum of my Packages file
        d = fetch Packages_diff_${m}.bz2
        if (d does not exist) {
            fetch Packages.gz
            break;
        }
        patch my Packages file with d
    } while (d is not an empty file)

This scheme easily allows mirrors to tweak the parameters to best suit
their own disk space and bandwidth limitations, and they are not
required to have any cgi-scripts or extra services running. For
example, a mirror that's tight on disk space can just delete some of
the older diffs, but it will incur a slight bandwidth penalty as a
result.

The only disadvantage of using diffs (compared to rsync or some other
dynamic scheme) is the additional disk space requirement. That
requirement is very small, though, and disk space is cheaper than CPU
time, memory, and bandwidth.

Analysis:
---------
The analysis uses gp, a great math tool that's available in debian.

I. Diff vs. xdelta
------------------
By looking at debian-devel-changes, I figured that between Feb. 1 and
April 1, an average of 75 packages were uploaded each day. There are
around 8000 packages listed in testing main, so the probability that
any given package changes on any given day is p = 75/8000. Thus the
expected number of packages that change in s days is

    (1 - (1-p)^s) * 8000

For example, the expected number of changed packages in 60 days is
3453. Comparing the Packages files from Feb. 7 and April 7 shows 3884
changed packages, so the model seems reasonably accurate.

My experiments with diff, xdelta, bzip2, etc. concluded that if you
diff two Packages files with 75 changed packages between them, and then
compress the diff with bzip2 -9, the resulting file is about 7936
bytes, or roughly 106 bytes per changed package. Thus the average size
of a compressed diff between Packages files separated by s days is

    diffsize(s) = (1 - (1-p)^s) * 8000 * 106

The xdelta of the same files is about 25% larger. If this scheme is
extended to all deb files, not just Packages files, it may be more
convenient to use xdelta, though.
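As a sanity check, the model above can be typed straight into gp. This
is a minimal transcript (the name expected() is mine, for the inline
formula above; printed values are rounded):

    \\ change model from section I
    p = 75/8000.;                         \\ P(given package changes on a given day)
    expected(s) = (1 - (1-p)^s) * 8000;   \\ expected changed packages after s days
    diffsize(s) = expected(s) * 106;      \\ expected bytes of a bzip2'd diff spanning s days

    print(expected(60))    \\ ~3453, vs. 3884 observed between Feb. 7 and Apr. 7
    print(diffsize(1))     \\ 7950 bytes, close to the 7936 measured with bzip2 -9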
II. Successive diffs vs. all-at-once diffs
------------------------------------------
This analysis applies to either diff or xdelta; it doesn't matter
which.

A. Disk space

The next question is: should we diff consecutive Packages files, or
should we compute diffs between each of the last 20 Packages files and
today's Packages file? The latter allows apt-get to fetch just one diff
and be done with it; however, it uses more disk space on the mirrors.
The former may require apt-get to fetch several patches in order to
update its Packages file, but uses less disk space on the mirrors.

There is actually a spectrum of choices here. A mirror may store diffs
between every 3rd, 4th, 5th, etc. Packages file. So, if a client has
the Packages file from 14 days ago, it will first be given a patch
bringing its Packages file to 9 days ago, then to 4, and then to the
current Packages file.

If a server stores diffs between Packages files separated by s days,
and keeps them going back d days, then it needs

    dspace(s,d) = sum(x=1,s,diffsize(x)) + (d-s)*diffsize(s)

bytes of disk space. So, for example, to store 20 days of consecutive
patches, the server needs 159K. For 20 days of all-at-once diffs, it
needs 1.6M. For 5-days-at-a-time diffs, it needs 702K.

B. Bandwidth

Consecutive patches save disk space, but clients have to download more
patches, requiring more bandwidth. If a user performs an update after
having gone i days without one, then the amount of bandwidth she uses
is

    bwidth(s,d,i) = if(i<=s, 1+diffsize(i),
                     if(i>d,  npkgs*200,
                              1+diffsize(s)+bwidth(s,d,i-s)))

(I charge the user 1K of overhead for each separate file she has to
download.)

So in order to figure out how much bandwidth each scheme requires on
average, we need a model of how frequently users perform apt-get
update. I chose a simple geometric model, where users update every i
days with probability (2/3)^i / 2. I would be very grateful if some
mirror maintainers could get me more accurate statistics on this.

Anyway, using this crude model, the expected bandwidth required by any
user is

    ebwidth(s,d) = sum(x=1,d,prob(x)*bwidth(s,d,x))
                 + (1-sum(x=1,d,prob(x)))*bwidth(s,d,d+1)

So for the consecutive-patches scheme, the average bandwidth used by
each user is 24.2K. For the all-at-once patch scheme, the average
bandwidth is 23.8K. This is a minuscule bandwidth reduction at the cost
of explosive growth in disk space, so the consecutive-patches scheme
seems best.

This analysis really depends on the model of how frequently users
perform apt-get update. If I got that wrong, then the numbers are way
off. So please, if any mirror maintainers can send me logs of the
If-Modified-Since fields in Packages GET requests, that would be very
helpful. Thanks.
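Continuing the gp session above, sections II.A and II.B can be
evaluated directly. Two details are assumptions of mine, since the
formulas above leave them open: the value of npkgs (I take the 8000
from the model) and the unit of the 1K per-file overhead (I take 1024
bytes), so the bandwidth totals land in the same ballpark as, but not
exactly on, the 24.2K/23.8K quoted above:

    \\ disk space and bandwidth model from section II
    npkgs = 8000;                    \\ assumed; not pinned down in the post
    dspace(s,d) = sum(x=1,s,diffsize(x)) + (d-s)*diffsize(s);
    print(dspace(1,20))              \\ consecutive patches:  159000, i.e. ~159K
    print(dspace(20,20))             \\ all-at-once diffs:    ~1.6M
    print(dspace(5,20))              \\ 5-days-at-a-time:     ~702K

    prob(i) = (2/3.)^i / 2;          \\ P(a user updates after i days)
    bwidth(s,d,i) = if(i <= s, 1024 + diffsize(i), \
                    if(i > d,  npkgs*200, \
                               1024 + diffsize(s) + bwidth(s,d,i-s)));
    ebwidth(s,d) = sum(x=1,d, prob(x)*bwidth(s,d,x)) \
                 + (1 - sum(x=1,d, prob(x))) * bwidth(s,d,d+1);
    print(ebwidth(1,20))             \\ ~27K under these conventions (post: 24.2K)
    print(ebwidth(20,20))            \\ ~25K under these conventions (post: 23.8K)

Either way, the qualitative conclusion is the same: all-at-once diffs
buy only a small bandwidth saving at a large disk space cost.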
III. Rsync
----------
I used rsync to synchronize the unstable main Packages files from
April 6 and April 7. There were a large number of changed entries
between these two: about 190 changes. The bzip2-compressed diff was
under 13K. Using 'rsync -a --no-whole-file --stats -z -e ssh' to
synchronize the uncompressed Packages files, the total amount of
traffic was 164K. Clearly, rsync is much less efficient for this
application.

IV. Other factors
-----------------
Since rsync negotiates the diff on the fly, the amount of bandwidth it
uses will increase both as the number of packages in debian grows and
as the number of changing packages in debian grows. The size of a diff
depends only on the number of changed packages, not on the size of the
Packages file, so its advantage should increase over time.

The consecutive-patches scheme requires producing only one new patch
for every Packages file, whereas the all-at-once scheme requires
producing 20 new patches, so mirroring will be easier in the
consecutive-patches scheme. Mirrors can then merge these patches if
they wish to tweak the setup for their own bandwidth and disk space
constraints.

The diff scheme requires no new services to run on the mirrors. No
cgi-scripts, no rsync, nothing. And it works over any transport that
already works with apt-get.

Some have suggested splitting up the Packages file. I haven't done any
experiments on this approach, but my guess is that fetching lots of
little files will incur way too much overhead. After all, 24K is hard
to beat.

If debian started producing xdeltas between successive versions of deb
files, then the bandwidth demand on mirrors could probably be reduced
by a factor of 50. Not as many diffs are required for debs, since they
change more slowly than the Packages file, so I estimate that storing
diffs for all the packages in debian would increase the disk space
required on a mirror by less than 5% (probably much less). A 50-fold
cut in bandwidth for a 5% increase in disk usage is probably a win.

Finally, the diffs are really small. For example, including mozilla in
debian is a much greater burden on mirror sites than adopting this
patch policy would be, but nobody worries about mirror disk space when
adding new software to debian.

Thanks for the greatest operating system on earth. I hope my
experiments and analysis are helpful. Please let me know if you have
any questions.

Best,
Rob

P.S. I know this has to wait till after the release. I just thought it
was a cool thread.