On Wed, 2002-04-10 at 04:35, Michael Bramer wrote:
> > Scheme                       Disk space   Bandwidth
> > -----------------------------------------------------------
> > Checksums (bwidth optimal)       26K          81K
> > diffs (4 days)                   32K         331K
> > diffs (9 days)                   71K          66K
> > diffs (20 days)                 159K          27K
>
> can you explain your counts?
Sure.  At the end of this message is a script that you can use with the
program "gp" to recreate my numbers.  Here's a quick description:

Anthony Towns said that the average size of a diff between two Packages
files is 12K (after compression with bzip2).  So if the server keeps d
days of diffs, this will take about d*12K of disk space.

If I go for i days without updating, then when I do update, if i <= d, I
will need to download i diff files, using about i*(12K + 1K) of
bandwidth.  (The 1K is for each GET request, since I'm downloading i
files.)  If i > d, then I need to get the whole Packages.gz file, which
I estimate at about 1.6M.  So let bwidth(d,i) = the amount of bandwidth
used doing an update after i days, where the server has kept d days of
diffs.

So how much bandwidth is used _on average_?  Well, it depends on how
often everybody updates.  If everybody updates every day, then everybody
would just need to download 1 diff, using 13K.  If everybody updates
every week, then the average bandwidth is 7*13K = 91K.  In reality, we
don't know how often people update, but my guess is that people tend to
update often.  So I just guessed that the probability that someone
updates after i days is prob(i) = ((2/3)^i)/2.  Why this formula?  It
seemed good at the time.  So then the average bandwidth used is

  average_bwidth(d) = sum over i=1,...,infinity of bwidth(d,i) * prob(i)

That's it for the diff stuff.

For the checksum scheme, the disk space required is the number of
checksums times the size of each checksum.  The number of checksums is
the size of Packages.gz divided by the block size.  Since the checksum
file has to be transferred to every client, the size of the checksum
file contributes to the bandwidth estimate as well.  Additionally, I
estimate that 75 packages change in Debian every day (derived by looking
at debian-devel-changes in Feb. and March).  Using a little probability,
I computed the average number of blocks in Packages.gz that will change
in i days.
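The diff-scheme expectation above can be sketched in Python.  This is my
re-expression of the model, not the gp script at the end of this message:
the 12K diff, 1K per-GET overhead, and 1.6M Packages.gz figures are from
the text, but this simplified version skips the diff-size saturation the
script models, so it only roughly tracks the quoted table.

```python
# Expected "apt-get update" bandwidth for the diff scheme, in KB.

DIFF_KB = 12.0    # average bzip2-compressed diff between successive Packages files
GET_KB = 1.0      # overhead of one GET request
FULL_KB = 1600.0  # size of the whole Packages.gz

def prob(i):
    """P(a user updates after i days): the ((2/3)^i)/2 guess from the text.
    Conveniently, this is already normalized: it sums to 1 over i >= 1."""
    return ((2.0 / 3.0) ** i) / 2.0

def bwidth(d, i):
    """Bandwidth (KB) for an update after i days, with d days of diffs kept."""
    if i <= d:
        return i * (DIFF_KB + GET_KB)  # i diffs, one GET each
    return FULL_KB + GET_KB            # diffs expired: fetch the whole file

def average_bwidth(d, horizon=1000):
    """average_bwidth(d) = sum over i of bwidth(d, i) * prob(i), truncated."""
    return sum(bwidth(d, i) * prob(i) for i in range(1, horizon + 1))
```

For d=4 this gives about 337K, in the same ballpark as the 331K in the
quoted table; the gap comes from the more careful diff-size handling in
the gp script.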
I then estimate that each of these blocks will have to be transferred
during an update, and use that to estimate the amount of bandwidth
required for an update.  Then I average, as with the diff scheme.

Let me know if you think there are any problems with this.

I've been playing around recently with a more realistic (in my opinion)
user model.  I now predict that the probability that a user will update
every i days is n/(i+1)^3 (n is a normalization factor).  I like this
model because it predicts that if a user hasn't updated in a long time,
it'll probably be a long time before they update.  This seems
intuitively correct to me.

So here are some new numbers comparing the diff scheme and the rsync
scheme under this new user model.  In my opinion, diff still wins.
These numbers use prob(i) = (n/(i+1)^3)/i, so they are the average
bandwidth, averaged over what the server sees.  For an explanation of
this, see
http://lists.debian.org/debian-devel/2002/debian-devel-200204/msg01076.html.

Diffs:

days	dspace		ebwidth
-------------------------------
1	12.000K		296.80K
2	24.000K		110.70K
3	36.000K		58.800K
4	48.000K		39.100K
5	60.000K		30.000K
6	72.000K		25.300K
7	84.000K		22.600K
8	96.000K		21.000K
9	108.00K		19.900K
10	120.00K		19.200K
11	132.00K		18.700K
12	144.00K		18.400K
13	156.00K		18.100K
14	168.00K		17.900K
15	180.00K		17.800K

Checksum files:

bsize	dspace		ebwidth
-------------------------------
20	312.50K		315.40K
40	156.30K		161.10K
60	104.20K		111.00K
80	78.100K		86.800K
100	62.500K		73.100K
120	52.100K		64.600K
140	44.600K		59.100K
160	39.100K		55.500K
180	34.700K		53.000K
200	31.300K		51.500K
220	28.400K		50.500K
240	26.000K		50.100K
260	24.000K		50.000K
280	22.300K		50.100K
300	20.800K		50.500K
320	19.500K		51.100K
340	18.400K		51.800K
360	17.400K		52.700K
380	16.400K		53.700K
400	15.600K		54.700K

Best,
Rob

/***********************
 * Info about debian
 ***********************/

/* The number of packages in debian */
npkgs=8000.0

/* How big is Packages[.gz] as a function of the number of packages */
compressed_bytes_per_pkg=200.0
uncompressed_bytes_per_pkg=800.0
Packagesgz_size=npkgs*compressed_bytes_per_pkg
Packages_size=npkgs*uncompressed_bytes_per_pkg

/* The average number of packages that change each day */
changes_per_day=75.0

/* The probability that a given package will change on a given day. */
p=changes_per_day/npkgs

/*****************************
 * The probability that a given GET request has an If-Modified-Since
 * field that is i days old.
 *****************************/

/* My lame approximation */
/* tprob(i)=1.0/(i+1)^3 */
tprob(i)=(2.0/3)^i

/* A guess derived from the survey on debianplanet.org */
/* tprob(x)=1/x^1.25 */

/* Normalize the probability defined above */
tpn=sum(i=1,100000,tprob(i))
prob(i)=tprob(i)/tpn

/**************************
 * Info about http/ftp queries
 **************************/

/* The overhead of requesting a file download.  This is definitely an
   overestimate. */
query_size=1000.0

/***************************
 * utility functions
 ***************************/

/* Convert bytes to KB, rounded to one decimal place */
toK(x)=round(10.0*x/1024)/10.0

/***************************
 * Diff method
 ***************************/

/* From Anthony Towns: the average size of a bzip2-compressed diff
   between two successive Packages files is 12K. */
diff_bytes_per_change=12.0*1024/changes_per_day

/* Expected size of a compressed diff between two Packages files
   separated by s days. */
diff_size(s) = (1-(1-p)^s)*npkgs*diff_bytes_per_change

/* Amount of diskspace required to keep d days of diffs, each diff
   spanning s days.
*/
diff_diskspace(s, d) = sum(x=1,s,diff_size(x))+(d-s)*diff_size(s)

/* Amount of bandwidth used by a user performing an 'apt-get update'
   after i days */
diff_bwidth(s, d, i) = if (i<=s, query_size + diff_size(i), \
    if (i>d, query_size + Packagesgz_size, \
        query_size + diff_size(s) + diff_bwidth(s,d,i-s)))

/* Average amount of bandwidth used by a user */
diff_ebwidth(s, d) = sum(x=1,d,prob(x)*diff_bwidth(s,d,x))+sum(x=d+1,180,prob(x))*diff_bwidth(s,d,d+1)

/* Display a table of the average bandwidth used for varying numbers of
   days */
diff_table(s,m) = print ("days\tdspace\t\tebwidth"); \
    print ("-------------------------------"); \
    for (d=1, m, printp (d,"\t",toK(diff_diskspace(s,d)),"K\t\t", \
        toK(diff_ebwidth(s,d)),"K"))

/**************************
 * xdelta method
 **************************/

/* Basically the same as the diff method, but with a slightly larger
   bytes_per_change */

/**************************
 * checksums method
 **************************/

/* Number of blocks. */
csum_nblocks(bs)=Packagesgz_size/bs

/* The size of the checksum file. */
csum_diskspace(cs,bs) = csum_nblocks(bs)*cs

/* The average number of package descriptions that will live in one
   block */
csum_pkgs_per_block(bs)=bs/compressed_bytes_per_pkg

/* The probability that a given block changes after s days. */
csum_block_prob_changed(bs,s)=1-(1-p)^(s*csum_pkgs_per_block(bs))

/* The expected number of changed blocks after s days */
csum_changed_blocks(bs,s)=csum_block_prob_changed(bs,s)*csum_nblocks(bs)

/* The expected amount of bandwidth used by a user doing an update
   after i days. */
csum_bwidth(cs,bs,i)=query_size + csum_diskspace(cs,bs) + csum_changed_blocks(bs,i) * bs

/* The average amount of bandwidth used by a user.
*/
csum_ebwidth(cs,bs)=sum(x=1,180,prob(x)*csum_bwidth(cs,bs,x))

csum_table(cs,minbs,maxbs,step)=print ("bsize\tdspace\t\tebwidth"); \
    print ("-------------------------------"); \
    forstep(bs=minbs, maxbs, step, \
        printp (bs,"\t", toK(csum_diskspace(cs,bs)), "K\t\t", \
            toK(csum_ebwidth(cs,bs)),"K"))

/**************************
 * rsync method
 **************************/

-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]
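P.S.  As a cross-check, here is the checksum-scheme expectation from the
script re-expressed in Python.  The package count, change rate, and
per-package sizes are copied from the script; the 4-byte checksum size
(CSUM_BYTES) is my assumption, chosen because it reproduces the dspace
column of the checksum table above.

```python
# Python re-expression of the checksum-scheme formulas in the gp script.

NPKGS = 8000.0
CHANGES_PER_DAY = 75.0
P = CHANGES_PER_DAY / NPKGS          # P(a given package changes on a day)
BYTES_PER_PKG = 200.0                # compressed bytes per package
PACKAGESGZ = NPKGS * BYTES_PER_PKG   # size of Packages.gz (~1.6M)
QUERY = 1000.0                       # per-GET overhead
CSUM_BYTES = 4.0                     # assumed size of one checksum (see note)

# Server-view user model from the text: prob(i) = (n/(i+1)^3)/i,
# normalized over a long horizon as the script does.
N = 1.0 / sum(1.0 / (x * (x + 1) ** 3) for x in range(1, 100001))

def prob(i):
    return N / (i * (i + 1) ** 3)

def changed_blocks(bs, i):
    """Expected number of bs-byte blocks of Packages.gz changed in i days."""
    p_block = 1.0 - (1.0 - P) ** (i * bs / BYTES_PER_PKG)
    return p_block * (PACKAGESGZ / bs)

def csum_bwidth(bs, i):
    """One GET, plus the whole checksum file, plus the changed blocks."""
    return QUERY + (PACKAGESGZ / bs) * CSUM_BYTES + changed_blocks(bs, i) * bs

def csum_ebwidth(bs):
    """Average bandwidth in bytes, truncated at 180 days as in the script."""
    return sum(prob(i) * csum_bwidth(bs, i) for i in range(1, 181))
```

For a 300-byte block size this gives about 50.4K against the 50.500K in
the table, so the simplified model and the gp script agree closely here.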