On Wed, 2002-04-10 at 04:35, Michael Bramer wrote:
> > Scheme                         Disk space         Bandwidth
> > -----------------------------------------------------------
> > Checksums (bwidth optimal)            26K               81K
> > diffs (4 days)                        32K              331K
> > diffs (9 days)                        71K               66K
> > diffs (20 days)                      159K               27K
> 
> can you explain your counts?

Sure.  At the end of this message is a script that you can use with the
program "gp" to recreate my numbers.  Here's a quick description:

Anthony Towns said that the average size of a diff between two Packages
files is 12K (after compression with bzip2).  So if the server keeps d
days of diffs, this will take about d*12K of disk space.  If I go for i
days without updating, then when I do update, if i <= d, then I will
need to download i diff files, using about i*(12K + 1K) bandwidth.  (The
1K is for each GET request, since I'm downloading i files).  If i > d,
then I need to get the whole Packages.gz file, which I estimate as about
1.6M.  So let bwidth(d,i) = the amount of bandwidth used doing an update
after i days, where the server has kept d days of diffs.

So how much bandwidth is used _on average_?  Well, it depends on how
often everybody updates.  If everybody updates everyday, then everybody
would just need to download 1 diff, using 13K.  If everybody updates
every week, then the average bandwidth is 7*13K=91K.  In reality, we
don't know how often people update, but my guess is that people tend to
update often.  So I just guessed that the probability that someone
updates after i days is prob(i)=((2/3)^i)/2.  Why this formula?  It
seemed good at the time.  So then the average bandwidth used is

average_bwidth(d)=sum i=1,...,infinity of bwidth(d,i) * prob (i)

That's it for the diff stuff.

For the checksum scheme, the disk space required is the number of
checksums times the size of each checksum.  The number of checksums is
the size of Packages.gz divided by the block size.  Since the checksum
file has to be transferred to every client, the size of the checksum
file contributes to the bandwidth estimate, as well.  Additionally, I
estimate that 75 packages change in debian every day (derived by looking
at debian-devel-changes in Feb. and March).  Using a little probability,
I computed the average number of blocks in Packages.gz that will change
in i days.  I then estimate that each of these blocks will have to be
transferred during an update, and use that to estimate the amount of
bandwidth required for an update.  Then I average, as with the diff
scheme.  Let me know if you think there's any problems with this..

I've been playing around recently with a more realistic (in my opinion)
user model.  I now predict that the probability that a user will update
every i days is n/(i+1)^3 (n is a normalization factor).  I like this
model because it predicts that if a user hasn't updated in a long time,
it'll probably be a long time before they update.  This seems
intuitively correct to me.  So here's some new numbers comparing the
diff scheme and the rsync scheme in this new user model.  In my opinion,
diff still wins.

These numbers use prob(i)=(n/(i+1)^3)/i, so these numbers are the
average bandwidth, averaged over what the server sees.  For an
explanation of this, see
http://lists.debian.org/debian-devel/2002/debian-devel-200204/msg01076.html.

Diffs:
days    dspace          ebwidth
-------------------------------
1       12.000K         296.80K
2       24.000K         110.70K
3       36.000K         58.800K
4       48.000K         39.100K
5       60.000K         30.000K
6       72.000K         25.300K
7       84.000K         22.600K
8       96.000K         21.000K
9       108.00K         19.900K
10      120.00K         19.200K
11      132.00K         18.700K
12      144.00K         18.400K
13      156.00K         18.100K
14      168.00K         17.900K
15      180.00K         17.800K

Checksum files:
bsize   dspace          ebwidth
-------------------------------
20      312.50K         315.40K
40      156.30K         161.10K
60      104.20K         111.00K
80      78.100K         86.800K
100     62.500K         73.100K
120     52.100K         64.600K
140     44.600K         59.100K
160     39.100K         55.500K
180     34.700K         53.000K
200     31.300K         51.500K
220     28.400K         50.500K
240     26.000K         50.100K
260     24.000K         50.000K
280     22.300K         50.100K
300     20.800K         50.500K
320     19.500K         51.100K
340     18.400K         51.800K
360     17.400K         52.700K
380     16.400K         53.700K
400     15.600K         54.700K


Best,
Rob

/***********************
 * Info about debian
 ***********************/

/* The number of packages in debian */
npkgs=8000.0

/* How big is Packages[.gz] as a function of the number of packages */
compressed_bytes_per_pkg=200.0
uncompressed_bytes_per_pkg=800.0
Packagesgz_size=npkgs*compressed_bytes_per_pkg
Packages_size=npkgs*uncompressed_bytes_per_pkg

/* The average number of packages that change each day */
changes_per_day=75.0

/* The probability that a given package will change on a given day. */
p=changes_per_day/npkgs

/*****************************
 * The probability that a given GET request has a If-Modified-Since
 * field that is i days old. 
 *****************************/

/* My lame approximation */
/* tprob(i)=1.0/(i+1)^3 */

tprob(i)=(2.0/3)^i

/* A guess derived from the survey on debianplanet.org */
/*
tprob(x)=1/x^1.25
*/

/* Normalize the probability defined above */
tpn=sum(i=1,100000,tprob(i))
prob(i)=tprob(i)/tpn

/**************************
 * Info about http/ftp queries
 **************************/

/* The overhead of requesting a file download.
   This is definitely an overestimate. */
query_size=1000.0

/***************************
 * utility functions
 ***************************/

/* Convert bytes to KB, rounded */
toK(x)=round(10.0*x/1024)/10.0

/***************************
 * Diff method 
 ***************************/

/* From Anthony Towns: the average size of a bzip2 compressed 
   diff -ed between two successive Packages files is 12K. */
diff_bytes_per_change=12.0*1024/changes_per_day

/* Expected size of a compressed diff between two
   Packages files seperated by s days. */
diff_size(s) = (1-(1-p)^s)*npkgs*diff_bytes_per_change

/* Amount of diskspace required to keep d days of diffs,
   each diff spanning s days. */
diff_diskspace(s, d) = sum(x=1,s,diff_size(x))+(d-s)*diff_size(s)

/* Amount of bandwidth used by a user performing an 
   'apt-get update' after i days */
diff_bwidth(s, d, i) = if (i<=s, query_size + diff_size(i),                     
                        \
                                 if (i>d, query_size +
Packagesgz_size,                                        \
                                          query_size + diff_size(s) +
diff_bwidth(s,d,i-s)))

/* Average amount of bandwidth used by a user */
diff_ebwidth(s, d) =
sum(x=1,d,prob(x)*diff_bwidth(s,d,x))+sum(x=d+1,180,prob(x))*diff_bwidth(s,d,d+1)

/* Display a table of the average bandwidth used for varying numbers of
days */
diff_table(s,m) = print ("days\tdspace\t\tebwidth");                            
                \
                  print ("-------------------------------");                    
                \
                  for (d=1, m, printp
(d,"\t",toK(diff_diskspace(s,d)),"K\t\t",                       \
                                             
toK(diff_ebwidth(s,d)),"K"))

/**************************
 * xdelta method
 **************************/

/* Basically the same as the diff method, but with a slightly larger
bytes_per_change */

/**************************
 * checksums method
 **************************/

/* Number of blocks. */
csum_nblocks(bs)=Packagesgz_size/bs

/* The size of the checksum file. */
csum_diskspace(cs,bs) = csum_nblocks(bs)*cs

/* The average number of package descriptions that will live in one
block */
csum_pkgs_per_block(bs)=bs/compressed_bytes_per_pkg

/* The probability that a given block changes after s days. */
csum_block_prob_changed(bs,s)=1-(1-p)^(s*csum_pkgs_per_block(bs))

/* The expected number of changed blocks after s days */
csum_changed_blocks(bs,s)=csum_block_prob_changed(bs,s)*csum_nblocks(bs)

/* The expected amount of bandwidth used by a user doing an update after
   i days. */
csum_bwidth(cs,bs,i)=query_size + csum_diskspace(cs,bs) +
csum_changed_blocks(bs,i) * bs

/* The average amount of bandwidth used by a user. */
csum_ebwidth(cs,bs)=sum(x=1,180,prob(x)*csum_bwidth(cs,bs,x))

csum_table(cs,minbs,maxbs,step)=print ("bsize\tdspace\t\tebwidth");             
                \
                                print
("-------------------------------");                    \
                                forstep(bs=minbs, maxbs, step,                  
                \
                                        printp (bs,"\t",
toK(csum_diskspace(cs,bs)), "K\t\t",    \
                                                        
toK(csum_ebwidth(cs,bs)),"K"))

/**************************
 * rsync method
 **************************/




-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]


Reply via email to