On Wed, 2002-04-10 at 02:28, Anthony Towns wrote: 
> I think you'll find you're also unfairly weighting this against people
> who do daily updates. If you do an update once a month, it's not as much
> of a bother waiting a while to download the Packages files -- you're
> going to have to wait _much_ longer to download the packages themselves.
> 
> I'd suggest your formula would be better off being:
> 
>       bandwidthcost = sum( x = 1..30, prob(x) * cost(x) / x )
> 
> (If you update every day for a month, your cost isn't just one download,
> it's 30 downloads. If you update once a week for a month, your cost
> isn't that of a single download, it's four times that. The /x takes that
> into account)
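In code, the suggested measure looks like this (the uniform prob and flat cost below are hypothetical stand-ins, since neither function is pinned down in the thread):

```python
def bandwidth_cost(prob, cost, days=30):
    """bandwidthcost = sum(x = 1..days, prob(x) * cost(x) / x).
    Dividing by x weights interval x by download frequency: a user who
    updates every x days makes roughly 30/x downloads a month, so daily
    updaters count 30 times as heavily as monthly ones."""
    return sum(prob(x) * cost(x) / x for x in range(1, days + 1))

# Hypothetical example: uniform update intervals, flat 100K per download.
monthly_cost = bandwidth_cost(lambda x: 1 / 30, lambda x: 100.0)
assert 13 < monthly_cost < 14  # about (100/30) * H(30), H = harmonic sum
```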

I think it depends on what you're measuring.  I can think of two ways to
measure the "goodness" of these schemes (there are certainly others): 

1. What is the average bandwidth required at the server? 
2. What is the average bandwidth required at the client? 

The two questions are related: If users update after i days with
prob1(i), then the probability that a connection arriving at a server is
from a user updating after i days is 

prob2(i)=(prob1(i)/i)*norm, 

where norm is a normalization factor so the probabilities sum to 1. 
I've been looking at question 2, and you're suggesting that I look at
question 1, except you forgot the normalization factor.  I think this is
what you mean.  Please correct me if I've misunderstood. 
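As a sanity check, here is a small sketch of that relation; the uniform prob1 below is a hypothetical example, not a distribution from this thread:

```python
def server_view(prob1):
    """Given prob1(i), the probability a user updates after i days,
    return prob2(i) = (prob1(i)/i) * norm: the probability that a
    connection arriving at the server is from such a user.  norm is
    chosen so the prob2 values sum to 1."""
    unnorm = {i: p / i for i, p in prob1.items()}
    norm = 1.0 / sum(unnorm.values())
    return {i: u * norm for i, u in unnorm.items()}

# Hypothetical example: update intervals uniform over 1..30 days.
prob1 = {i: 1 / 30 for i in range(1, 31)}
prob2 = server_view(prob1)

# prob2 is a proper distribution, and frequent updaters dominate the
# server's view even though all intervals are equally likely per user.
assert abs(sum(prob2.values()) - 1.0) < 1e-9
assert prob2[1] == max(prob2.values())
```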

Anyway, here are the results you asked for.  I'm NOT including the
normalization factor, for easier comparison with your numbers.  My diff
numbers are a little different from yours, mainly because I charge 1K of
overhead for each file request.  In the tables below, dspace is the disk
space required on the server and ebwidth is the expected bandwidth per
update.

Diff scheme 
days    dspace          ebwidth
-------------------------------
1       12.000K         342.00K
2       24.000K         171.20K
3       36.000K         95.900K
4       48.000K         58.500K
5       60.000K         38.800K
6       72.000K         27.900K
7       84.000K         21.800K
8       96.000K         18.200K
9       108.00K         16.100K
10      120.00K         14.900K
11      132.00K         14.100K
12      144.00K         13.700K
13      156.00K         13.400K
14      168.00K         13.300K
15      180.00K         13.100K

Checksum file scheme with 4-byte checksums:
bsize   dspace          ebwidth
-------------------------------
20      312.50K         173.70K
40      156.30K         89.300K
60      104.20K         62.200K
80      78.100K         49.300K
100     62.500K         42.200K
120     52.100K         37.900K
140     44.600K         35.300K
160     39.100K         33.600K
180     34.700K         32.700K
200     31.300K         32.200K
220     28.400K         32.100K
240     26.000K         32.200K
260     24.000K         32.500K
280     22.300K         33.000K
300     20.800K         33.600K
320     19.500K         34.300K
340     18.400K         35.100K
360     17.400K         35.900K
380     16.400K         36.800K
400     15.600K         37.700K

I'm probably underestimating the bandwidth of the checksum file scheme. 
I'm pretty confident about the diff scheme estimates, though.

I think the performance of the two schemes is quite close.  Even though
these numbers look good for the checksum file scheme, I'm still partial
to the diff scheme because 

- The checksum file scheme bottoms out at 32K, but the diff scheme can
reduce transfers to 13K (using more disk space).

- I trust my estimates of the diff scheme more.  The checksum (rsync)
scheme will definitely take more bandwidth than my estimates predict.

- As Debian gets larger, the checksum files will get larger, and the
bandwidth required will grow with them.  So over time, any advantage of
the checksum file scheme will disappear.

- The diff scheme is more flexible and easier to tune.  The checksum
file scheme has a "sweet spot" at 220-byte blocks, and predicting the
actual location of that sweet spot may be hard in the real world.

Best,
Rob


