Re: [RFC] dynamic checksum size

Donovan Baarda Sat, 22 Mar 2003 22:44:30 -0800

On Sun, 2003-03-23 at 07:40, jw schultz wrote:
[...]
> The two attached patches implement per-file dynamic checksum
> lengths.


Nice one.
 
> Here is a quick table showing the file sizes at which
> transition occurs, the sum2 length and the jump in size of
> the block sum table transmitted:
> 
>       filesize     sum2len    transition
>       <125MB          2
>        125MB          3         8KB
>          1GB          4        32KB
>          4GB          5       128KB
>         16GB          6       523KB
>         64GB          7         2MB

My reading of the patch doesn't quite match with the above. The block
size heuristic is 1/10000 the filesize, limited between some miniumum
and maximum. Ignoring the limits, this means the number of blocks is
kept at a constant 10000, leaving you stuck at three bytes.

> This heuristic is a stab in the dark.  It may be that it is
> too aggressive in the rate of increase or insufficiently
> aggressive at starting the increase.  It may be that a 8
> bits of sum2 in conjunction with the 32 bit adler is enough
> for smaller files.  We may also wish to revisit the
> heuristic for block sizes that fixes the block count at 1000
> for files between 700KB and 16MB, or coordinate them so that
> the transitions would be easier.  Both heuristics can be
> changed at will without breaking compatibility.

I posted some more maths on this a while ago to the old rproxy list to
someone inquiring about how this affects librsync. In summary, a
conservative simplification of the relationship between block size,
blocksum size, and file size is;

  p = (s^2)/(B.(2^b))

where:
  
  p is the probablility of a false match (ie, corrupt result)
  s is the file size.
  B is the block size.
  b is the number of bits in the block checksum (blocksum size)

Note that blocksum size is the total blocksum including the rollsum. If
you plug in an "acceptable probability of failure", you get a direct
relationship between file size, block size, and blocksum size. Assuming
a 1/2^n probability of failure is OK, you get;

B = (s^2)/(2^(b-n))

b = n + 2.log2(s) - log2(B)

Note that assuming a fixed blocksum size, the block size must grow at
the square of the file size. Assuming a fixed bock size, the blocksum
must grow at the log of the filesize.

I think it makes sense to grow the block size linearly with file size
(which keeps a constant number of blocks), and then have blocksum size
grow as necessary to make up the difference by default. It might also be
good to have the ability to specify either blocksum size or block size
and have the other estimated using the above equations. Of course you
will want to round up the blocksum size to the nearest byte.

Using a heuristic that aims for 2^m number of blocks and plugging this
in to the above equations gives;

B = s/(2^m)
b = n + log2(s) + m   (Note: 32bits of this is the rollsum)

Assuming 8K number of blocks (m=13), and an acceptable error rate of
less than 1 in 1024 file transfers falling back to whole file (n=10),
you get;

file size       block size      blocksum size
<8M             1K              46 (sum2len=2)
<32M            4K              48 (sum2len=2)
<128M           16K             50 (sum2len=3)
<512M           64K             52 (sum2len=3)
<2G             256K            54 (sum2len=3)
<8G             1M              56 (sum2len=3)
<32G            4M              58 (sum2len=4)
<128G           16M             60 (sum2len=4)

As you can see from this, your heuristic is very conservative for large
files, but optimistic for smaller files. Might I suggest the following
code fragments;

#define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
#define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */

if (block_size) {
  block_len = block_size;
} else {
  block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */  
  block_len = MIN(block_len, BLOCK_SIZE);
  block_len = MAX(block_len, (CHUNK_SIZE / 2));
}

size_t  b = BLOCKSUM_EXP;
size_t  l = len;
while (l >> 1) {
        b+=2;
}
size_t  c = block_len;
while (c >> 1) {
  b--;
}
sum.s2length = (b+1-32+7)/8; /* add a bit, subtract rollsum, round up */


-- 
----------------------------------------------------------------------
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
----------------------------------------------------------------------

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html

Re: [RFC] dynamic checksum size

Reply via email to