On Sun, 2003-03-23 at 07:40, jw schultz wrote: [...] > The two attached patches implement per-file dynamic checksum > lengths.
Nice one. > Here is a quick table showing the file sizes at which > transition occurs, the sum2 length and the jump in size of > the block sum table transmitted: > > filesize sum2len transition > <125MB 2 > 125MB 3 8KB > 1GB 4 32KB > 4GB 5 128KB > 16GB 6 523KB > 64GB 7 2MB My reading of the patch doesn't quite match with the above. The block size heuristic is 1/10000 the filesize, limited between some miniumum and maximum. Ignoring the limits, this means the number of blocks is kept at a constant 10000, leaving you stuck at three bytes. > This heuristic is a stab in the dark. It may be that it is > too aggressive in the rate of increase or insufficiently > aggressive at starting the increase. It may be that a 8 > bits of sum2 in conjunction with the 32 bit adler is enough > for smaller files. We may also wish to revisit the > heuristic for block sizes that fixes the block count at 1000 > for files between 700KB and 16MB, or coordinate them so that > the transitions would be easier. Both heuristics can be > changed at will without breaking compatibility. I posted some more maths on this a while ago to the old rproxy list to someone inquiring about how this affects librsync. In summary, a conservative simplification of the relationship between block size, blocksum size, and file size is; p = (s^2)/(B.(2^b)) where: p is the probablility of a false match (ie, corrupt result) s is the file size. B is the block size. b is the number of bits in the block checksum (blocksum size) Note that blocksum size is the total blocksum including the rollsum. If you plug in an "acceptable probability of failure", you get a direct relationship between file size, block size, and blocksum size. Assuming a 1/2^n probability of failure is OK, you get; B = (s^2)/(2^(b-n)) b = n + 2.log2(s) - log2(B) Note that assuming a fixed blocksum size, the block size must grow at the square of the file size. Assuming a fixed bock size, the blocksum must grow at the log of the filesize. I think it makes sense to grow the block size linearly with file size (which keeps a constant number of blocks), and then have blocksum size grow as necessary to make up the difference by default. It might also be good to have the ability to specify either blocksum size or block size and have the other estimated using the above equations. Of course you will want to round up the blocksum size to the nearest byte. Using a heuristic that aims for 2^m number of blocks and plugging this in to the above equations gives; B = s/(2^m) b = n + log2(s) + m (Note: 32bits of this is the rollsum) Assuming 8K number of blocks (m=13), and an acceptable error rate of less than 1 in 1024 file transfers falling back to whole file (n=10), you get; file size block size blocksum size <8M 1K 46 (sum2len=2) <32M 4K 48 (sum2len=2) <128M 16K 50 (sum2len=3) <512M 64K 52 (sum2len=3) <2G 256K 54 (sum2len=3) <8G 1M 56 (sum2len=3) <32G 4M 58 (sum2len=4) <128G 16M 60 (sum2len=4) As you can see from this, your heuristic is very conservative for large files, but optimistic for smaller files. Might I suggest the following code fragments; #define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */ #define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */ if (block_size) { block_len = block_size; } else { block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */ block_len = MIN(block_len, BLOCK_SIZE); block_len = MAX(block_len, (CHUNK_SIZE / 2)); } size_t b = BLOCKSUM_EXP; size_t l = len; while (l >> 1) { b+=2; } size_t c = block_len; while (c >> 1) { b--; } sum.s2length = (b+1-32+7)/8; /* add a bit, subtract rollsum, round up */ -- ---------------------------------------------------------------------- ABO: finger [EMAIL PROTECTED] for more info, including pgp key ---------------------------------------------------------------------- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html