On Tue, Oct 15, 2002 at 02:25:00AM +1000, Donovan Baarda wrote:
> On Mon, Oct 14, 2002 at 06:22:36AM -0700, jw schultz wrote:
> > On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote:
> [...]
> > > Does the first pass signature block checksum really only use 2 bytes of the
> > > md4sum? That seems pretty damn small to me. For 100M~1G you need at least
> > > 56bits, for 1G~10G you need 64bits. If you go above 10G you need more than
> > > 64bits, but you should probably increase the block size as well/instead.
> >
> > It is worth remembering that increasing the block size with
> > a fixed checksum size increases the likelihood of two
> > unequal blocks having the same checksums.
>
> I haven't done the maths, but I think the difference this makes is
> negligible, and is far outweighed by the fact that a larger block size
> means less blocks.

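For anyone who wants to play with the numbers being argued over, here is a rough sketch of the arithmetic. It is not what rsync does and it is not necessarily how the figures quoted above were derived. It assumes a simple model in which every byte offset of the new file is compared against every block signature of the old file, and in which the combined checksum behaves like a uniformly random value, so a single comparison of unequal data collides with probability 2^-bits. The function names and the 2^-10 target are made up for illustration.

```python
import math


def comparisons(file_size, block_size):
    """Number of (offset, block) pairs tested under the assumed model."""
    num_blocks = math.ceil(file_size / block_size)
    return file_size * num_blocks


def expected_false_matches(file_size, block_size, checksum_bits):
    """Expected count of unequal blocks accepted as equal for one file.

    Assumption: each comparison collides with probability 2**-checksum_bits,
    independent of the block size.
    """
    return comparisons(file_size, block_size) / 2.0 ** checksum_bits


def bits_needed(file_size, block_size, max_expected=2.0 ** -10):
    """Smallest checksum width (bits) keeping the expectation at or below
    max_expected. The 2**-10 default is an arbitrary illustrative target."""
    return math.ceil(math.log2(comparisons(file_size, block_size) / max_expected))


if __name__ == "__main__":
    for size in (100 * 2**20, 2**30, 10 * 2**30):
        for block in (700, 1400, 2800):
            print(size, block, bits_needed(size, block))
```

Under this model doubling the block size saves one bit and doubling the file size costs two. A checksum that is weaker than random on larger blocks would change that, which the model does not try to capture.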
We've just seen one face of the undersized checksum.  That
problem arises, after all, because we have unequal blocks
that have the same (truncated) checksums, and that is with
700 bytes being reduced to a 4 byte checksum.  Increasing
the block size without increasing the checksum size
increases the chance of unequal blocks having the same
checksum.

> > I think we want both the block and checksum sizes to
> > increase with file size.  Just increasing block size gains
> > diminishing returns but just increasing checksum size will
> > cause a non-linear increase in bandwidth requirement.
> > Increasing both in tandem is appropriate.  Larger files call
> > for larger blocks and larger blocks deserve larger
> > checksums.
> >
> > I do think we want a ceiling on block size unless we layer
> > the algorithm.  The idea of transmitting 300K because a 4K
> > block in a 2GB DB file was modified is unsettling.
>
> I think that command-line overrides are critical. Just as you can force a
> blocksize, you should be able to force a sigsumsize. However the defaults
> should be reasonable.

We are finding that fixed size defaults are not reasonable
and that the variability really should be per-file.  Under
such conditions the fixed command-line overrides do more
harm than good.  Command-line overrides that modify the
heuristics (--blocksize-gamma, --checksum-threshold) are
more suitable.

--
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		[EMAIL PROTECTED]

		Remember Cernan and Schmitt
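To make the per-file heuristic idea concrete, here is a minimal sketch of what such a rule might look like. Everything in it is hypothetical: `choose_sizes`, the way `gamma` and `threshold` are interpreted, the 16K ceiling and the "one extra byte per fourfold growth" step are illustrative guesses at what options like --blocksize-gamma and --checksum-threshold could control, not existing rsync behaviour. The 700 byte block and 2 byte checksum minimums come from the discussion above, and 16 bytes is the full MD4 digest length.

```python
def choose_sizes(file_size, gamma=0.5, min_block=700, max_block=16384,
                 threshold=64 * 1024 * 1024, min_sum=2, max_sum=16):
    """Pick (block_size, checksum_bytes) for one file. Hypothetical heuristic.

    block_size grows as file_size**gamma and is clamped to
    [min_block, max_block], so there is a ceiling on block size.
    checksum_bytes stays at min_sum up to `threshold`, then gains one byte
    for every fourfold growth in file size, capped at max_sum.
    """
    block = int(file_size ** gamma) if file_size > 0 else min_block
    block = max(min_block, min(block, max_block))

    extra = 0
    limit = threshold
    while file_size > limit and min_sum + extra < max_sum:
        extra += 1
        limit *= 4

    return block, min_sum + extra


if __name__ == "__main__":
    for size in (10_000, 50 * 2**20, 2**31):
        print(size, choose_sizes(size))
```

With both values derived from the file size, a command-line option only has to nudge `gamma` or `threshold` instead of pinning one block size and one checksum size for every file in the transfer.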