Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
G'day,

From: Wayne Davison [EMAIL PROTECTED]
> On Thu, Apr 08, 2004 at 03:50:48PM +1000, Donovan Baarda wrote:
> > I think I've just realised what you were getting at; if the
> > checksum_seed is based on something like the whole file md4sum, it
> > becomes repeatable, but unpredictable.
>
> Not so. Copy the file once, and you'd get all the data you'd need to
> create a new local file using a known-signature attack (as long as the
> input file didn't change, and that's easy to predict for many files).

I think between Eran Tromer and myself we have shown that for an md4 blocksum with 'n' bits and a file with 2^m blocks, you will have to try 2^(n-m) blocks to find a match to a known signature. For librsync's 64 bit strong_sum, even a 4G file will need 2^43 attempts, which is sufficiently hard. Sure, you have all the data you need, but that doesn't make it easy :-) Assuming no md4sum exploits are used...

> I also don't like the doubling of the I/O-cost on the sending side, so
> I don't think this is a good way to go.

I agree... a random seed gives as good or better protection without the double parse for the signature. However, it does mean you don't get the same signature for the same data. Perhaps there are some other ways to make the signature repeatable without requiring a double parse?

Is using the md4sum of the first block only as a seed secure enough? I don't think it is. I think any non-random signature seed needs to take into account the whole file for it to be secure, otherwise it reduces to the birthday algorithm for crafting clashes.

Donovan Baarda http://minkirri.apana.org.au/~abo/

--
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
Re: librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
Hi,

On 2004/04/05 07:21, Donovan Baarda wrote:
[snip]
> there are four ways crafted blocks can be used;
> 1) two crafted blocks in the original file
> 2) two crafted blocks in the target file
> 3) a crafted pair of target and original files with matching block(s)
> 4) a block in the target crafted to match a block in the original
[snip]
> Summary;
> case 2) has no impact
> case 4) is of minimal impact in rsync, and sufficiently hard in librsync.
> librsync needs a whole file checksum. Without it, it silently fails for
> case 1), 3), and 4).
> librsync could benefit from a random checksum_seed. It would need to be
> included in the signature. Without it librsync is vulnerable to cases
> 1) and 3).
[snip]
> rsync shouldn't need a fixed seed for batch modes... just store the
> seed in the signature. using a fixed seed makes it vulnerable to 1)
> and 3).

I fully agree with your analysis. I'll just note that in many situations, case 2 can be elevated to case 3 simply by transferring the file twice.

Eran
Re: librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
G'day,

From: Eran Tromer [EMAIL PROTECTED]
[...]
> > librsync needs a whole file checksum. Without it, it silently fails
> > for case 1), 3), and 4).
> > librsync could benefit from a random checksum_seed. It would need to
> > be included in the signature. Without it librsync is vulnerable to
> > cases 1) and 3).
> [snip]
> > rsync shouldn't need a fixed seed for batch modes... just store the
> > seed in the signature. using a fixed seed makes it vulnerable to 1)
> > and 3).
>
> I fully agree with your analysis. I'll just note that in many
> situations, case 2 can be elevated to case 3 simply by transferring
> the file twice.

Yeah... did you see my followup post about the possibility of using the whole-file checksum as the checksum_seed for the blocksums? I think it would be a good idea for librsync. It does require a double-parse to generate the signature, but is otherwise quite nice.

Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
On 2004/04/08 08:50, Donovan Baarda wrote:
> In some cases you might prefer to actually store a signed signature
> using something like GPG.

I think librsync should act as a black box that guarantees file integrity (which, apparently, requires a whole file checksum). If someone wants to add authentication or encryption or whatever on top of that, all the merrier, but let that be done in addition to rsync's own integrity check. That means both librsync and (say) GPG would compute a whole file hash. The communication cost of this duplication is negligible (a typical hash is much smaller than a typical digital signature). The computation is a bit more annoying -- it will roughly double the CPU consumption of the receiver. That can be solved by revealing the whole file checksum (when available) via the API, so that librsync users can directly sign that checksum instead of computing their own.

> if the checksum_seed is based on something like the whole file md4sum,
> it becomes repeatable, but unpredictable. You can't manipulate
> individual blocks without it affecting every other blocksum, but the
> signature for the same file is always the same. Nice :-)

Nice indeed, but the cost is enormous: you'll have to read the file twice. When syncing a mostly-unchanged file that's larger than the disk cache, that means doubling the runtime (and disk load) on the receiver's side. Also, it means 'rdiff signature' and equivalents won't work on arbitrary streams. Currently the receiver can tee the output of 'rdiff patch' into both the output file and an instance of 'rdiff signature', in order to save the IO of re-reading the file upon the next sync; this won't work anymore. (How about a built-in option for that, BTW?)

> More thoughts on using the wholefile sum as the seed; at first I
> thought this would still be vulnerable to case 3).
> Using any single block as the original file and trying to find any
> block that matches means you still have birthday algorithm numbers of
> blocks to check (ie 2^(n/2)). However, each block comparison requires
> the recalculation of the target blocksum using the original blocksum
> as the seed, resulting in non-birthday algorithm number of checksum
> calculations (ie, 2^n).

I'm afraid it's still vulnerable to case 3 (a pair of target and original files with matching blocks). For simplicity consider single-block files. In this case what you've done is simply to replace the hash function f(x) = truncate(MD4(x,fixed_seed)) with the hash function f'(x) = truncate(MD4(x,MD4(x,fixed_seed))) which happens to be the same as hashing two copies of x in sequence. But the birthday attack doesn't care what's the hash function; it only cares about its input and output sizes (which we didn't change) and that the function is sufficiently random.

Eran
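To make the birthday numbers concrete, here is a small Python sketch (illustration only, not librsync code, which is C; MD5 stands in for MD4 since hashlib has no portable MD4, and all names are hypothetical). It finds a collision in a k-bit truncated hash by random sampling; with k=24, a collision turns up after roughly 2^12 draws, exactly as the birthday paradox predicts:

```python
import hashlib, os

def trunc_hash(block: bytes, k_bits: int = 24) -> int:
    """k-bit truncated hash, standing in for rsync's truncated MD4 strong sum."""
    digest = hashlib.md5(block).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - k_bits)

def birthday_collision(k_bits: int = 24):
    """Draw random blocks until two distinct ones share a truncated hash.
    Expected number of draws is about 2**(k_bits/2), per the birthday paradox."""
    seen = {}
    tries = 0
    while True:
        block = os.urandom(64)
        tries += 1
        h = trunc_hash(block, k_bits)
        if h in seen and seen[h] != block:
            return seen[h], block, tries
        seen[h] = block

a, b, tries = birthday_collision(24)
assert a != b and trunc_hash(a) == trunc_hash(b)
print(f"collision after {tries} draws (expected around {2**12})")
```

Note that the attack never looks inside the hash function: only its output size matters, which is Eran's point above.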
Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
G'day again,

From: Eran Tromer [EMAIL PROTECTED]
[...]
> > if the checksum_seed is based on something like the whole file
> > md4sum, it becomes repeatable, but unpredictable. You can't
> > manipulate individual blocks without it affecting every other
> > blocksum, but the signature for the same file is always the same.
> > Nice :-)
>
> Nice indeed, but the cost is enormous: you'll have to read the file
> twice. When syncing a mostly-unchanged file that's larger than the
> disk cache, that means doubling the runtime (and disk load) on the
> receiver's side. Also, it means 'rdiff signature' and equivalents
> won't work on

But the vast majority of the load is in the delta calculation on the sender's side.

> arbitrary streams. Currently the receiver can tee the output of 'rdiff
> patch' into both the output file and an instance of 'rdiff signature',
> in order to save the IO of re-reading the file upon the next sync;
> this won't work anymore. (How about a built-in option for that, BTW?)

True... that is a nice feature... you are slowly turning me off the whole-file checksum as seed :-)

> > More thoughts on using the wholefile sum as the seed; at first I
> > thought this would still be vulnerable to case 3). Using any single
> > block as the original file and trying to find any block that matches
> > means you still have birthday algorithm numbers of blocks to check
> > (ie 2^(n/2)). However, each block comparison requires the
> > recalculation of the target blocksum using the original blocksum as
> > the seed, resulting in non-birthday algorithm number of checksum
> > calculations (ie, 2^n).
>
> I'm afraid it's still vulnerable to case 3 (a pair of target and
> original files with matching blocks). For simplicity consider
> single-block files. In this case what you've done is simply to replace
> the hash function f(x) = truncate(MD4(x,fixed_seed)) with the hash
> function f'(x) = truncate(MD4(x,MD4(x,fixed_seed)))

Not quite... it's f(x,y) = truncate(MD4(x,MD4(y,fixed_seed))), where x and y are the two blocks to be compared.
This means you have to re-calculate the hash for every compare, not just once for every block.

> which happens to be the same as hashing two copies of x in sequence.
> But the birthday attack doesn't care what's the hash function; it only
> cares about its input and output sizes (which we didn't change) and
> that the function is sufficiently random.

It also assumes that the hash is done once per sample input, not once per compare. Sure, you still only need to try 2^(n/2) blocks, but you need to calculate 2^n hashes, and that's what really hurts.

Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
Ahoy,

On 2004/04/08 14:16, Donovan Baarda wrote:
> > Nice indeed, but the cost is enormous: you'll have to read the file
> > twice. When syncing a mostly-unchanged file that's larger than the
> > disk cache, that means doubling the runtime (and disk load) on the
> > receiver's side. Also, it means 'rdiff signature' and equivalents
> > won't work on
>
> But the vast majority of the load is in the delta calculation on the
> sender's side.

My experience is that when you sync a mostly unchanged large file on modern PCs, both sides are IO-bound. The delta calculation just rolls along at top speed due to the "try the next block first" heuristic.

> > I'm afraid it's still vulnerable to case 3 (a pair of target and
> > original files with matching blocks). For simplicity consider
> > single-block files. In this case what you've done is simply to
> > replace the hash function f(x) = truncate(MD4(x,fixed_seed)) with
> > the hash function f'(x) = truncate(MD4(x,MD4(x,fixed_seed)))
>
> Not quite... it's f(x,y) = truncate(MD4(x,MD4(y,fixed_seed))), where x
> and y are the two blocks to be compared. This means you have to
> re-calculate the hash for every compare, not just once for every
> block.

Indeed, you're right. More fundamentally, every time you compute f(x,y) you win iff f(x,y)==f(y,y), otherwise you don't learn anything interesting. So you'll have to compute f about 2^n times. Yes, this looks secure when the hash function is perfectly random. The only reservation is that using the same user-affected seed to hash many user-determined blocks is uncomfortably reminiscent of MD4's known weaknesses.

Still, are there reasons beyond the aesthetic to want deterministic signature generation? The costs in IO and flexibility seem very high.

Eran
Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
On Thu, Apr 08, 2004 at 03:50:48PM +1000, Donovan Baarda wrote:
> I think I've just realised what you were getting at; if the
> checksum_seed is based on something like the whole file md4sum, it
> becomes repeatable, but unpredictable.

Not so. Copy the file once, and you'd get all the data you'd need to create a new local file using a known-signature attack (as long as the input file didn't change, and that's easy to predict for many files).

I also don't like the doubling of the I/O-cost on the sending side, so I don't think this is a good way to go.

..wayne..
Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
On Thu, 2004-04-08 at 12:36, Martin Pool wrote:
> On 5 Apr 2004, Donovan Baarda [EMAIL PROTECTED] wrote:
> > librsync needs a whole file checksum. Without it, it silently fails
> > for case 1), 3), and 4).
>
> Yes, a whole-file checksum should be used with it. Presumably
> something stronger than md4 like SHA-1.

md4 is probably good enough for most applications. I know it's not as secure as others, but when you take into account the requirement that the signature match as well, compromising it becomes much more complicated.

> I think the only question is whether this should be done internally in
> librsync, or as a separate process. I can see arguments either way. In
> some cases you might prefer to actually store a signed signature using
> something like GPG.

Yeah... good point. The rdiff program should probably use a whole file md4sum though.

> > librsync could benefit from a random checksum_seed. It would need to
> > be included in the signature. Without it librsync is vulnerable to
> > cases 1) and 3).
>
> Random with respect to what? I think it would be nice if repeatedly
> summing identical files gave identical signatures. Maybe it can vary
> depending on only the input data...

The problem is repeatability is what makes it vulnerable. If the content fully determines the signature, the content can be crafted to produce a particular signature. It is relatively easy to calculate two different blocks with the same blocksum if the checksum_seed is known.

librsync could be modified to detect when two blocks had the same blocksum in a signature, and permute the checksum seed in a deterministic way, retrying until a signature without collisions is found. This would give repeatable signatures and protect against case 1), but not case 3). It would also allow crafted files that force m/2 recalculations of the signature.

I think I've just realised what you were getting at; if the checksum_seed is based on something like the whole file md4sum, it becomes repeatable, but unpredictable.
You can't manipulate individual blocks without it affecting every other blocksum, but the signature for the same file is always the same. Nice :-)

This would fit in nicely with adding a whole-file checksum to the signature... generate the wholefile md4sum, and use that as the starting seed for the individual blocksums. The wholefile sum becomes the seed. This doesn't allow for permuting the seed if it happens to result in blocksum clashes. However, given that the probability of this happening is pretty astronomical (when using the wholefile sum seed) and it will be caught by a whole-file checksum mismatch, do we care? It is far more likely to get false block matches when calculating the delta (because sums are calculated at every byte boundary, not just at block boundaries).

More thoughts on using the wholefile sum as the seed; at first I thought this would still be vulnerable to case 3). Using any single block as the original file and trying to find any block that matches means you still have birthday algorithm numbers of blocks to check (ie 2^(n/2)). However, each block comparison requires the recalculation of the target blocksum using the original blocksum as the seed, resulting in non-birthday algorithm number of checksum calculations (ie, 2^n).

I've Cc'd this to the rsync list because they might get some ideas from it.

--
Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
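The wholefile-sum-as-seed scheme above can be sketched in a few lines of Python (a hypothetical illustration, not librsync's API, which is C; MD5 stands in for md4). The first pass computes the whole-file sum, which then seeds every per-block strong sum on the second pass, so identical files give identical signatures while any one-byte change perturbs every blocksum:

```python
import hashlib

def whole_file_seed_signature(data: bytes, block_size: int = 2048, sum_len: int = 8):
    """Sketch: whole-file sum (pass 1) seeds every truncated blocksum (pass 2)."""
    file_sum = hashlib.md5(data).digest()   # stands in for the whole-file md4sum
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    block_sums = [hashlib.md5(file_sum + b).digest()[:sum_len] for b in blocks]
    return file_sum, block_sums

# Same input, same signature; one changed byte changes every blocksum.
fs1, bs1 = whole_file_seed_signature(b"x" * 5000)
fs2, bs2 = whole_file_seed_signature(b"y" + b"x" * 4999)
assert fs1 != fs2 and all(a != b for a, b in zip(bs1, bs2))
```

The `data: bytes` argument also shows why this can't be done on arbitrary streams: the whole input must be available twice.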
librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
G'day again,

Just revisiting an old thread after some more thought. Eran and I were discussing the vulnerability of librsync and rsync to deliberate attempts to craft blocks with matching signatures but different content. It turns out it's disturbingly easy. Here's a bit of context;

From: Donovan Baarda [EMAIL PROTECTED]
> From: Eran Tromer [EMAIL PROTECTED]
> > Donovan Baarda wrote:
> > [...]
> > > I thought that without using some sort of known vulnerability the
> > > formula would be; avg number of random attempts = number of
> > > possible sum values / 2
> >
> > Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday
> > paradox. Roughly speaking, after looking at N blocks you have
> > N*(N-1)/2 unordered pairs of observed blocks, and each pair gives a
> > collision with probability 2^-64, so you need N=~2^32.
> [...]
> Doh! Of course... it was exactly the same paradox that showed how high
> the probability of a collision was in rsync for large files. You are
> not searching for a collision with a particular block, just between
> any two blocks.

I'm not entirely sure about rsync, but apparently such an attack will make it slow down, but it will not corrupt files. Librsync on the other hand will (currently) silently corrupt files.

Before librsync users panic; how much do you really care that a maliciously crafted file is corrupted? In rsync this can be used as a performance degradation attack. However, librsync by default uses a pseudo-random checksum_seed, which makes such an attack nearly impossible (however... the original discussion started around using a fixed checksum_seed in batch mode).

I've done a bit more thinking on this, and there are four ways crafted blocks can be used;
1) two crafted blocks in the original file
2) two crafted blocks in the target file
3) a crafted pair of target and original files with matching block(s)
4) a block in the target crafted to match a block in the original

For case 1), it is possible to detect collisions at signature generation time.
Using a random checksum_seed makes it nearly impossible. Combining the two, selecting a new random seed when a collision is detected, would make this pretty much impossible. Note that a random seed can be used for batch mode and librsync, but the seed would need to be included in the signature.

For case 2), the crafted blocks would not match the original, and hence would simply miss and be sent as literal data. Currently librsync and rsync don't do any sort of target self-referencing matches, so this cannot affect librsync or rsync. Note that there has been discussion on adding this to librsync, and it is supported by the vdelta format. When/if such a thing was added, similar conditions to 1) apply to the target matching algorithm. Any such addition should probably use xdelta style matching anyway, which compares the data, not a strong_sum, making it immune to crafted data.

For case 3), crafting a pair of files is made nearly impossible by using the protections for case 1). It may be possible to craft a file that has a reduced set of possible signatures even for a random seed, but this would require vulnerabilities in the strong_sum and/or random seed generator. Barring such a vulnerability, a random seed randomises the signature for the original, reducing this to case 4).

Case 4) is the nasty one. The protection against case 1) ensures that the signature is randomised for a given original, but once you have the signature, you know the random seed, and all elements of randomness in the rest of the algorithm are gone. However, this is not a true birthday paradox situation; you are not trying to find any two blocks that match, but a block that matches any block in the original. This is nowhere near as easy.
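The case 1) defence (re-roll the random seed whenever two blocks in a signature collide) can be sketched in Python (a hypothetical illustration, not librsync's C API; MD5 stands in for seeded MD4):

```python
import hashlib, os

def strong_sum(block: bytes, seed: bytes, n_bytes: int = 8) -> bytes:
    """Truncated seeded hash, standing in for librsync's seeded truncated MD4."""
    return hashlib.md5(seed + block).digest()[:n_bytes]

def make_signature(blocks, max_retries: int = 16):
    """Generate per-block strong sums under a random seed, re-rolling the
    seed if any two blocks collide.  The seed must be stored in the
    signature so the sender can reproduce the blocksums."""
    for _ in range(max_retries):
        seed = os.urandom(4)
        sums = [strong_sum(b, seed) for b in blocks]
        if len(set(sums)) == len(sums):   # no duplicate blocksums
            return seed, sums
    raise RuntimeError("could not find a collision-free seed")

seed, sums = make_signature([b"block one", b"block two", b"block three"])
assert len(set(sums)) == 3
```

With an honest random seed, a retry is astronomically unlikely; the loop only matters against crafted inputs.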
The number of attempts required is;

  n ~= 2^(b'-m)

where:
  n  is the number of attempts required to find a collision
  m  is from "there are 2^m blocks in the signature"
  b' is the number of useful bits in the signature (excluding rollsum)

Interestingly, the dynamic blocksum heuristic now in rsync is

  B = sqrt(s)
  b = 10 + 2*log2(s) - log2(B)

where:
  B is the block size
  s is the file size

32 bits of 'b' is the rollsum, which means, assuming the 32 bits of rollsum are useless and hence should not be included in b';

  m  = log2(s)/2
  b' = -22 + 2*log2(s) - log2(s)/2
     = -22 + (3/2)*log2(s)

so the number of attempts is

  n = 2^(-22 + log2(s))

so the larger the file, the harder it is to compromise, because b' grows faster than m. Note that rsync has a lower limit on b' of 16, which corresponds to a filesize of 32M. This equates to n = 2^3 attempts, which is rather low... in the realm of on-the-fly calculation time.

For librsync, which uses a fixed b'=64 and default B=2048, this becomes;

  n = 2^(64 - log2(s/(2^11)))
    = 2^(75 - log2(s))

which is 2^43 for a 4G file... not easy at all.

For rsync, a malicious sender could easily craft blocks that match the signature but have different content. Big deal. A malicious sender doesn't have to
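The arithmetic above can be checked mechanically. This Python sketch (function names are mine, not rsync's) reproduces the post's numbers: n ~= 2^3 for a 32M file under rsync's heuristic, and 2^43 for a 4G file under librsync's fixed 64-bit strong_sum:

```python
from math import log2

def rsync_attempts(s: int) -> float:
    """log2 of attempts under rsync's dynamic blocksum heuristic:
    B = sqrt(s), b = 10 + 2*log2(s) - log2(B), minus the 32 rollsum bits."""
    m = log2(s) / 2                                # 2^m blocks of size sqrt(s)
    b_strong = 10 + 2 * log2(s) - log2(s) / 2 - 32  # b' = -22 + (3/2)*log2(s)
    return b_strong - m                            # n = 2^(b' - m) = 2^(-22 + log2(s))

def librsync_attempts(s: int, B: int = 2048, b_strong: int = 64) -> float:
    """log2 of attempts for librsync's fixed 64-bit strong_sum."""
    m = log2(s / B)
    return b_strong - m                            # n = 2^(75 - log2(s)) for B=2048

assert round(rsync_attempts(32 * 2**20)) == 3      # 32M file -> ~2^3 attempts
assert round(librsync_attempts(4 * 2**30)) == 43   # 4G file  -> 2^43 attempts
```

Note how the two schemes scale oppositely with file size: rsync's n grows with s, while librsync's shrinks, because librsync's b' is fixed while its block count grows.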
Re: MD4 checksum_seed
Hi,

On 2004/03/17 00:07, Donovan Baarda wrote:
> > Quoth RFC 1950: "That 65521 is prime is important to avoid a
> > possible large class of two-byte errors that leave the check
> > unchanged." Not much of a difference in adversarial setting, though.
>
> And not much of a difference I think for rsync in general... I wonder
> just how large the "possible large class" is?

I don't know, and I don't have the relevant papers at hand. But here's an example, easily generalized: add 128 to the 513th byte and subtract 128 from the 512th byte (counting from the end and assuming no char wrap-around).

> > > It might be possible to craft a mapping like xdelta's but using a
> > > special sequence calculated for best effect, much like the
> > > polynomials in crc32 and crc64.
> >
> > Possibly, for natural or random data (as opposed to adversarial).
> > But in such contexts, random choice is often close to optimal with
> > high probability.

Clarification: in "such contexts" I meant lookup tables, not polynomials (you want to be really careful with the latter).

> I'll take your word for it :-). Actually I think the polynomials in
> crc are crafted for common line errors. I'm not sure you can do much
> crafting without making certain assumptions about the types of errors.
> In our case I don't think we can make assumptions about the types of
> errors.

I fully concur.

> Have you looked at the librsync rollsum.c implementation in CVS? It
> replaced rs_calc_weak_sum and should be a fair bit faster.

Here are run times for 7000 calculations over the same all-zero block. checksum.c is librsync's current code (which differs from rsync's only in that the chars are signed). rollsum.c is librsync's rollsum.c from CVS. XDELTA is checksum.c changed to add a char-to-short lookup table (I blindly converted every occurrence of buf[foo] into lookup_table[buf[foo]]; the lookup table is xdelta's). Numbers in seconds; tested on a Pentium Xeon 1700MHz, gcc 3.2.3.
  checksum.c -O2 CHAR_OFFSET=0:  273.42
  checksum.c -O2 CHAR_OFFSET=31: 300.63
  checksum.c -O3 CHAR_OFFSET=0:   31.42
  checksum.c -O3 CHAR_OFFSET=31:  35.32
  rollsum.c  -O2 CHAR_OFFSET=0:   99.47
  rollsum.c  -O2 CHAR_OFFSET=31:  99.22
  rollsum.c  -O3 CHAR_OFFSET=0:  100.18
  rollsum.c  -O3 CHAR_OFFSET=31:  99.77
  XDELTA     -O2:                386.07
  XDELTA     -O3:                 32.17

If you use the trivial one-char-at-a-time Fletcher algorithm (e.g., disable the first loop in lib/rsync's code and remember to initialize i=0) you get the following:

  checksum.c -O2 CHAR_OFFSET=0:  259.81
  checksum.c -O2 CHAR_OFFSET=31: 259.41
  checksum.c -O3 CHAR_OFFSET=0:  134.06
  checksum.c -O3 CHAR_OFFSET=31: 137.44
  XDELTA     -O2:                300.28
  XDELTA     -O3:                135.31

Some things to observe:
* The fastest variant is checksum.c with -O3.
* The extra cost of xdelta-type lookups is acceptable for -O2 and virtually nonexistent for -O3.
* Compared to checksum.c, rollsum.c is about three times faster when using -O2 and three times *slower* when using -O3.
* With the current default -O2, checksum.c's four-chars-at-a-time code is worse than the trivial code.
* Nonzero CHAR_OFFSET has a small but measurable cost.

BTW, in both rsync's and librsync's checksum.c, the loop

  for (i = 0; i < (len-4); i+=4) {

has an off-by-one limit. It should be i <= (len-4).

Eran
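For reference, here is the checksum being benchmarked, transcribed into Python as an executable restatement (the real code is C; names here are mine): the two-part sum with CHAR_OFFSET, plus the O(1) rolling update that slides the window one byte, which is what the four-chars-at-a-time loop above optimizes:

```python
CHAR_OFFSET = 31  # librsync's offset; rsync effectively uses 0

def fletcher(block: bytes) -> int:
    """Two-part Fletcher-style sum as in checksum.c (each part mod 2**16)."""
    s1 = s2 = 0
    for c in block:
        s1 = (s1 + c + CHAR_OFFSET) & 0xFFFF
        s2 = (s2 + s1) & 0xFFFF
    return (s2 << 16) | s1

def roll(sum32: int, out: int, inc: int, blocklen: int) -> int:
    """Slide the window one byte: drop byte `out`, append byte `inc`."""
    s1, s2 = sum32 & 0xFFFF, sum32 >> 16
    s1 = (s1 - out + inc) & 0xFFFF
    s2 = (s2 - blocklen * (out + CHAR_OFFSET) + s1) & 0xFFFF
    return (s2 << 16) | s1

# Rolling the sum along a buffer matches recomputing it from scratch.
data = b"the quick brown fox jumps"
L = 8
s = fletcher(data[0:L])
for i in range(1, len(data) - L + 1):
    s = roll(s, data[i - 1], data[i + L - 1], L)
    assert s == fletcher(data[i:i + L])
```

The rolling update is why this weak sum can be evaluated at every byte offset of the old file at essentially no cost during delta calculation.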
Re: MD4 checksum_seed
Donovan Baarda wrote:
> On Tue, 2004-03-16 at 10:44, Eran Tromer wrote: [snip]
> with an 8-byte hash that means you tested approximately 2^64/2 crafted
> fletcher busting blocks to find a collision, yeah? [snip] I thought
> that without using some sort of known vulnerability the formula would
> be; avg number of random attempts = number of possible sum values / 2

Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox. Roughly speaking, after looking at N blocks you have N*(N-1)/2 unordered pairs of observed blocks, and each pair gives a collision with probability 2^-64, so you need N=~2^32.

> I glanced at Gabriel Nivasch's stuff but I can't see how it could have
> shortcut it so much...

Nivasch's method provides a (small) constant factor improvement over other generic collision finding algorithms, such as Floyd's two-finger method. Nothing special there.

> I guess this assumes that the sum function when applied iteratively to
> itself the way Gabriel Nivasch suggests will walk through the entire
> sumspace before repeating. Does the reduced search space assume some
> sort of distribution of repeat loops in the sum algorithm?

Exactly. If you iterate a random function from k bits to k bits, you'll enter a loop in expected time sqrt(pi/2)*2^(k/2). Once you've detected a loop, you can back up and recover the collision. But note that this loopy behavior is utilized for reducing intermediate storage; if you could keep 2^32 hash values in memory and check every sample against all old ones, you wouldn't need to iterate anything or rely on loops.

> Wouldn't this be highly dependent on the checksum algorithm used?

Indeed, but in a random way. More exactly, for a random function the above analysis applies, so if things didn't behave like that for MD4 then we would have found a very interesting and unexpected property of MD4...

> I was actually thinking pseudo-random, as in evenly distributed, and
> with no exploitable pattern, ie still requiring an exhaustive search
> with no known shortcuts.
I haven't used any shortcuts or exploitable patterns; merely the determinism, which allows me to find collisions a priori and inject them. Note that each block is hashed independently, so you can't even rely on randomness of preceding blocks (which wouldn't be a good idea anyway).

> > In adversarial setting the Fletcher checksum is worthless. So to
> > obtain the intended 2^-10 failure probability, rsync should increase
> > its strong checksum by 4 bytes. That's for rsync in normal mode,
> > where checksum_seed is randomized. For fixed checksum_seed (i.e.,
> > rsync in batch mode and librsync) the failure probability in
> > adversarial setting is 1, as demonstrated above.
>
> scary. But easily fixed.

The latter, at least.

> Yeah. I've been thinking the same for some time. Because librsync
> separates the signature/delta/patch operations, retransmission is an
> application level decision, but librsync should certainly warn you
> when it fails.

Aye. It would be nice to add (optional) retransmission to rdiff.

> I have seen several variants for the rollsum;
> rsync: uses a basic fletcher.

Except that strictly speaking, Fletcher uses one's-complement arithmetic.

> adler32: mod's the s1 and s2 values by the largest 16 bit prime.
> librsync: adds CHAR_OFFSET as described.
> xdelta: uses a pre-calculated random mapping from char to ints.
>
> I think adler32 achieves nothing but reducing the checksum domain.

Quoth RFC 1950: "That 65521 is prime is important to avoid a possible large class of two-byte errors that leave the check unchanged." Not much of a difference in adversarial setting, though.

> librsync's offset probably does nothing, but I have a feeling it might
> help a bit for blocksizes less than 256 bytes, because it at least
> helps use more of the checksum domain for the s1 part. Without it you
> don't end up using the high bits of s1 at all for small blocks.

It doesn't matter. You just shift the small range of possible values -- instead of using a small range of small numbers you'll use a small range of large numbers.
Whether two blocks (having the same length) have the same Fletcher sum is unaffected by the choice of CHAR_OFFSET. This is all that matters.

> xdelta's random mapping probably does nothing either, but might also
> help use more of the checksum domain for small blocks because it maps
> the chars to ints.

Indeed. Also, xdelta is more resilient to local collisions that consist of linear changes to the data. For example, in Fletcher you can get a collision by changing any sequence of 4 characters as follows: +1 -1 -1 +1, assuming there is no wrap-around in any of the chars. With random mappings of char to short you won't have such a short vector of differences that always yields a collision. More generally, Fletcher is almost a linear function: Fletcher(block1) + Fletcher(block2) = Fletcher(block1+block2) whenever the character additions don't overflow, which is a bad thing. xdelta is a bit better in this sense -- it
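The +1 -1 -1 +1 difference vector is easy to verify in Python (illustration only; the checksums under discussion are C). The s1 deltas sum to zero, and because four consecutive positions carry s2 weights w, w-1, w-2, w-3, the s2 delta is w - (w-1) - (w-2) + (w-3) = 0 as well:

```python
def fletcher(block: bytes) -> int:
    """Plain two-part Fletcher-style sum, as in rsync with CHAR_OFFSET=0."""
    s1 = s2 = 0
    for c in block:
        s1 = (s1 + c) & 0xFFFF
        s2 = (s2 + s1) & 0xFFFF
    return (s2 << 16) | s1

orig = bytearray(b"some sixteen-ish bytes of data")
mod = bytearray(orig)
p = 5
for off, d in zip(range(4), (+1, -1, -1, +1)):  # the +1 -1 -1 +1 vector
    mod[p + off] += d                           # assumes no char wrap-around

assert bytes(mod) != bytes(orig)
assert fletcher(bytes(orig)) == fletcher(bytes(mod))
```

Any block position p works, as long as none of the four bytes wraps around 0 or 255, which is the "no wrap-around" caveat above.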
Re: MD4 checksum_seed
G'day,

From: Eran Tromer [EMAIL PROTECTED]
> Donovan Baarda wrote:
> [...]
> > I thought that without using some sort of known vulnerability the
> > formula would be; avg number of random attempts = number of possible
> > sum values / 2
>
> Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
> Roughly speaking, after looking at N blocks you have N*(N-1)/2
> unordered pairs of observed blocks, and each pair gives a collision
> with probability 2^-64, so you need N=~2^32.
[...]

Doh! Of course... it was exactly the same paradox that showed how high the probability of a collision was in rsync for large files. You are not searching for a collision with a particular block, just between any two blocks.

> > I have seen several variants for the rollsum;
> > rsync: uses a basic fletcher.
>
> Except that strictly speaking, Fletcher uses one's-complement
> arithmetic. [...]

Ahh, someone that actually knows :-) I wasn't sure of the exact definitions and names.

> Quoth RFC 1950: "That 65521 is prime is important to avoid a possible
> large class of two-byte errors that leave the check unchanged." Not
> much of a difference in adversarial setting, though.

And not much of a difference I think for rsync in general... I wonder just how large the "possible large class" is?

> > librsync's offset probably does nothing, but I have a feeling it
> > might help a bit for blocksizes less than 256 bytes, because it at
> > least helps use more of the checksum domain for the s1 part. Without
> > it you don't end up using the high bits of s1 at all for small
> > blocks.
>
> It doesn't matter. You just shift the small range of possible values
> -- instead of using a small range of small numbers you'll use a small
> range of large numbers. [...]

Yeah... that makes sense...

> > It might be possible to craft a mapping like xdelta's but using a
> > special sequence calculated for best effect, much like the
> > polynomials in crc32 and crc64.
>
> Possibly, for natural or random data (as opposed to adversarial).
> But in such contexts, random choice is often close to optimal with
> high probability.

I'll take your word for it :-). Actually I think the polynomials in crc are crafted for common line errors. I'm not sure you can do much crafting without making certain assumptions about the types of errors. In our case I don't think we can make assumptions about the types of errors.

> > Have you looked at the librsync rollsum.c implementation in CVS? It
> > replaced rs_calc_weak_sum and should be a fair bit faster.
>
> Uhm. Usually loop unrolling is best left for the compiler, and it's
> counterproductive with some modern processors. Is rollsum.c faster
> than the old implementation with -O3?

I can't remember... there were other cleanups associated with changing to using rollsum that may have impacted. The manual unrolling was taken straight from the adler32 implementation in zlib. I come from an embedded system background where the C compilers are often pretty basic, and manual unrolling makes a big difference.

I've been thinking of putting together tests for librsync using the check framework so that we can easily test and time things like this.

Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: MD4 checksum_seed
Hi, On 2004/03/15 03:49, Donovan Baarda wrote: Note that, above, block hash collisions are very easy to find if you know checksum_seed. The rolling Fletcher checksum1 is trivially defeated. To defeat the k-bit truncated MD4 checksum2, just keep generating random blocks having the same checksum1 until you find two with the same checksum2; by the birthday paradox it will take about 2^(k/2) attempts, where usually k=16 or k=24 with J.W. Schultz's code. I haven't really thought much about carefully crafted data to bust things. I didn't think the Fletcher checksum was that bad. Although I can see that it would be fairly trivial to craft data for collisions, wouldn't it be harder to craft data for a Fletcher collision _and_ attempt to produce a checksum2 collision? For blocks larger than about 270 bytes, the strength added by the Fletcher sum is close to zero -- you can obtain any Fletcher sum of your choice, with plenty of easily manipulated degrees of freedom remaining. Smaller blocks are a bit messier (e.g., for blocksize=256 you can't obtain all possibilities of the lower 16 bits of the Fletcher sum), but still doable. So you can sample sufficiently random blocks all having the same Fletcher sum (say, zero), and then the birthday paradox applies to the strong sum without any special trickery. Just to be sure, I wrote quick-and-dirty collision-search code for the rsync and librsync hashes. I used a generic collision-finding algorithm (namely Gabriel Nivasch's multistack variant of the generalized Rho method) to search for a truncated MD4 collision among the blocks whose Fletcher sum is zero. It took 13ms to find a collision for the default parameters of rsync version 2.6.0, with either checksum_seed=0 (disabled) or checksum_seed=32761 (the value fixed for batch mode). With a 4-byte strong hash, which is the most you're likely to see with Schultz's code, the above took 1.4 seconds. 
For librsync with default parameters (8-byte strong hash), finding a collision took a couple of hours. Here are examples of colliding pairs for each of the above cases: http://dl.tromer.org/misc/rsync-collisions.tgz I should stress that if you manage to insert such a pair into someone's file, his rsyncs will become slow and his rdiffs will silently fail. Such poisoning is very practical in many settings -- consider website databases and user mailboxes (yes, a pure-ASCII collision can be easily generated with a bit of extra effort). Of course, the attacker will need some knowledge about the file format, and some care in regard to alignment and timing. Still, not a pretty prospect. The 2^(k/2) attempts assumes random attempts... how does this compare to crafted-to-bust-Fletcher attempts? Very favorably. My code performs just as one would expect for a random function. As you mentioned, MD4 has known weaknesses [1]; I'm not sure it's easy to exploit them in our context, but the Fletcher checksum may indeed make it somewhat harder to do so. Anyway, I didn't take advantage of them. Does checksum_seed really need to be random for the checksum2 to be random? I know md4 is considered vulnerable compared to md5, but does it have a poor distribution for a fixed seed? If it does, would it make sense to switch to md5 rather than randomise the seed? checksum_seed must be random for the checksum2 to be random -- simply because there is no other source of randomness in the hash calculation, and if everything is deterministic then you can find collisions in advance and put them in the data. This doesn't have anything to do with the cryptographic strength of MD4. But if the attacks of [1] can be adapted to our scenario then things will become even *worse* than the above, and this is a good reason to switch to MD5. 
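[Editor's note: the scale of the search above is easy to reproduce in miniature. The sketch below is my own toy, not Tromer's code; it birthday-searches for a collision in a 24-bit truncated hash, with MD5 standing in for MD4 because hashlib's MD4 support depends on the OpenSSL build. Expected cost is only ~2^12 hash calls, which is the whole point of the 2^(k/2) figure.]

```python
import hashlib
import itertools

def truncated_hash(block, nbytes=3):
    # MD5 as a stand-in for MD4 (hashlib's "md4" is OpenSSL-dependent).
    return hashlib.md5(block).digest()[:nbytes]

def birthday_collision(nbytes=3):
    """Find two distinct blocks with the same truncated hash by storing
    every digest seen; ~2^(8*nbytes/2) attempts by the birthday paradox."""
    seen = {}
    for i in itertools.count():
        block = i.to_bytes(8, "big")  # all blocks distinct by construction
        h = truncated_hash(block, nbytes)
        if h in seen:
            return seen[h], block
        seen[h] = block

a, b = birthday_collision()
assert a != b and truncated_hash(a) == truncated_hash(b)
```

Note the memory cost: this stores everything seen, which is why a serious search uses a Rho-style method instead.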
For the purpose of evaluating the probability of retransmission, rsync uses s2length bytes of good hash plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really apply. You can still use the same formula, just don't count checksum1 in the checksum size part of it. Or depending on how much you think it's worth you could count it as <32 bits worth of checksum, not the full 32 bits. In an adversarial setting the Fletcher checksum is worthless. So to obtain the intended 2^-10 failure probability, rsync should increase its strong checksum by 4 bytes. That's for rsync in normal mode, where checksum_seed is randomized. For fixed checksum_seed (i.e., rsync in batch mode and librsync) the failure probability in an adversarial setting is 1, as demonstrated above. A couple of related notes: librsync really ought to include a whole-file hash, like rsync. Maybe also an option of retransmitting the file in case of failure (where possible). Either way, failure must be reliably detected. Besides increasing reliability, in many cases it would allow use of smaller block checksums. You can't just entrust integrity checking to a subsequent invocation of md5sum or such; re-reading the files is expensive and not always
Re: MD4 checksum_seed
On Tue, 2004-03-16 at 10:44, Eran Tromer wrote: Hi, On 2004/03/15 03:49, Donovan Baarda wrote: [...] Just to be sure, I wrote quick-and-dirty collision-search code for the rsync and librsync hashes. I used a generic collision-finding algorithm (namely Gabriel Nivasch's multistack variant of the generalized Rho method) to search for a truncated MD4 collision among the blocks whose Fletcher sum is zero. It took 13ms to find a collision for the default parameters of rsync version 2.6.0, with either checksum_seed=0 (disabled) or checksum_seed=32761 (the value fixed for batch mode). With a 4-byte strong hash, which is the most you're likely to see with Schultz's code, the above took 1.4 seconds. For librsync with default parameters (8-byte strong hash), finding a collision took a couple of hours. Here are examples of colliding pairs for each of the above cases: http://dl.tromer.org/misc/rsync-collisions.tgz [...] with an 8-byte hash that means you tested approximately 2^64/2 crafted Fletcher-busting blocks to find a collision, yeah? Did that really take only a couple of hours? Man, computers are heaps faster now! By my calculations, even processing them at one block every microsecond, that would have taken over 272 years. Are you sure? You must have got _very_ lucky, or there are some serious flaws in md4. Perhaps there is a serious problem with how we are truncating the md4? I just checked your librsync sample blocks... you are right. The first 8 bytes of the md4sum are the same, the rollsums both evaluate to 0, and yet the data is different. How the hell did you do that? I glanced at Gabriel Nivasch's stuff but I can't see how it could have shortcut it so much... The 2^(k/2) attempts assumes random attempts... how does this compare to crafted-to-bust-Fletcher attempts? [...] Just noticed you use 2^(k/2), not (2^k)/2... 
this must be part of the reason :-) I thought that without using some sort of known vulnerability the formula would be; avg number of random attempts = number of possible sum values / 2 Which for a 64 bit sum would be (2^64)/2. I guess this assumes that the sum function when applied iteratively to itself the way Gabriel Nivasch suggests will walk through the entire sum space before repeating. Does the reduced search space assume some sort of distribution of repeat loops in the sum algorithm? Wouldn't this be highly dependent on the checksum algorithm used? Does checksum_seed really need to be random for the checksum2 to be random? I know md4 is considered vulnerable compared to md5, but does it have a poor distribution for a fixed seed? If it does, would it make sense to switch to md5 rather than randomise the seed? checksum_seed must be random for the checksum2 to be random -- simply because there is no other source of randomness in the hash calculation, I was actually thinking pseudo-random, as in evenly distributed, and with no exploitable pattern, i.e. still requiring an exhaustive search with no known shortcuts. and if everything is deterministic then you can find collisions in advance and put them in the data. This doesn't have anything to do with the cryptographic strength of MD4. But if the attacks of [1] can be adapted to our scenario then things will become even *worse* than the above, and this is a good reason to switch to MD5. [...] Hmmm. yeah. The deterministic nature of a fixed seed is a bit of a problem. You can still use the same formula, just don't count checksum1 in the checksum size part of it. Or depending on how much you think it's worth you could count it as <32 bits worth of checksum, not the full 32 bits. In an adversarial setting the Fletcher checksum is worthless. So to obtain the intended 2^-10 failure probability, rsync should increase its strong checksum by 4 bytes. That's for rsync in normal mode, where checksum_seed is randomized. 
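[Editor's note: to answer the question about walking the entire sum space: Rho-family methods iterate x -> H(x) until the sequence enters a cycle, which by the birthday paradox happens after ~2^(k/2) steps for a random-looking H, and the collision is the pair of distinct points that both map to the cycle's entry node. The sketch below is my own illustration using Floyd's classic tortoise-and-hare, a simpler memoryless relative of Nivasch's multistack variant, on a 24-bit truncated hash with MD5 standing in for MD4.]

```python
import hashlib

def h(x, nbytes=3):
    """Truncated hash, iterated as a function on nbytes-wide values."""
    return hashlib.md5(x).digest()[:nbytes]

def rho_collision(seed):
    """Floyd cycle-finding: returns x != y with h(x) == h(y), using O(1)
    memory and ~2^(8*3/2) hash calls; returns None in the rare case the
    seed itself sits on the cycle."""
    tortoise, hare = h(seed), h(h(seed))
    while tortoise != hare:                    # phase 1: detect the cycle
        tortoise, hare = h(tortoise), h(h(hare))
    tortoise = seed                            # phase 2: find the cycle entry
    while h(tortoise) != h(hare):
        tortoise, hare = h(tortoise), h(hare)
    return (tortoise, hare) if tortoise != hare else None

# Retry a few seeds in case one lands directly on its own cycle.
for s in range(16):
    pair = rho_collision(bytes([s, 0, 0]))
    if pair:
        x, y = pair
        break
assert x != y and h(x) == h(y)
```

So the answer to "does it walk the whole space?" is no: the iteration only needs to run until any repeat occurs, and for a random function that is ~2^(k/2) steps, not 2^k.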
For fixed checksum_seed (i.e., rsync in batch mode and librsync) the failure probability in an adversarial setting is 1, as demonstrated above. scary. A couple of related notes: librsync really ought to include a whole-file hash, like rsync. Maybe also an option of retransmitting the file in case of failure (where possible). Either way, failure must be reliably detected. Besides increasing reliability, in many cases it would allow use of smaller block checksums. You can't just entrust integrity checking to a subsequent invocation of md5sum or such; re-reading the files is expensive and not always possible. Yeah. I've been thinking the same for some time. Because librsync separates the signature/delta/patch operations, retransmission is an application-level decision, but librsync should certainly warn you when it fails. The use of CHAR_OFFSET is a pure waste of CPU cycles. It merely amounts to adding constants to s1 and s2 (blocklen*CHAR_OFFSET and blocklen*(blocklen+1)/2*CHAR_OFFSET, respectively). These depend only on the block length, which you know
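[Editor's note: the CHAR_OFFSET claim is easy to verify numerically. Before the mod-2^16 reduction, a per-byte offset C shifts s1 by blocklen*C and s2 by blocklen*(blocklen+1)/2*C regardless of the data, so it adds no mixing. A quick sketch of mine; the offset value 31 is arbitrary here.]

```python
def sums(data, offset):
    """Unreduced Fletcher-style sums with a per-byte offset."""
    s1 = s2 = 0
    for b in data:
        s1 += b + offset
        s2 += s1
    return s1, s2

C = 31
data = bytes(range(100))
n = len(data)
s1_plain, s2_plain = sums(data, 0)
s1_off, s2_off = sums(data, C)
# The offset contributes only data-independent constants:
assert s1_off - s1_plain == n * C
assert s2_off - s2_plain == n * (n + 1) // 2 * C
```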
Re: MD4 checksum_seed
On Wed, 2004-03-10 at 16:43, Eran Tromer wrote: [...] Note that, above, block hash collisions are very easy to find if you know checksum_seed. The rolling Fletcher checksum1 is trivially defeated. To defeat the k-bit truncated MD4 checksum2, just keep generating random blocks having the same checksum1 until you find two with the same checksum2; by the birthday paradox it will take about 2^(k/2) attempts, where usually k=16 or k=24 with J.W. Schultz's code. I haven't really thought much about carefully crafted data to bust things. I didn't think the Fletcher checksum was that bad. Although I can see that it would be fairly trivial to craft data for collisions, wouldn't it be harder to craft data for a Fletcher collision _and_ attempt to produce a checksum2 collision? The 2^(k/2) attempts assumes random attempts... how does this compare to crafted-to-bust-Fletcher attempts? Another note is that Donovan Baarda's formula for the probability of retransmission (and thus J.W. Schultz's code) assumes that the hashes are random. This is a reasonable assumption for the truncated MD4 checksum2 when checksum_seed is random, but it's a pretty rotten Does checksum_seed really need to be random for the checksum2 to be random? I know md4 is considered vulnerable compared to md5, but does it have a poor distribution for a fixed seed? If it does, would it make sense to switch to md5 rather than randomise the seed? assumption for the Fletcher checksum1. For the purpose of evaluating the probability of retransmission, rsync uses s2length bytes of good hash plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really apply. You can still use the same formula, just don't count checksum1 in the checksum size part of it. Or depending on how much you think it's worth you could count it as <32 bits worth of checksum, not the full 32 bits. 
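[Editor's note: the thread does not spell out Baarda's formula, so here is a hedged approximation of mine, not the exact formula: with b bits of effectively random checksum, s blocks in the signature, and a rolling comparison at roughly every byte offset of an m-byte file, a simple union bound puts the chance of at least one false block match at about m*s/2^b. Whether checksum1 contributes 0, <32, or 32 of those b bits is exactly what is being argued above.]

```python
def false_match_bound(bits, sig_blocks, offsets):
    """Union bound: each (offset, signature block) comparison falsely
    matches with probability 2^-bits; cap the bound at 1."""
    return min(1.0, offsets * sig_blocks / 2.0 ** bits)

# A 1 GiB file cut into 2^20 blocks of 1 KiB, compared at every byte
# offset; counting checksum1 in full (32 bits) plus a 28-bit strong sum
# gives the intended ~2^-10 failure bound:
p = false_match_bound(bits=32 + 28, sig_blocks=2**20, offsets=2**30)
assert p == 2.0 ** -10
```

Dropping the 32 checksum1 bits entirely (the adversarial view) leaves bits=28, and the same bound saturates at 1, which matches the point that rsync would need roughly 4 more strong-checksum bytes.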
-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/