Hi, On 2004/03/15 03:49, Donovan Baarda wrote: >>Note that, above, block hash collisions are very easy to find if you >>know checksum_seed. The rolling Fletcher checksum1 is trivially >>defeated. To defeat the k-bit truncated MD4 checksum2, just keep >>generate random blocks having the same checksum1 until you find two with >>the same checksum2; by the birthday paradox it will take about 2^(k/2) >>attempts, where usually k=16 or k=24 with J.W. Schultz's code. > > I haven't really thought much about carefully crafted data to bust > things. I didn't think the Fletcher checksum was that bad. Although I > can see that it would be fairly trivial to craft data for collisions, > wouldn't it be harder to craft data for a Fletcher collision _and_ > attempt to produce a checksum2 collision?
For blocks larger than about 270 bytes, the strength added by the Fletcher sum is close to zero -- you can obtain any Fletcher sum of your choice, with plenty of easily manipulated degrees of freedom remaining. Smaller blocks are a bit messier (e.g., for blocksize<=256 you can't obtain all possibilities of the lower 16 bit of the Fletcher sum), but still doable. So you can sample "sufficiently random" blocks all having the same Fletcher sum (say, zero), and then the birthday paradox applies to strong sum without any special trickery. Just to be sure, I wrote a quick&dirty collision search code for the rsync and librsync hashes. I used a generic collision-finding algorithm (namely Gabriel Nivasch's multistack variant of the generalized Rho method) to search for a truncated MD4 collision among the blocks whose Fletcher sum is zero. It took 13ms to find a collision for the default parameters of rsync version<2.6.0, with either checksum_seed=0 (disabled) or checksum_seed=32761 (the value fixed for batch mode). With a 4 byte strong hash, which is the most you're likely to see with Schultz's code, the above took 1.4 seconds. For librsync with default parameters (8-byte strong hash), finding a collision took a couple of hours. Here are examples of colliding pairs for each of the above case: http://dl.tromer.org/misc/rsync-collisions.tgz I should stress that if you manage to insert such a pair into someone's file, his rsyncs will become slow and his rdiffs will silently fail. Such poisoning is very practical in many settings -- consider website databases and user mailboxes (yes, a pure-ASCII collision can be easily generated with a bit of extra effort). Of course, the attacker will need some knowledge about the file format, and some care in regard to alignment and timing. Still, not a pretty prospect. > The 2^(k/2) attempts assumes random attempts... how does this compare > to "crafted to bust fletcher" attempts? Very favorably. My code performs just as one would expect for a random function. As you mentioned, MD4 has known weaknesses [1]; I'm not sure it's easy to exploit them in our context, but the Fletcher checksum may indeed make it somewhat harder to do so. Anyway, I didn't take advantage of them. > Does checksum_seed really need to be random for the checksum2 to be > random? I know md4 is considered vulnerable compared to md5, but does it > have a poor distribution for a fixed seed? If it does, would it make > sense to switch to md5 rather than randomise the seed? checksum_seed must be random for the checksum2 to be random -- simply, because there is no other source of randomness in the hash calculation, and if everything is deterministic then you can find collisions in advance and put them in the data. This doesn't have anything to do with the cryptographic strength of MD4. But if the attacks of [1] can be adapted to our scenario then things will become even *worse* than the above, and this is a good reason to switch to MD5. >>For the purpose of evaluating the >>probability of retransmission, rsync uses s2length bytes of good hash >>plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really >>apply. > > You can still use the same formula, just don't count checksum1 in the > "checksum size" part of it. Or depending on how much you think it's > worth you could count it as x<32 bits worth of checksum, not the full 32 > bits. In adversarial setting the Fletcher checksum is worthless. So to obtain the intended 2^-10 failure probability, rsync should increase its strong checksum by 4 bytes. That's for rsync in normal mode, where checksum_seed is randomized. For fixed checksum_seed (i.e., rsync in batch mode and librsync) the failure probability in adversarial setting is 1, as demonstrated above. A couple of related notes: librsync really ought to include a whole-file hash, like rsync. Maybe also an option of retransmitting the file in case of failure (where possible). Either way, failure must be reliably detected. Beside increasing reliability, in many cases it would allow use of smaller block checksums. You can't just entrust integrity checking to a subsequent invocation of md5sum or such; re-reading the files is expensive and not always possible. The use of CHAR_OFFSET is a pure waste of CPU cycles. It merely amounts to adding constants to s1 and s2 (blocklen*CHAR_OFFSET and blocklen*(blocklen+1)/2*CHAR_OFFSET, respectively). These depend only on the block length, which you know anyway. Indeed rsync mercifully sets CHAR_OFFSET=0 which is optimized away, but it still obfuscates the code and the relevant comment (rsync.h lines 38-39) is wrong. And speaking of the latter, gcc -O3 vs. -O2 gets a speedup by a factor of 8 to 10 in rsync's get_checksum1() and librsync's rs_calc_weak_sum(). Tested with gcc 3.3.2 on i686 Xeon and Katmai. You may want to make -O3 the default. Regards, Eran [1] http://citeseer.ist.psu.edu/68442.html ftp://ftp.rsa.com/pub/pdfs/bulletn4.pdf http://www.cacr.math.uwaterloo.ca/hac/about/chap9.pdf Remark 9.50 -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html