Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-11 Thread Donovan Baarda
G'day,
From: Wayne Davison [EMAIL PROTECTED]
 On Thu, Apr 08, 2004 at 03:50:48PM +1000, Donovan Baarda wrote:
  I think I've just realised what you were getting at; if the
  checksum_seed is based on something like the whole file md4sum, it
  becomes repeatable, but unpredictable.

 Not so.  Copy the file once, and you'd get all the data you'd need to
 create a new local file using a known-signature attack (as long as the
 input file didn't change, and that's easy to predict for many files).

I think that between us, Eran Tromer and I have shown that for an md4
blocksum with 'n' bits and a file with 2^m blocks, you will have to try
about 2^(n-m) blocks to find a match to a known signature. For librsync's
64 bit strong_sum, even a 4G file (2^21 blocks of 2048 bytes, so n-m =
64-21 = 43) will need 2^43 attempts, which is sufficiently hard. Sure, you
have all the data you need, but that doesn't make it easy :-)

Assuming no md4sum exploits are used...

 I also don't like the doubling of the I/O-cost on the sending side, so
 I don't think this is a good way to go.

I agree... a random seed gives as good or better protection without the
double parse for the signature.

However, it does mean you don't get the same signature for the same data.
Perhaps there are some other ways to make the signature repeatable without
requiring a double parse?

Is using the md4sum of the first block only as a seed secure enough? I don't
think it is. I think any non-random signature seed needs to take into
account the whole file for it to be secure, otherwise it reduces to the
birthday algorithm for crafting clashes.


Donovan Baarda    http://minkirri.apana.org.au/~abo/






Re: librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-08 Thread Eran Tromer
Hi,

On 2004/04/05 07:21, Donovan Baarda wrote:
[snip]
 there are four ways crafted blocks can be used:
 
 1) two crafted blocks in the original file
 
 2) two crafted blocks in the target file
 
 3) a crafted pair of target and original files with matching
 block(s)
 
 4) a block in the target crafted to match a block in the original
[snip]
 Summary;
 
 case 2) has no impact
 
 case 4) is of minimal impact in rsync, and sufficiently hard in
 librsync.
 
 librsync needs a whole file checksum. Without it, it silently fails for
 case 1), 3), and 4).
 
 librsync could benefit from a random checksum_seed. It would need to be
 included in the signature. Without it librsync is vulnerable to cases 1)
 and 3).
[snip]
 rsync shouldn't need a fixed seed for batch modes... just store the seed
 in the signature. using a fixed seed makes it vulnerable to 1) and 3).

I fully agree with your analysis.
I'll just note that in many situations, case 2 can be elevated to case 3
simply by transferring the file twice.

  Eran


Re: librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-08 Thread Donovan Baarda
G'day,

From: Eran Tromer [EMAIL PROTECTED]
[...]
  librsync needs a whole file checksum. Without it, it silently fails for
  case 1), 3), and 4).
 
  librsync could benefit from a random checksum_seed. It would need to be
  included in the signature. Without it librsync is vulnerable to cases 1)
  and 3).
 [snip]
  rsync shouldn't need a fixed seed for batch modes... just store the seed
  in the signature. using a fixed seed makes it vulnerable to 1) and 3).

 I fully agree with your analysis.
 I'll just note that in many situations, case 2 can be elevated to case 3
 simply by transferring the file twice.

Yeah... did you see my follow-up post about the possibility of using the
whole-file checksum as the checksum_seed for the blocksums? I think it would
be a good idea for librsync. It does require a double parse to generate the
signature, but is otherwise quite nice.


Donovan Baardahttp://minkirri.apana.org.au/~abo/






Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-08 Thread Eran Tromer
On 2004/04/08 08:50, Donovan Baarda wrote:
In some cases you might prefer to actually store a signed signature
using something like GPG.

I think librsync should act as a black box that guarantees file
integrity (which, apparently, requires a whole file checksum). If
someone wants to add authentication or encryption or whatever on top of
that, all the merrier, but let that be done in addition to rsync's own
integrity check.

That means both librsync and (say) GPG would compute a whole file hash.
The communication cost of this duplication is negligible (a typical hash
is much smaller than a typical digital signature). The computation is a
bit more annoying -- it will roughly double the CPU consumption of the
receiver. That can be solved by revealing the whole file checksum
(when available) via the API, so that librsync users can directly sign
that checksum instead of computing their own.
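
A rough sketch of what such an accessor could look like (the names here are
hypothetical, not librsync's actual API; it's just a declaration of the idea):

    /* Hypothetical accessor: after a signature/patch job has run, hand the
     * caller the whole-file MD4 so it can be signed directly (e.g. with
     * GPG) instead of being recomputed from the output file. */
    #define RS_WHOLE_SUM_LENGTH 16           /* full MD4 digest */

    typedef struct rs_job rs_job_t;          /* opaque job handle */

    /* Returns 0 on success, nonzero if the job kept no whole-file sum. */
    int rs_job_whole_sum(const rs_job_t *job,
                         unsigned char sum[RS_WHOLE_SUM_LENGTH]);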


 if the
 checksum_seed is based on something like the whole file md4sum, it
 becomes repeatable, but unpredictable. You can't manipulate individual
 blocks without it affecting every other blocksum, but the signature for
 the same file is always the same. Nice :-)

Nice indeed, but the cost is enormous: you'll have to read the file
twice. When syncing a mostly-unchanged file that's larger than the disk
cache, that means doubling the runtime (and disk load) on the receiver's
side. Also, it means 'rdiff signature' and equivalents won't work on
arbitrary streams. Currently the receiver can tee the output of 'rdiff
patch' into both the output file and an instance of 'rdiff signature',
in order to save the IO of re-reading the file upon the next sync; this
won't work anymore. (How about a built-in option for that, BTW?).


 More thoughts on using the wholefile sum as the seed; at first I thought
 this would still be vulnerable to case 3). Using any single block as
 the original file and trying to find any block that matches means you
 still have birthday-attack numbers of blocks to check (i.e. 2^(n/2)).
 However, each block comparison requires the recalculation of the
 target blocksum using the original blocksum as the seed, resulting
 in non-birthday-attack numbers of checksum calculations (i.e. 2^n).

I'm afraid it's still vulnerable to case 3 (a pair of target and
original files with matching blocks). For simplicity consider
single-block files. In this case what you've done is simply to replace
the hash function
  f(x) = truncate(MD4(x,fixed_seed))
with the hash function
  f'(x) = truncate(MD4(x,MD4(x,fixed_seed)))
which happens to be the same as hashing two copies of x in sequence.
But the birthday attack doesn't care what's the hash function; it only
cares about its input and output sizes (which we didn't change) and that
the function is sufficiently random.

  Eran



Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-08 Thread Donovan Baarda
G'day again,

From: Eran Tromer [EMAIL PROTECTED]
[...]
  if the
  checksum_seed is based on something like the whole file md4sum, it
  becomes repeatable, but unpredictable. You can't manipulate individual
  blocks without it affecting every other blocksum, but the signature for
  the same file is always the same. Nice :-)

 Nice indeed, but the cost is enormous: you'll have to read the file
 twice. When syncing a mostly-unchanged file that's larger than the disk
 cache, that means doubling the runtime (and disk load) on the receiver's
 side. Also, it means 'rdiff signature' and equivalents won't work on

But the vast majority of the load is in the delta calculation on the
sender's side.

 arbitrary streams. Currently the receiver can tee the output of 'rdiff
 patch' into both the output file and an instance of 'rdiff signature',
 in order to save the IO of re-reading the file upon the next sync; this
 won't work anymore. (How about a built-in option for that, BTW?).

True... that is a nice feature... you are slowly turning me off the
whole-file checksum as seed :-)

  More thoughts on using the wholefile sum as the seed; at first I thought
  this would still be vulnerable to case 3). Using any single block as
  the original file and trying to find any block that matches means you
  still have birthday-attack numbers of blocks to check (i.e. 2^(n/2)).
  However, each block comparison requires the recalculation of the
  target blocksum using the original blocksum as the seed, resulting
  in non-birthday-attack numbers of checksum calculations (i.e. 2^n).

 I'm afraid it's still vulnerable to case 3 (a pair of target and
 original files with matching blocks). For simplicity consider
 single-block files. In this case what you've done is simply to replace
 the hash function
   f(x) = truncate(MD4(x,fixed_seed))
 with the hash function
   f'(x) = truncate(MD4(x,MD4(x,fixed_seed)))

Not quite... it's f(x,y) = truncate(MD4(x,MD4(y,fixed_seed))), where x and y
are the two blocks to be compared. This means you have to re-calculate the
hash for every compare, not just once for every block.

 which happens to be the same as hashing two copies of x in sequence.
 But the birthday attack doesn't care what's the hash function; it only
 cares about its input and output sizes (which we didn't change) and that
 the function is sufficiently random.

It also assumes that the hash is done once per sample input, not once per
compare. Sure, you still only need to try 2^(n/2) blocks, but you need to
calculate 2^n hashes, and that's what really hurts.


Donovan Baarda    http://minkirri.apana.org.au/~abo/






Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-08 Thread Eran Tromer
Ahoy,

On 2004/04/08 14:16, Donovan Baarda wrote:
Nice indeed, but the cost is enormous: you'll have to read the file
twice. When syncing a mostly-unchanged file that's larger than the disk
cache, that means doubling the runtime (and disk load) on the receiver's
side. Also, it means 'rdiff signature' and equivalents won't work on
 
 But the vast majority of the load is in the delta calculation on the
 sender's side.

My experience is that when you sync mostly unchanged large files on
modern PCs, both sides are IO-bound. The delta calculation just rolls
along at top speed due to the "try the next block first" heuristic.


I'm afraid it's still vulnerable to case 3 (a pair of target and
original files with matching blocks). For simplicity consider
single-block files. In this case what you've done is simply to replace
the hash function
  f(x) = truncate(MD4(x,fixed_seed))
with the hash function
  f'(x) = truncate(MD4(x,MD4(x,fixed_seed)))
 
 Not quite... it's f(x,y) = truncate(MD4(x,MD4(y,fixed_seed))), where x and y
 are the two blocks to be compared. This means you have to re-calculate the
 hash for every compare, not just once for every block.

Indeed, you're right.
More fundamentally, every time you compute f(x,y) you win iff
f(x,y)==f(y,y), otherwise you don't learn anything interesting. So
you'll have to compute f about 2^n times. Yes, this looks secure when
the hash function is perfectly random. The only reservation is that
using the same user-affected seed to hash many user-determined blocks is
uncomfortably reminiscent of MD4's known weaknesses.

Still, are there reasons beyond the aesthetic to want deterministic
signature generation? The costs in IO and flexibility seem very high.

  Eran



Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-08 Thread Wayne Davison
On Thu, Apr 08, 2004 at 03:50:48PM +1000, Donovan Baarda wrote:
 I think I've just realised what you were getting at; if the
 checksum_seed is based on something like the whole file md4sum, it
 becomes repeatable, but unpredictable.

Not so.  Copy the file once, and you'd get all the data you'd need to
create a new local file using a known-signature attack (as long as the
input file didn't change, and that's easy to predict for many files).
I also don't like the doubling of the I/O-cost on the sending side, so
I don't think this is a good way to go.

..wayne..


Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-07 Thread Donovan Baarda
On Thu, 2004-04-08 at 12:36, Martin Pool wrote:
 On  5 Apr 2004, Donovan Baarda [EMAIL PROTECTED] wrote:
 
  librsync needs a whole file checksum. Without it, it silently fails for
  case 1), 3), and 4).
 
 Yes, a whole-file checksum should be used with it.  Presumably
 something stronger than md4 like SHA-1.

md4 is probably good enough for most applications. I know it's not as
secure as others, but when you take into account the requirement that
the signature match as well, compromising it becomes much more
complicated.

 I think the only question is whether this should be done internally in
 librsync, or as a separate process.  I can see arguments either way.  

 In some cases you might prefer to actually store a signed signature
 using something like GPG.

Yeah... good point. The rdiff program should probably use a whole file
md4sum though.
 
  librsync could benefit from a random checksum_seed. It would need to be
  included in the signature. Without it librsync is vulnerable to cases 1)
  and 3).
 
 Random with respect to what?  I think it would be nice if repeatedly
 summing identical files gave identical signatures.  Maybe it can vary
 depending on only the input data...

The problem is that repeatability is what makes it vulnerable. If the content
fully determines the signature, the content can be crafted to produce a
particular signature. It is relatively easy to calculate two different
blocks with the same blocksum if the checksum_seed is known.

librsync could be modified to detect when two blocks had the same
blocksum in a signature, and permute the checksum seed in a
deterministic way to try repeatedly until a signature without collisions
is found. This would give repeatable signatures and protect against case
1), but not case 3). It would also allow crafted files that force m/2
recalculations of the signature.

I think I've just realised what you were getting at; if the
checksum_seed is based on something like the whole file md4sum, it
becomes repeatable, but unpredictable. You can't manipulate individual
blocks without it affecting every other blocksum, but the signature for
the same file is always the same. Nice :-)

This would fit in nicely with adding a whole-file checksum to the
signature... generate the wholefile md4sum, and use that as the starting
seed for the individual blocksums. The wholefile sum becomes the seed.
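
As a concrete (if simplified) sketch of that double parse, it would look
something like the following. This is not librsync code or its API; it uses
OpenSSL's MD4 purely for illustration, and it just prepends the seed to each
block's data, whereas the real code would mix the seed in differently:

    /* Pass 1: whole-file MD4.  Pass 2: per-block strong sums seeded with it. */
    #include <stdio.h>
    #include <openssl/md4.h>

    #define BLOCK_LEN  2048
    #define STRONG_LEN 8        /* librsync's 64-bit truncated strong sum */

    static void whole_file_md4(FILE *f, unsigned char out[MD4_DIGEST_LENGTH])
    {
        unsigned char buf[BLOCK_LEN];
        size_t n;
        MD4_CTX ctx;
        MD4_Init(&ctx);
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            MD4_Update(&ctx, buf, n);
        MD4_Final(out, &ctx);
    }

    int main(int argc, char **argv)
    {
        unsigned char seed[MD4_DIGEST_LENGTH], sum[MD4_DIGEST_LENGTH];
        unsigned char buf[BLOCK_LEN];
        size_t n;
        FILE *f;

        if (argc < 2 || !(f = fopen(argv[1], "rb")))
            return 1;
        whole_file_md4(f, seed);                 /* first parse */
        rewind(f);                               /* second parse */
        while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
            MD4_CTX ctx;
            MD4_Init(&ctx);
            MD4_Update(&ctx, seed, sizeof seed); /* whole-file sum as seed */
            MD4_Update(&ctx, buf, n);
            MD4_Final(sum, &ctx);
            fwrite(sum, 1, STRONG_LEN, stdout);  /* truncated blocksum */
        }
        fclose(f);
        return 0;
    }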

This doesn't allow for permuting the seed if it happens to result in
blocksum clashes. However, given that the probability of this happening
is astronomically small (when using the wholefile sum as seed) and any
clash will be caught by a whole-file checksum mismatch, do we care? It is
far more likely to get false block matches when calculating the delta
(because sums are calculated at every byte boundary, not just at block
boundaries).

More thoughts on using the wholefile sum as the seed; at first I thought
this would still be vulnerable to case 3). Using any single block as
the original file and trying to find any block that matches means you
still have birthday-attack numbers of blocks to check (i.e. 2^(n/2)).
However, each block comparison requires the recalculation of the
target blocksum using the original blocksum as the seed, resulting
in non-birthday-attack numbers of checksum calculations (i.e. 2^n).

I've Cc'd this to the rsync list because they might get some ideas from
it.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-04 Thread Donovan Baarda
G'day again,

Just revisiting an old thread after some more thought. Eran and I were
discussing the vulnerability of librsync and rsync to deliberate attempts
to craft blocks with matching signatures but different content. It turns
out it's disturbingly easy. Here's a bit of context:

From: Donovan Baarda [EMAIL PROTECTED]
 From: Eran Tromer [EMAIL PROTECTED]
  Donovan Baarda wrote:
 [...]
   I thought that without using some sort of known vulnerability the
   formula would be;
 avg number of random attempts = number of possible sum values / 2
 
  Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
  Roughly speaking, after looking at N blocks you have N*(N-1)/2 unordered
  pairs of observed blocks, and each pair gives a collision with
  probability 2^-64, so you need N ~= 2^32.
 [...]

 Doh! Of course... it was exactly the same paradox that showed how high the
 probability of a collision was in rsync for large files. You are not
 searching for a collision with a particular block, just between any two
 blocks.

I'm not entirely sure about rsync, but apparently such an attack will
make it slow down, but it will not corrupt files. Librsync on the other
hand will (currently) silently corrupt files. Before librsync users
panic: how much do you really care that a maliciously crafted file is
corrupted? In rsync this can be used as a performance degradation
attack. However, rsync by default uses a pseudo-random checksum_seed,
which makes such an attack nearly impossible (however... the original
discussion started around using a fixed checksum_seed in batch mode).

I've done a bit more thinking on this, and there are four ways crafted
blocks can be used:

1) two crafted blocks in the original file

2) two crafted blocks in the target file

3) a crafted pair of target and original files with matching
block(s)

4) a block in the target crafted to match a block in the original

For case 1), it is possible to detect collisions at signature
generation time. Using a random checksum_seed makes it nearly
impossible. Combining the two, selecting a new random_seed when a
collision is detected, would make this pretty much impossible. Note that
a random_seed can be used for batch mode and librsync, but the seed
would need to be included in the signature.

For case 2), the crafted blocks would not match the original, and hence
would just be "miss" data that is copied directly into the delta. Currently
librsync and rsync don't do any sort of target self-referencing matches, so
this cannot affect librsync or rsync. Note that there has been discussion of
adding this to librsync, and it is supported by the vdelta format.
When/if such a thing were added, similar conditions to 1) would apply to the
target matching algorithm. Any such addition should probably use
xdelta-style matching anyway, which compares the data, not a
strong_sum, making it immune to crafted data.

For case 3), crafting a pair of files is made nearly impossible by using
the protections for case 1). It may be possible to craft a file that has
a reduced set of possible signatures for even a random_seed, but this
would require vulnerabilities in the strong_sum and/or random seed
generator. Barring such a vulnerability, a random_seed randomises the
signature for the original, reducing this to case 4).

Case 4) is the nasty one. The protection against case 1) ensures that
the signature is randomised for a given original, but once you have
the signature, you know the random_seed, and all elements of randomness
to the rest of the algorithm are gone. However, this is not a true
birthday paradox situation; you are not trying to find any two blocks
that match, but a block that matches any block in the original. This
is nowhere near as easy. The number of attempts required is:

n ~= 2^(b'-m)

where:

  n is the number of attempts required to find a collision
  m is the exponent such that there are 2^m blocks in the signature
  b' is the number of useful bits in the signature (excluding rollsum)

Interestingly, the dynamic blocksum heuristic now in rsync is

  B = sqrt(s)
  b = 10 + 2*log2(s) - log2(B)

Where:

  B is the block size
  s is the file size
  32 bits of 'b' are the rollsum

which means, assuming the 32 bits of rollsum are useless and hence should
not be included in b':

  m = log2(s)/2
  b' = -22 + 2*log2(s) - log2(s)/2 = -22 + (3/2)*log2(s)

so the number of attempts is

  n = 2^(-22 + log2(s))

so the larger the file, the harder it is to compromise, because 'b'
grows faster than 'm'. Note that rsync has a lower limit on b' of 16,
which corresponds to a filesize of 32M. This equates to n = 2^3
attempts, which is rather low... in the realm of on-the-fly calculation
time.

For librsync, which uses a fixed b'=64 and default B=2048, this becomes:

  n = 2^(64 - log2(s/(2^11))) = 2^(75 - log2(s))

which is 2^43 for a 4G file... not easy at all.
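
Purely as a sanity check on the arithmetic above, a throwaway calculation
along these lines (my own sketch, not anything from rsync or librsync)
reproduces the numbers:

    /* Back-of-the-envelope check of n ~= 2^(b'-m) for both heuristics. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double s = 4.0 * 1024 * 1024 * 1024;     /* 4G file */

        /* rsync: B = sqrt(s), b = 10 + 2*log2(s) - log2(B), minus 32 rollsum bits */
        double B  = sqrt(s);
        double bp = 10 + 2 * log2(s) - log2(B) - 32;
        double m  = log2(s / B);
        printf("rsync:    log2(n) ~= %.1f\n", bp - m);                /* -> 10 */

        /* librsync: fixed 64-bit strong sum, default 2048-byte blocks */
        printf("librsync: log2(n) ~= %.1f\n", 64 - log2(s / 2048));   /* -> 43 */
        return 0;
    }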

For rsync, a malicious sender could easily craft blocks that match the
signature but have different content. Big deal. A malicious sender
doesn't have to 

Re: MD4 checksum_seed

2004-03-17 Thread Eran Tromer
Hi,

On 2004/03/17 00:07, Donovan Baarda wrote:

Quoth RFC 1950:
That 65521 is prime is important to avoid a possible large class of
two-byte errors that leave the check unchanged.
Not much of a difference in adversarial setting, though.
 
 And not much of a difference I think for rsync in general... I wonder just
 how large the "possible large class" is?

I don't know, and I don't have the relevant papers at hand. But here's
an example, easily generalized: add 128 to the 513th byte and subtract
128 from the 512th byte (counting from the end and assuming no char
wrap-around).


It might be possible to craft a mapping like xdelta's but using a
special sequence calculated for best effect, much like the polynomials
in crc32 and crc64.

Possibly, for natural or random data (as opposed to adversarial). But
in such contexts, random choice is often close to optimal with high
probability.

Clarification: in such contexts I meant lookup tables, not polynomials
(you want to be really careful with the latter).


 I'll take your word for it :-). Actually I think the polynomials in crc are
 crafted for common line errors. I'm not sure you can do much crafting
 without making certain assumptions about the types of errors. In our case I
 don't think we can make assumptions about the types of errors.

I fully concur.


Have you looked at the librsync rollsum.c implementation in CVS? It
replaced rs_calc_weak_sum and should be a fair bit faster.

Here are run times for 7000 calculations over the same all-zero block.
checksum.c is librsync's current code (which differs from rsync's only
in that the chars are signed).
rollsum.c is librsync's rollsum.c from CVS.
XDELTA is checksum.c changed to add a char-to-short lookup table
(I blindly converted every occurrence of buf[foo] into
lookup_table[buf[foo]]; the lookup table is xdelta's).
Numbers are in seconds; tested on a Pentium Xeon 1700MHz, gcc 3.2.3.

checksum.c -O2 CHAR_OFFSET=0:   273.42
checksum.c -O2 CHAR_OFFSET=31:  300.63
checksum.c -O3 CHAR_OFFSET=0:    31.42
checksum.c -O3 CHAR_OFFSET=31:   35.32
rollsum.c  -O2 CHAR_OFFSET=0:    99.47
rollsum.c  -O2 CHAR_OFFSET=31:   99.22
rollsum.c  -O3 CHAR_OFFSET=0:   100.18
rollsum.c  -O3 CHAR_OFFSET=31:   99.77
XDELTA     -O2:                 386.07
XDELTA     -O3:                  32.17

If you use the trivial one-char-at-a-time Fletcher algorithm (e.g.,
disable the first loop in librsync's code and remember to initialize
i=0) you get the following:

checksum.c -O2 CHAR_OFFSET=0:   259.81
checksum.c -O2 CHAR_OFFSET=31:  259.41
checksum.c -O3 CHAR_OFFSET=0:   134.06
checksum.c -O3 CHAR_OFFSET=31:  137.44
XDELTA     -O2:                 300.28
XDELTA     -O3:                 135.31

Some things to observe:
* The fastest variant is checksum.c with -O3.
* The extra cost of xdelta-type lookups is acceptable for -O2 and
  virtually nonexistent for -O3.
* Compared to checksum.c, rollsum.c is roughly three times faster
  when using -O2 and roughly three times *slower* when using -O3.
* With the current default -O2, checksum.c's four-chars-at-a-time
  code is worse than the trivial code.
* Nonzero CHAR_OFFSET has a small but measurable cost.


BTW, in both rsync and librsync's checksum.c, the loop
    for (i = 0; i < (len-4); i+=4) {
has an off-by-one limit. It should be i <= (len-4).
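
For reference, a simplified sketch of the kind of loop in question with the
corrected bound; this is just the general shape, not the actual rsync or
librsync source (CHAR_OFFSET is omitted):

    #include <stddef.h>
    #include <stdint.h>

    uint32_t weak_sum(const signed char *buf, size_t len)
    {
        uint32_t s1 = 0, s2 = 0;
        size_t i = 0;

        /* four chars at a time: note i <= (len-4), not i < (len-4) */
        if (len >= 4) {
            for (i = 0; i <= (len - 4); i += 4) {
                s2 += 4*(s1 + buf[i]) + 3*buf[i+1] + 2*buf[i+2] + buf[i+3];
                s1 += buf[i] + buf[i+1] + buf[i+2] + buf[i+3];
            }
        }
        /* remainder, one char at a time */
        for (; i < len; i++) {
            s1 += buf[i];
            s2 += s1;
        }
        return (s1 & 0xffff) | (s2 << 16);
    }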

  Eran


Re: MD4 checksum_seed

2004-03-16 Thread Eran Tromer
Donovan Baarda wrote:

 On Tue, 2004-03-16 at 10:44, Eran Tromer wrote:

 with an 8-byte hash that means you tested approximately 2^64/2 crafted
 fletcher-busting blocks to find a collision, yeah?
[snip]
 I thought that without using some sort of known vulnerability the
 formula would be;
   avg number of random attempts = number of possible sum values / 2

Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
Roughly speaking, after looking at N blocks you have N*(N-1)/2 unordered
pairs of observed blocks, and each pair gives a collision with
probability 2^-64, so you need N ~= 2^32.


 I glanced at Gabriel Nivasch's stuff but I can't see how it could
 have shortcut it so much...

Nivasch's method provides a (small) constant-factor improvement over
other generic collision-finding algorithms, such as Floyd's two-finger
method. Nothing special there.


 I guess this assumes that the sum function when applied iteratively to
 itself the way Gabriel Nivasch suggests will walk through the entire
 sumspace before repeating. Does the reduced search space assume some
 sort of distribution of repeat loops in the sum algorithm?

Exactly. If you iterate a random function from k bits to k bits, you'll
enter a loop in expected time sqrt(pi/2)*2^(k/2). Once you've detected a
loop, you can back up and recover the collision. But note that this
loopy behavior is utilized for reducing intermediate storage; if you
could keep 2^32 hash values in memory and check every sample against all
old ones, you wouldn't need to iterate anything or rely on loops.
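
A toy illustration of that table-based variant (a sketch for illustration
only, not the actual collision-search code used for the numbers above): it
ignores the Fletcher constraint entirely, uses OpenSSL's MD4 for convenience,
and with a 16-bit truncated strong sum (a common rsync default) it finds a
collision almost instantly:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <openssl/md4.h>

    #define BLOCK_LEN 64
    #define K_BITS    16                     /* truncated strong-sum width */

    int main(void)
    {
        static unsigned char seen[1 << K_BITS][BLOCK_LEN];
        static int used[1 << K_BITS];
        unsigned char block[BLOCK_LEN], digest[MD4_DIGEST_LENGTH];

        srand(42);
        for (;;) {
            for (int i = 0; i < BLOCK_LEN; i++)
                block[i] = rand() & 0xff;
            MD4(block, BLOCK_LEN, digest);
            unsigned idx = digest[0] | (digest[1] << 8);   /* first 16 bits */

            if (used[idx] && memcmp(seen[idx], block, BLOCK_LEN) != 0) {
                printf("collision on truncated sum 0x%04x\n", idx);
                return 0;
            }
            memcpy(seen[idx], block, BLOCK_LEN);
            used[idx] = 1;
        }
    }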


 Wouldn't this be highly dependent on the checksum algorithm used?

Indeed, but in a random way. More exactly, for a random function the
above analysis applies, so if things didn't behave like that for MD4
then we would have found a very interesting and unexpected property of
MD4...


 I was actually thinking pseudo-random, as in evenly distributed, and
 with no exploitable pattern, i.e. still requiring an exhaustive search
 with no known shortcuts.

I haven't used any shortcuts or exploitable patterns; merely the
determinism, which allows me to find collisions a priori and inject them.
Note that each block is hashed independently, so you can't even rely on
randomness of preceding blocks (which wouldn't be a good idea anyway).


In adversarial setting the Fletcher checksum is worthless. So to obtain
the intended 2^-10 failure probability, rsync should increase its strong
checksum by 4 bytes.

That's for rsync in normal mode, where checksum_seed is randomized. For
fixed checksum_seed (i.e., rsync in batch mode and librsync) the failure
probability in adversarial setting is 1, as demonstrated above.

 scary.

But easily fixed. The latter, at least.


 Yeah. I've been thinking the same for some time. Because librsync
 separates the signature/delta/patch operations, retransmission is an
 application level decision, but librsync should certainly warn you when
 it fails.

Aye. It would be nice to add (optional) retransmission to rdiff.


 I have seen several variants for the rollsum;

 rsync: uses a basic fletcher.

Except that strictly speaking, Fletcher uses one's-complement arithmetic.


 adler32: mod's the s1 and s2 values by the largest 16 bit prime.
 librsync: adds CHAR_OFFSET as described.
 xdelta: uses a pre-calculated random mapping from char to ints.

 I think adler32 achieves nothing but reducing the checksum domain.

Quoth RFC 1950:
That 65521 is prime is important to avoid a possible large class of
two-byte errors that leave the check unchanged.
Not much of a difference in adversarial setting, though.


 librsync's offset probably does nothing, but I have a feeling it might
 help a bit for blocksizes less than 256 bytes, because it at least helps
 use more of the checksum domain for the s1 part. Without it you don't
 end up using the high bits of s1 at all for small blocks.

It doesn't matter. You just shift the small range of possible values
-- instead of using a small range of small numbers you'll use a small
range of large numbers.

Whether two blocks (having the same length) have the same Fletcher sum is
unaffected by the choice of CHAR_OFFSET. This is all that matters.


 xdelta's random mapping probably does nothing either, but might also
 help use more of the checksum domain for small blocks because it maps
 the chars to ints.

Indeed. Also, xdelta is more resilient to local collisions that consist
of linear changes to the data. For example, in Fletcher you can get a
collision by changing any sequence of 4 characters as follows:
  +1 -1 -1 +1
assuming there is no wrap-around in any of the chars. With random
mappings of char to short you won't have such a short vector of
differences that always yields a collision. More generally, Fletcher is
almost a linear function:
  Fletcher(block1) + Fletcher(block2) = Fletcher(block1+block2)
whenever the character additions don't overflow, which is a bad thing.
xdelta is a bit better in this sense -- it 
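
A quick numeric demonstration of that "+1 -1 -1 +1" collision (a sketch for
illustration only; plain mod-2^16 Fletcher, no CHAR_OFFSET):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    static void fletcher(const unsigned char *buf, size_t len,
                         uint32_t *s1, uint32_t *s2)
    {
        *s1 = *s2 = 0;
        for (size_t i = 0; i < len; i++) {
            *s1 = (*s1 + buf[i]) & 0xffff;
            *s2 = (*s2 + *s1) & 0xffff;
        }
    }

    int main(void)
    {
        unsigned char a[32], b[32];
        uint32_t a1, a2, b1, b2;

        memset(a, 'x', sizeof a);
        memcpy(b, a, sizeof a);
        b[10] += 1; b[11] -= 1; b[12] -= 1; b[13] += 1;   /* +1 -1 -1 +1 */

        fletcher(a, sizeof a, &a1, &a2);   /* both blocks get the same sums */
        fletcher(b, sizeof b, &b1, &b2);
        printf("a: s1=%u s2=%u\nb: s1=%u s2=%u\n", a1, a2, b1, b2);
        return 0;
    }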

Re: MD4 checksum_seed

2004-03-16 Thread Donovan Baarda
G'day,

From: Eran Tromer [EMAIL PROTECTED]
 Donovan Baarda wrote:
[...]
  I thought that without using some sort of known vulnerability the
  formula would be;
avg number of random attempts = number of possible sum values / 2

 Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
 Roughly speaking, after looking at N blocks you have N*(N-1)/2 unordered
 pairs of observed blocks, and each pair gives a collision with
 probability 2^-64, so you need N ~= 2^32.
[...]

Doh! Of course... it was exactly the same paradox that showed how high the
probability of a collision was in rsync for large files. You are not
searching for a collision with a particular block, just between any two
blocks.

  I have seen several variants for the rollsum;
 
  rsync: uses a basic fletcher.

 Except that strictly speaking, Fletcher uses one's-complement arithmetic.
[...]

Ahh, someone that actually knows :-) I wasn't sure of the exact definitions
and names.

 Quoth RFC 1950:
 That 65521 is prime is important to avoid a possible large class of
 two-byte errors that leave the check unchanged.
 Not much of a difference in adversarial setting, though.

And not much of a difference I think for rsync in general... I wonder just
how large the "possible large class" is?

  librsync's offset probably does nothing, but I have a feeling it might
  help a bit for blocksizes less than 256 bytes, because it at least helps
  use more of the checksum domain for the s1 part. Without it you don't
  end up using the high bits of s1 at all for small blocks.

 It doesn't matter. You just shift the small range of possible values
 -- instead of using a small range of small numbers you'll use a small
 range of large numbers.
[...]

Yeah... that makes sense...

  It might be possible to craft a mapping like xdelta's but using a
  special sequence calculated for best effect, much like the polynomials
  in crc32 and crc64.

 Possibly, for natural or random data (as opposed to adversarial). But
 in such contexts, random choice is often close to optimal with high
 probability.

I'll take your word for it :-). Actually I think the polynomials in crc are
crafted for common line errors. I'm not sure you can do much crafting
without making certain assumptions about the types of errors. In our case I
don't think we can make assumptions about the types of errors.

  Have you looked at the librsync rollsum.c implementation in CVS? It
  replaced rs_calc_weak_sum and should be a fair bit faster.

 Uhm. Usually loop unrolling is best left for the compiler, and it's
 counterproductive with some modern processors. Is rollsum.c faster than
 the old implementation with -O3?

I can't remember... there were other cleanups associated with changing to
using rollsum that may have had an impact.

The manual unrolling was taken straight from the adler32 implementation in
zlib. I come from an embedded system background where the C compilers are
often pretty basic, and manual unrolling makes a big difference.

I've been thinking of putting together tests for librsync using the check
framework so that we can easily test and time things like this.


Donovan Baarda    http://minkirri.apana.org.au/~abo/




Re: MD4 checksum_seed

2004-03-15 Thread Eran Tromer
Hi,

On 2004/03/15 03:49, Donovan Baarda wrote:
Note that, above, block hash collisions are very easy to find if you
know checksum_seed. The rolling Fletcher checksum1 is trivially
defeated. To defeat the k-bit truncated MD4 checksum2, just keep
 generating random blocks having the same checksum1 until you find two with
the same checksum2; by the birthday paradox it will take about 2^(k/2)
attempts, where usually k=16 or k=24 with J.W. Schultz's code.
 
 I haven't really thought much about carefully crafted data to bust
 things. I didn't think the Fletcher checksum was that bad. Although I
 can see that it would be fairly trivial to craft data for collisions,
 wouldn't it be harder to craft data for a Fletcher collision _and_
 attempt to produce a checksum2 collision? 

For blocks larger than about 270 bytes, the strength added by the
Fletcher sum is close to zero -- you can obtain any Fletcher sum of your
choice, with plenty of easily manipulated degrees of freedom remaining.
Smaller blocks are a bit messier (e.g., for blocksize=256 you can't
obtain all possibilities of the lower 16 bits of the Fletcher sum), but
still doable. So you can sample sufficiently random blocks all having
the same Fletcher sum (say, zero), and then the birthday paradox applies
to the strong sum without any special trickery.

Just to be sure, I wrote a quick-and-dirty collision search code for the
rsync and librsync hashes. I used a generic collision-finding algorithm
(namely Gabriel Nivasch's multistack variant of the generalized Rho
method) to search for a truncated MD4 collision among the blocks whose
Fletcher sum is zero. It took 13ms to find a collision for the default
parameters of rsync version2.6.0, with either checksum_seed=0
(disabled) or checksum_seed=32761 (the value fixed for batch mode). With
a 4 byte strong hash, which is the most you're likely to see with
Schultz's code, the above took 1.4 seconds. For librsync with default
parameters (8-byte strong hash), finding a collision took a couple of
hours. Here are examples of colliding pairs for each of the above cases:
  http://dl.tromer.org/misc/rsync-collisions.tgz

I should stress that if you manage to insert such a pair into someone's
file, his rsyncs will become slow and his rdiffs will silently fail.
Such poisoning is very practical in many settings -- consider website
databases and user mailboxes (yes, a pure-ASCII collision can be easily
generated with a bit of extra effort). Of course, the attacker will need
some knowledge about the file format, and some care in regard to
alignment and timing. Still, not a pretty prospect.


 The 2^(k/2) attempts assumes random attempts... how does this compare
 to crafted to bust fletcher attempts?

Very favorably. My code performs just as one would expect for a random
function.

As you mentioned, MD4 has known weaknesses [1]; I'm not sure it's easy
to exploit them in our context, but the Fletcher checksum may indeed
make it somewhat harder to do so. Anyway, I didn't take advantage of them.


 Does checksum_seed really need to be random for the checksum2 to be
 random? I know md4 is considered vulnerable compared to md5, but does it
 have a poor distribution for a fixed seed? If it does, would it make
 sense to switch to md5 rather than randomise the seed?

checksum_seed must be random for the checksum2 to be random -- simply,
because there is no other source of randomness in the hash calculation,
and if everything is deterministic then you can find collisions in
advance and put them in the data. This doesn't have anything to do with
the cryptographic strength of MD4. But if the attacks of [1] can be
adapted to our scenario then things will become even *worse* than the
above, and this is a good reason to switch to MD5.


For the purpose of evaluating the
probability of retransmission, rsync uses s2length bytes of good hash
plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really
apply.
 
 You can still use the same formula, just don't count checksum1 in the
 checksum size part of it. Or depending on how much you think it's
 worth you could count it as x<32 bits worth of checksum, not the full 32
 bits.

In adversarial setting the Fletcher checksum is worthless. So to obtain
the intended 2^-10 failure probability, rsync should increase its strong
checksum by 4 bytes.

That's for rsync in normal mode, where checksum_seed is randomized. For
fixed checksum_seed (i.e., rsync in batch mode and librsync) the failure
probability in adversarial setting is 1, as demonstrated above.


A couple of related notes:

librsync really ought to include a whole-file hash, like rsync. Maybe
also an option of retransmitting the file in case of failure (where
possible). Either way, failure must be reliably detected. Besides
increasing reliability, in many cases it would allow use of smaller
block checksums. You can't just entrust integrity checking to a
subsequent invocation of md5sum or such; re-reading the files is
expensive and not always 

Re: MD4 checksum_seed

2004-03-15 Thread Donovan Baarda
On Tue, 2004-03-16 at 10:44, Eran Tromer wrote:
 Hi,
 
 On 2004/03/15 03:49, Donovan Baarda wrote:
[...]
 Just to be sure, I wrote a quickdirty collision search code for the
 rsync and librsync hashes. I used a generic collision-finding algorithm
 (namely Gabriel Nivasch's multistack variant of the generalized Rho
 method) to search for a truncated MD4 collision among the blocks whose
 Fletcher sum is zero. It took 13ms to find a collision for the default
 parameters of rsync version2.6.0, with either checksum_seed=0
 (disabled) or checksum_seed=32761 (the value fixed for batch mode). With
 a 4 byte strong hash, which is the most you're likely to see with
 Schultz's code, the above took 1.4 seconds. For librsync with default
 parameters (8-byte strong hash), finding a collision took a couple of
 hours. Here are examples of colliding pairs for each of the above case:
   http://dl.tromer.org/misc/rsync-collisions.tgz
[...]

with an 8-byte hash that means you tested approximately 2^64/2 crafted
fletcher-busting blocks to find a collision, yeah? Did that really take
only a couple of hours? Man, computers are heaps faster now!

By my calculations, even processing them at one block every
microsecond, that would have taken over 272 years. Are you sure? You
must have got _very_ lucky, or there are some serious flaws in md4.
Perhaps there is a serious problem with how we are truncating the md4?

I just checked your librsync sample blocks... you are right. The first 8
bytes of md4sum are the same, the rollsums both evaluate to 0, and yet
the data is different.

How the hell did you do that?

I glanced at Gabriel Nivasch's stuff but I can't see how it could have
shortcut it so much... 

  The 2^(k/2) attempts assumes random attempts... how does this compare
  to crafted to bust fletcher attempts?
[...]

Just noticed you use 2^(k/2), not (2^k)/2... this must be part of the
reason :-)

I thought that without using some sort of known vulnerability the
formula would be;

  avg number of random attempts = number of possible sum values / 2

Which for a 64 bit sum would be (2^64)/2.

I guess this assumes that the sum function when applied iteratively to
itself the way Gabriel Nivasch suggests will walk through the entire
sumspace before repeating. Does the reduced search space assume some
sort of distribution of repeat loops in the sum algorithm? Wouldn't
this be highly dependent on the checksum algorithm used?

  Does checksum_seed really need to be random for the checksum2 to be
  random? I know md4 is considered vulnerable compared to md5, but does it
  have a poor distribution for a fixed seed? If it does, would it make
  sense to switch to md5 rather than randomise the seed?
 
 checksum_seed must be random for the checksum2 to be random -- simply,
 because there is no other source of randomness in the hash calculation,

I was actually thinking pseudo-random, as in evenly distributed, and
with no exploitable pattern, i.e. still requiring an exhaustive search
with no known shortcuts. 

 and if everything is deterministic then you can find collisions in
 advance and put them in the data. This doesn't have anything to do with
 the cryptographic strength of MD4. But if the attacks of [1] can be
 adapted to our scenario then things will become even *worse* than the
 above, and this is a good reason to switch to MD5.
[...]

Hmmm. yeah. The deterministic nature of a fixed seed is a bit of a
problem.

  You can still use the same formula, just don't count checksum1 in the
  checksum size part of it. Or depending on how much you think it's
  worth you could count it as x<32 bits worth of checksum, not the full 32
  bits.
 
 In adversarial setting the Fletcher checksum is worthless. So to obtain
 the intended 2^-10 failure probability, rsync should increase its strong
 checksum by 4 bytes.
 
 That's for rsync in normal mode, where checksum_seed is randomized. For
 fixed checksum_seed (i.e., rsync in batch mode and librsync) the failure
 probability in adversarial setting is 1, as demonstrated above.

scary.

 A couple of related notes:
 
 librsync really ought to include a whole-file hash, like rsync. Maybe
 also an option of retransmitting the file in case of failure (where
 possible). Either way, failure must be reliably detected. Besides
 increasing reliability, in many cases it would allow use of smaller
 block checksums. You can't just entrust integrity checking to a
 subsequent invocation of md5sum or such; re-reading the files is
 expensive and not always possible.

Yeah. I've been thinking the same for some time. Because librsync
separates the signature/delta/patch operations, retransmission is an
application level decision, but librsync should certainly warn you when
it fails.

 The use of CHAR_OFFSET is a pure waste of CPU cycles. It merely amounts
 to adding constants to s1 and s2 (blocklen*CHAR_OFFSET and
 blocklen*(blocklen+1)/2*CHAR_OFFSET, respectively). These depend only on
 the block length, which you know 
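
That claim is easy to check numerically; a throwaway sketch (for illustration
only, not code from either project):

    /* Adding CHAR_OFFSET to every byte shifts s1 by blocklen*CHAR_OFFSET
     * and s2 by blocklen*(blocklen+1)/2*CHAR_OFFSET. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define CHAR_OFFSET 31

    int main(void)
    {
        enum { LEN = 100 };
        unsigned char buf[LEN];
        uint32_t s1 = 0, s2 = 0, t1 = 0, t2 = 0;

        srand(1);
        for (int i = 0; i < LEN; i++)
            buf[i] = rand() & 0x7f;

        for (int i = 0; i < LEN; i++) {
            s1 += buf[i];               s2 += s1;   /* plain sums */
            t1 += buf[i] + CHAR_OFFSET; t2 += t1;   /* with CHAR_OFFSET */
        }
        printf("s1 shift: %u (expect %u)\n", t1 - s1,
               (unsigned)(LEN * CHAR_OFFSET));
        printf("s2 shift: %u (expect %u)\n", t2 - s2,
               (unsigned)(LEN * (LEN + 1) / 2 * CHAR_OFFSET));
        return 0;
    }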

Re: MD4 checksum_seed

2004-03-14 Thread Donovan Baarda
On Wed, 2004-03-10 at 16:43, Eran Tromer wrote:
[...]
 Note that, above, block hash collisions are very easy to find if you
 know checksum_seed. The rolling Fletcher checksum1 is trivially
 defeated. To defeat the k-bit truncated MD4 checksum2, just keep
 generating random blocks having the same checksum1 until you find two with
 the same checksum2; by the birthday paradox it will take about 2^(k/2)
 attempts, where usually k=16 or k=24 with J.W. Schultz's code.

I haven't really thought much about carefully crafted data to bust
things. I didn't think the Fletcher checksum was that bad. Although I
can see that it would be fairly trivial to craft data for collisions,
wouldn't it be harder to craft data for a Fletcher collision _and_
attempt to produce a checksum2 collision? The 2^(k/2) attempts assumes
random attempts... how does this compare to crafted to bust fletcher
attempts?

 Another note is that Donovan Baarda's formula for the probability of
 retransmission (and thus J.W. Schultz's code) assumes that the hashes
 are random. This is a reasonable assumption for the truncated MD4
 checksum2 when checksum_seed is random, but it's a pretty rotten

Does checksum_seed really need to be random for the checksum2 to be
random? I know md4 is considered vulnerable compared to md5, but does it
have a poor distribution for a fixed seed? If it does, would it make
sense to switch to md5 rather than randomise the seed?

 assumption for the Fletcher checksum1. For the purpose of evaluating the
 probability of retransmission, rsync uses s2length bytes of good hash
 plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really
 apply.

You can still use the same formula, just don't count checksum1 in the
checksum size part of it. Or depending on how much you think it's
worth you could count it as x<32 bits worth of checksum, not the full 32
bits.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/
