Hi,

On 29/06/05 20:43, Shachar Shemesh wrote:

> Ok, so maybe it will. Not as elegantly as storing a single ~60 byte
> state per file, though.

True (and my original objection), but more efficient in communication --
the overhead per delta is just due to AES block alignment.



>> Specifically, you're
>> encrypting the compressed plaintext in CBC mode, and resetting the
>> chained value to the IV whenever the sum of the last sum_span=256 bytes
>>  
>>
> Where did you get the "256"? It used to be that, but it's now supposed
> to be 8192. I may have left some reference dangling.

I got it from the default values of the key's constructor. Didn't check
if it's overridden elsewhere.


>> is divisible by sum_mod=8192 but no sooner than sum_min_dist=8192 bytes
>> after the last reset (a comment to this effect in the sourcecode would
>> have saved some people some time...).
>>
> The intention was, once AP IV proceedings were out, to publish the full
> paper on the site. Sorry about that.

Sure, but basic sourcecode comments are still a good idea. I'm probably
not the last one looking at the source before reading the paper (or
without being aware that it exists, or skipping it because it's several
years old so it must be out of date).


>> So the expected chunk size (i.e., runs between IV resets) is 16K,
>>
> How did you figure that? I would love to see your mathematical model.
> 
> I tried to build one, and failed. I then contacted a mathematician
> friend of mine, who said that this is a hard problem. By that, he meant
> that mathematicians around have not come to a consensus regarding how a
> model for this problem should look like. If you have, indeed, built a
> model, you may have a publishable paper on your hands.
> 
> I ended up doing a "Monte Carlo", which merely means I measured
> more-or-less the behavior over random data. If memory serves me right,
> the change from 256 span to 8192 was specifically BECAUSE that made the
> expected length to be about 12KB, but bumped the variance to over 4KB.
> In other words, it should be very difficult to guess where an IV reset
> happened without further information.


Oh, I just made a few rough approximations:

At first I just pretended that each new byte completely randomizes the
sum of the running window (ptbuf_sum), so the probability of the sum
being 0 modulo 8192 is 1/8192, and thus the expected time until that
happens is 8192 bytes (plus the first 8192==min_sum_dist where this
isn't tested).

Now, of course, the first assumption above is grossly false -- moving
the rolling window by one byte can change its sum by at most 255, and
moreover it's biased towards smaller changes since each each step
changes the sum by M in -256,...,256 with probability (256-|M|)/256^2.
This means the above was an underestimate (just consider the extreme
cases; if the steps were really small, when the sum started far away
from zero it would take ages to get there), but by how much? This
depends on how quickly the sum tends to rotate throughout the 8192
value. Well, we have a random walk modulo 8192 with the expected step
size being ~256/3=~85, so it will wrap around very roughly every
(8192/85)^2=9288 bytes. That's on the same order of magnitude as the
naive 8192, so you'll get some increase in expected time but not a
dramatic one. Intuitively, this is where the 12KB you mentioned comes
from, and the high variance is because things heavily depend on where's
the sum modulo 8192 when it comes out of the first (untested) 8192 bytes
-- if it's close to zero then it will tend to hit zero much more quickly
then if it's on the opposite side of the mod 8192 circle.

But then, there's also a reverse effect. What's the sum modulo 8192 of
256 random bytes? Its expected value is (256*127.5)%8192=-128, and its
distribution looks like the result of a 256-step random walk with each
step having expected absolute size ~64, so its expected absolute
distance from -128 is very roughly sqrt(256)*64=1024. Thus, when we
start tracking the evolution of the sum by the above analysis, it
actually tends to start close to zero rather than at a random value, and
then for a while the slow random walk of the previous paragraph actually
plays in our favor since it means the sum will stick around zero for
some time as it evolves, giving it more opportunities to actually hit
zero.

I just guestimated that the two effects roughly cancel out.


>> meaning a typical change of N bytes in the compressed plaintext will
>> change roughly N+8K bytes in the ciphertext.
>>
> It's actually more than that. The changes propagate backwards in the
> file, as a side effect of the compression.

Ah. By how much?


>> That's on top of a
>> comparable overhead from rsyncable gzip. Reasonable for many workloads,
>> I presume, though it would be nasty for a database undergoing many tiny
>> updates.
>>
> That's why these are tweakable parameters :-).

Assuming the tweaker can understand the security implications?



>> Suppose the adversary can inject some data into the stream (via a web
>> form or whatever). Then he'll craft a plaintext sequence a consisting of
>> 4097 bytes that sum to 0 modulo RSYNC_WIN=4096 (this forces 'gzip
>> -rsyncable' to a known state) followed by some data whose gzip
>> compression has sum 0 modulo 8192 over a rolling 256-byte window exactly
>> once and no sooner than 8192 from the beginning (this forces the tweaked
>> CBC mode to revert to its IV). Now, whenever this magic sequence is
>> injected, the whole compression+encryption process is reset to a state
>> that is fully determined by the IV, meaning (for example) the next block
>> can be thought of as encrypted under ECB, with all the obvious attacks
>> on that. Now, whether the adversary can inject the magic sequence into
>> the stream just before interesting business data -- that really depends
>> on the circumstances, but is far from unthinkable.
>>
>> To give one marginally realistic scenario for the above, suppose your
>> customer mailing list is stored in a sorted plain text file. Now, I want
>> to find out if you're dealing with [EMAIL PROTECTED] I thus subscribe
>> "devik${MAGICSTRING}pad" to your list, so it would appear just before
>> "[EMAIL PROTECTED]" in the plaintext stream. I also inject
>> "[EMAIL PROTECTED]" into an arbitrary location in the
>> plaintext stream. Then, I look at the next backup and check if the
>> resulting ciphertext blocks are identical. I don't even need to look at
>> more than one backup to do that.
>>
> Interesting attack. You would have to have a security hole in the actual
> database for that to work, of course. You assume "\n" is the record
> separator, but that if you place "\n" inside your own record, it will
> not be escaped when entered into the database. What your attack did, in
> effect, was to inject "[EMAIL PROTECTED]" into the database.

No, I inject just "devik${MAGICSTRING}pad" into the customer textfile,
and let the final "\n" be appended automatically. The other string,
"[EMAIL PROTECTED]", can be injected *anywhere*; say,
into someone's mailbox.


>> I would have replaced
>>  rsyncrypto -r plain_dir cipher_dir keyfile key
>> with
>>  tar cf - plain_dir | rsyncrypto - cipher_file keyfile key
>> (which also gets you the various nifty features of tar).
>>  
> True, but that mode is not very useful to customers. The main problem
> with it is that it does not allow selective restore of information. I am
> working on a mode in which the file names are garbled upon compression,
> and the garbling index is stored in a file (that gets encrypted, just
> like everything else....).

OK...
Does the stdin mode work, anyway? And stdout? For encryption? For
decryption? The docs are a bit confusing on that.


>> Better yet, using rexecsync [1] to completely avoid temporary files, I'd
>> like to do
>>
>> [EMAIL PROTECTED] rexecsync 'ssh [EMAIL PROTECTED]' \
>>           'tar clf - / | rsyncrypto - - keyfile key' \
>>           /backup/fiasco/today
>>
>> (or its secure sudo-based equivalent). Is that a pipe dream?
>>
> Haven't known rexecsync. 

Hey, the above is essentially what Fiasco has been using for a while
(with sudo instead of root but without rsyncrypto). Prior to that we
used rsync patched to read from stdin (but the patch was rejected from
rsync mainline).


> I may even package it for Debian......

Be my guest. One major missing feature (now mentioned in a comment) is
integrity checking on the whole file: rdiff only does block-level
hashing with a (by default) 64-bit hash, which is not quite enough.

  Eran

=================================================================
To unsubscribe, send mail to [EMAIL PROTECTED] with
the word "unsubscribe" in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]

Reply via email to