Re: [librsync-users] MD4 second-preimage attack

2006-03-01 Thread Donovan Baarda
On Tue, 2006-02-21 at 14:58 -0800, [EMAIL PROTECTED] wrote:
 Hi,
 
 A year ago we discussed the strength of the MD4 hash used by rsync and
 librsync, and one of the points mentioned was that only collision
 attacks are known on MD4. Well, a recent paper by Wang et al [1] shows
 several second-preimage attacks. First, there's an algorithm which,
 given a random message and its MD4 hash, finds another message having
 the same hash with probability 2^-56. Second, if the attacker can also
 alter the original message (and thus its hash) slightly, he can find a
 second message having the same hash with just 2^27 MD4 invocations.

For the record, here it is in the SF mailarchive.

http://sourceforge.net/mailarchive/forum.php?forum_id=29760&max_rows=25&style=flat&viewmonth=200404&viewday=5

If I understand correctly, provided we add random seeds for blocksums,
these weaknesses would only make attack case 4) easier.

 Doubtless, even stronger attacks will soon be found.
 
 MD4 (with known seed) is thus completely broken, making rsync batch mode
 and librsync unsafe to use in malicious environments. Please do consider
 phasing out MD4.

Remember we only use a small part of the md4sum for the strong
blocksum, and hence are already using less than the full md4sum. We
don't really need that many bits of hash for the blocksum. The important
thing for the blocksum is speed. I think md4 + random seed gives pretty
much the best useful-hash-bits-per-second for blocksums. If we want, we
can adjust the blocksum size to use more bits to compensate for the fact
that ~20% of the bits are insecure.

If someone can provide detailed evidence that some other algorithm
provides more useful-hash-bits/sec, taking into account the random seed,
I'll be convinced :-)

In any case, the first thing that will be tackled is changing to use the
libmd/openssl API for hashes, which will make switching what hash we
want to use painless... a configure option?

 The fastest hash function with no interesting known attacks is SHA-256,
 which is still somewhat expensive (though this can be partially
 addressed by the meta-hash idea discussed on the librsync list last
 July, Re: more info on 25gig files). SHA-1 may also be OK for a while
 despite the known collision attacks, and has acceptable speed.

The whole file checksum is the real protection against attacks, and for
this I think we should use something strong, which means enough useful
hash bits to protect against attack. Remember we only calculate this
once over the whole file, not once for every potential block match.

 Also, as discussed in detail earlier: to thwart some attacks, rsync
 batch mode and librsync should be fixed to use a random seed.
[...]

There are a few things I'm proposing to add to the TODO;

1) Add whole filesum checksums in the signature and delta. The signature
will have the oldfile filesum, the delta will have the oldfile and
newfile filesums. Applications will optionally be able to verify the
oldfile filesum against the delta before applying patches, and patch
will evaluate the newfile filesum and return an error if it doesn't
match.

2) Add a blocksum/filesum seed in the signature and delta. This will be
used as the seed for both filesums and blocksums. It will default to
random when generating signatures, but can optionally be specified on
the command-line. The delta and patch operations will use the seed from
the corresponding signature.

3) Add variable blocksum size in the signature. An optimal blocksum size
can be calculated from the filesize, blocksize, and failure probability.
The blocksize can be specified on the commandline, but will default to
sqrt(filesize). The blocksum size will be calculated the same as rsync.
Note that for this to work with streams, the truncated md4 blocksums
will not be emitted when generating the signature until the oldfile
input is completely read (to get the filesize). This ties in with
feature 4)

4) Add blocksum collision detection when generating the signature. This
will involve keeping the whole md4 blocksums and checking that the
truncated strongsum doesn't match against an existing block that has a
different whole md4sum. If a collision is detected, an error will be
returned. It is up to the application to do things like re-try with
another seed (Note you can't retry for piped input).

I'm not entirely sure about using the seed for the filesums too. It may
be sufficient and convenient to use an unseeded SHA-1 or something, and
the delta would not need to include the seed. However, it is a
no-overhead addition that should help.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: Rsync for program loading on embedded platforms

2004-06-16 Thread Donovan Baarda
On Wed, 2004-06-16 at 18:01, Greger Cronquist wrote:
 Hi again,
 
 This issue solved itself rather annoyingly since we use flash memory in 
 our newer devices (which had slipped my mind). So we must erase the 
 memory first and then load.

Last time I played with flash, you could erase a block at a time. The
flash erase and write times are quite long, and flash has a limited
number of write cycles, so avoiding any re-writing is a good idea.

The simple solution here is a flash-block-aligned signature and a
block-by-block compare/transfer, so that only blocks that have changed are
transferred/erased/written. There is no benefit to a byte-aligned rolling
window/checksum, unless you do some fancy partial-in-RAM updating tricks
before erasing/writing the blocks.
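
As a rough illustration of the block-aligned approach (this is not librsync
code; FLASH_BLOCK, the use of md5, and the function names are arbitrary
choices for the sketch), the host can compare per-block hashes and only send
the blocks that differ:

  import hashlib

  FLASH_BLOCK = 4096   # assumed flash erase-block size

  def block_hashes(image):
      # one hash per flash-sized block of the image
      return [hashlib.md5(image[i:i + FLASH_BLOCK]).digest()
              for i in range(0, len(image), FLASH_BLOCK)]

  def changed_blocks(old_image, new_image):
      old = block_hashes(old_image)
      new = block_hashes(new_image)
      # blocks beyond the end of the old image always need writing
      return [i for i, h in enumerate(new) if i >= len(old) or h != old[i]]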

Another complicated variant that might be useful if your data size is
significantly larger than the flash block-size would be an rsync-like
algorithm that rolls a flash-block-aligned (instead of byte aligned)
window (rsync-block) through the flash. This would only be worth it if
the window was at least 8 times the size of the flash-blocks.

If you want to experiment with these kinds of algorithms, I thoroughly
recommend pysync... it would be easy to implement and test all these
variants using it.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: Rsync for program loading on embedded platforms

2004-06-02 Thread Donovan Baarda
G'day,

From: Greger Cronquist [EMAIL PROTECTED]
[...]
 compiled binaries are often very different for only minor source
 changes. It would be worth analysing your data to see if there are more
[...]
 Is that really true---that the binaries differ that much? Isn't that
 mostly due to relocating the different functions to other areas of the
 binary? Which, I guess, might be hard for in-place rsyncing. I just did
 a quick test with two different binaries using xdelta and rdiff and the
 uncompressed deltas were less than 10% of the original size (xdelta of
 course being the smallest). So I have hopes that some kind of
 diff-related (even if it means keeping the old binary on a PC which we
 do anyway for traceability reasons) might work. Depending, of course, on
 the overwrite and cpu issues.

This depends a lot on the compiler and options used. The more optimized
the compiler, the more you often find small source changes resulting in
significant binary changes. Stuff written in assembler usually doesn't
change much though... unless you change some small macro that is used all
over the place.

 The checksum size you can use is a function of the data size, block
 size, and your acceptable failure rate. There are threads where this has
 been analysed and the latest rsync incorporates dynamic checksum and
[...]
 Please correct me if I'm wrong, but don't most of these threads deal
 with finding the optimal block size depending on the file size? For an

No, there are threads where the optimal blocksum size is discussed.

 embedded system we might well use small blocks just to be able to use,
 say, a 16 bit checksum that is much faster to compute. Or even 8 bit
 checksums for that matter. If I understand it correctly, rsync only ever
 uses one (well two if counting both the rolling and the md4 checksums)
 checksum implementation varying the block size it's calculated on.

There is a relationship between the block size, file size, and blocksum size
that can be used to calculate the probability of failure (ie, encountering
two blocks with the same blocksum, corrupting the result). Given an
acceptable failure rate (say, 1 in 2^10 file transfers) and file size, you
can pick a block size and thus calculate your blocksum size. eg, for a small
file;

file_len = 64K
block_len = sqrt(file_len) = 256 bytes
blocksum_bits = 10 + 2*log2(file_len) - log2(block_len) = 34 bits
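
For example, the heuristic above can be written as a small Python
calculation (a sketch only; the 10 is -log2 of the assumed acceptable
failure rate of 1 in 2^10 transfers, and blocksum_bits is a hypothetical
name):

  from math import log2, sqrt

  def blocksum_bits(file_len, block_len=None, fail_bits=10):
      if block_len is None:
          block_len = int(sqrt(file_len))       # default block size heuristic
      return fail_bits + 2 * log2(file_len) - log2(block_len)

  print(blocksum_bits(64 * 1024, 256))          # -> 34.0, as in the example above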

The 32 bit rollsum alone is not enough to give a better than 1 in 2^10
chance of failure... you need at least two more bits of checksum, which you
may as well round up to the nearest byte. Also, the rollsum is not a very
good checksum, so relying on it to provide a good 32 bits of the checksum
might not be wise... particularly if you start taking into account
maliciously crafted blocks.

 As of yet, this is more of an academic interest spurred by the annoying
 delays in the compile-download-test loop... :-)

Ahh... remember them well :-)

I distinctly remember one particular system that attempted to minimize the
download-test loop by doing block-wise checksums... most of the time it
didn't save you much, and rsync would have been a distinct improvement.


Donovan Baarda    http://minkirri.apana.org.au/~abo/




Re: Rsync for program loading on embedded platforms

2004-06-01 Thread Donovan Baarda
On Tue, 2004-06-01 at 17:44, Greger Cronquist wrote:
 Hi,
 
 Has anyone considered using the rsync algorithm for loading programs on 
 embedded platforms? I.e. with low cpu power (say, 16-25 MHz ARM7) and 
 low memory requirements (programs are < 128 KB, and the rsync algorithm

I've been doing some work on librsync. As I come from an embedded
background, I've always been conscious of embedded requirements as I
worked on it.

 must use little ROM and RAM). Or is it just not worth the effort?

The algorithm sacrifices CPU to minimise data transfer when updating existing data.
For it to be worth it, you must already have similar data that needs
updating, and data transfer must be expensive relative to CPU.

Often with embedded systems you are loading programs from scratch, so
there is nothing to update. Even if you are updating programs,
compiled binaries are often very different for only minor source
changes. It would be worth analysing your data to see if there are more
application specific ways to minimise updates (like partitioning data
into discrete parts that can be updated independently).

 I guess you'd want to at least modify the checksum calculations and use 
 shorter and faster checksums (as the data to be transferred is smaller 
 than on workstations), and you'd need that file overwrite patch that's 
 been floating around.

The checksum size you can use is a function of the data size, block
size, and your acceptable failure rate. There are threads where this has
been analysed and the latest rsync incorporates dynamic checksum and
block sizing based on that discussion. librsync doesn't have it yet, but
it could easily be added.

I'd be interested to hear of anyone using or contemplating the rsync
algorithm on embedded systems.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]

2004-05-18 Thread Donovan Baarda
On Wed, 2004-05-19 at 06:10, Chris Shoemaker wrote:
 On Tue, May 18, 2004 at 10:06:52AM -0400, Alberto Accomazzi wrote:
  Chris Shoemaker wrote:
[...]
  that the feature is useless, but just caution people that they need to 
  understand the assumptions that this use of rsync is based upon.  Also, 
  I would suggest checking out other diff/patch tools such as rdiff-backup 
  or xdelta.
 
   I looked at these but didn't see how they could help me in my
 situation (same as what you described).  Am I missing something?

Perhaps he failed to go low enough :-)

rdiff-backup is built on librsync, which has a command line tool, rdiff.
rdiff has separate command line options for generating a signature,
calculating a delta, and applying a delta. This allows you to build your
own batch mode tools. Note that it only works on one file at a time; you
would have to build your own directory walking tools etc.
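
As a rough sketch of how the three steps fit together (assuming rdiff is on
the PATH; the file names and function names here are just placeholders), a
minimal batch-style wrapper might look like:

  import subprocess

  def make_signature(basis, sig):
      # signature of the old file, small enough to ship to the sender
      subprocess.run(["rdiff", "signature", basis, sig], check=True)

  def make_delta(sig, newfile, delta):
      # delta from the signature and the new file, computed on the sender
      subprocess.run(["rdiff", "delta", sig, newfile, delta], check=True)

  def apply_patch(basis, delta, result):
      # reconstruct the new file from the old file plus the delta
      subprocess.run(["rdiff", "patch", basis, delta, result], check=True)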

xdelta does the same thing, except it doesn't use signatures. It is much
more like an efficient generic binary diff/patch tool. This makes it
unsuitable for some applications, but its deltas are much more
efficient.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: gzip-rsyncable.diff

2004-05-03 Thread Donovan Baarda
G'day,

From: Mike McCallister [EMAIL PROTECTED]
[...]
 http://rsync.samba.org/ftp/unpacked/rsync/patches/gzip-rsyncable.diff

 Since the patch was created a while ago and is not part of the default
 gzip yet, I was wondering if rsync users on this list had any opinions
 on the patch regarding its effectiveness and/or any problems it may
 cause with gzip.  Is this patch the way to go, or am I better off trying
 out a different compression algorithm (not gzip related) altogether?

I don't know how cleanly the patch will apply to recent gzip releases, but it
should be OK as gzip itself seems to be pretty stable these days.

I do know how it works. After posting my firm opinion that it was not
possible, I was proven wrong. The idea it uses is very clever and highly
effective. It will have a negative impact on the compression because it
resets the compressor at semi-regular intervals. It will also miss some
matches that you would get on uncompressed data, as it requires a slightly
longer run of matching data to re-synchronise.

However, the compression losses and missed matches are reported to be
minimal. I'd say it is worth trying.


Donovan Baarda    http://minkirri.apana.org.au/~abo/




Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-11 Thread Donovan Baarda
G'day,
From: Wayne Davison [EMAIL PROTECTED]
 On Thu, Apr 08, 2004 at 03:50:48PM +1000, Donovan Baarda wrote:
  I think I've just realised what you were getting at; if the
  checksum_seed is based on something like the whole file md4sum, it
  becomes repeatable, but unpredictable.

 Not so.  Copy the file once, and you'd get all the data you'd need to
 create a new local file using a known-signature attack (as long as the
 input file didn't change, and that's easy to predict for many files).

I think between Eran Tromer and myself we have shown that for an md4
blocksum with 'n' bits and a file with 2^m blocks, you will have to try
2^(n-m) blocks to find a match to a known signature. For librsync's 64 bit
strong_sum, even a 4G file will need 2^43 attempts, which is sufficiently
hard. Sure, you have all the data you need, but that doesn't make it easy
:-)

Assuming no md4sum exploits are used...

 I also don't like the doubling of the I/O-cost on the sending side, so
 I don't think this is a good way to go.

I agree... a random seed gives as good or better protection without the
double parse for the signature.

However, it does mean you don't get the same signature for the same data.
Perhaps there are some other ways to make the signature repeatable without
requiring a double parse?

Is using the md4sum of the first block only as a seed secure enough? I don't
think it is. I think any non-random signature seed needs to take into
account the whole file for it to be secure, otherwise it reduces to the
birthday algorithm for crafting clashes.


Donovan Baarda    http://minkirri.apana.org.au/~abo/






Re: librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-08 Thread Donovan Baarda
G'day,

From: Eran Tromer [EMAIL PROTECTED]
[...]
  librsync needs a whole file checksum. Without it, it silently fails for
  case 1), 3), and 4).
 
  librsync could benefit from a random checksum_seed. It would need to be
  included in the signature. Without it librsync is vulnerable to cases 1)
  and 3).
 [snip]
  rsync shouldn't need a fixed seed for batch modes... just store the seed
  in the signature. using a fixed seed makes it vulnerable to 1) and 3).

 I fully agree with your analysis.
 I'll just note that in many situations, case 2 can be elevated to case 3
 simply by transferring the file twice.

Yeah... did you see my followup post about the possibility of using the
whole-file checksum as the checksum_seed for the blocksums? I think it would
be a good idea for librsync. It does require a double-parse to generate the
signature, but is otherwise quite nice.


Donovan Baarda    http://minkirri.apana.org.au/~abo/






Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-08 Thread Donovan Baarda
G'day again,

From: Eran Tromer [EMAIL PROTECTED]
[...]
  if the
  checksum_seed is based on something like the whole file md4sum, it
  becomes repeatable, but unpredictable. You can't manipulate individual
  blocks without it affecting every other blocksum, but the signature for
  the same file is always the same. Nice :-)

 Nice indeed, but the cost is enormous: you'll have to read the file
 twice. When syncing a mostly-unchanged file that's larger than the disk
 cache, that means doubling the runtime (and disk load) on the receiver's
 side. Also, it means 'rdiff signature' and equivalents won't work on

But the vast majority of the load is in the delta calculation on the
sender's side.

 arbitrary streams. Currently the receiver can tee the output of 'rdiff
 patch' into both the output file and an instance of 'rdiff signature',
 in order to save the IO of re-reading the file upon the next sync; this
 won't work anymore. (How about a built-in option for that, BTW?).

True... that is a nice feature... you are slowly turning me off the
whole-file checksum as seed :-)

  More thoughts on using the wholefile sum as the seed; at first I thought
  this would still be vulnerable to case 3). Using any single block as
  the original file and trying to find any block that matches means you
  still have birthday algorithm numbers of blocks to check (ie 2^(n/2)).
  However, each block comparison requires the recalculation of the
  target blocksum using the original blocksum as the seed, resulting
  in non-birthday algorithm number of checksum calculations (ie, 2^n).

 I'm afraid it's still vulnerable to case 3 (a pair of target and
 original files with matching blocks). For simplicity consider
 single-block files. In this case what you've done is simply to replace
 the hash function
   f(x) = truncate(MD4(x,fixed_seed))
 with the hash function
   f'(x) = truncate(MD4(x,MD4(x,fixed_seed)))

Not quite... it's f(x,y) = truncate(MD4(x,MD4(y,fixed_seed))), where x and y
are the two blocks to be compared. This means you have to re-calculate the
hash for every compare, not just once for every block.

 which happens to be the same as hashing two copies of x in sequence.
 But the birthday attack doesn't care what's the hash function; it only
 cares about its input and output sizes (which we didn't change) and that
 the function is sufficiently random.

It also assumes that the hash is done once per sample input, not once per
compare. Sure, you still only need to try 2^(n/2) blocks, but you need to
calculate 2^n hashes, and that's what really hurts.
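
A minimal sketch of that f(x,y) in Python may make the cost clearer. It
assumes MD4 via hashlib.new('md4'), which depends on the OpenSSL build, and
simplifies the seeding to prefixing bytes, which is not exactly how
rsync/librsync mix in checksum_seed:

  import hashlib

  def md4(data):
      return hashlib.new("md4", data).digest()

  def f(x, y, fixed_seed=b"\x00" * 4, sum_len=8):
      # the seed used to hash candidate block x depends on original block y,
      # so every (x, y) pair costs a fresh pair of hash calculations
      return md4(md4(fixed_seed + y) + x)[:sum_len]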


Donovan Baarda    http://minkirri.apana.org.au/~abo/






Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-07 Thread Donovan Baarda
On Thu, 2004-04-08 at 12:36, Martin Pool wrote:
 On  5 Apr 2004, Donovan Baarda [EMAIL PROTECTED] wrote:
 
  librsync needs a whole file checksum. Without it, it silently fails for
  case 1), 3), and 4).
 
 Yes, a whole-file checksum should be used with it.  Presumably
 something stronger than md4 like SHA-1.

md4 is probably good enough for most applications. I know it's not as
secure as others, but when you take into account the requirement that
the signature match as well, compromising it becomes much more
complicated.

 I think the only question is whether this should be done internally in
 librsync, or as a separate process.  I can see arguments either way.  

 In some cases you might prefer to actually store an signed signature
 using something like GPG.

Yeah... good point. The rdiff program should probably use a whole file
md4sum though.
 
  librsync could benefit from a random checksum_seed. It would need to be
  included in the signature. Without it librsync is vulnerable to cases 1)
  and 3).
 
 Random with respect to what?  I think it would be nice if repeatedly
  summing identical files gave identical signatures.  Maybe it can vary
 depending on only the input data...

The problem is that repeatability is what makes it vulnerable. If the content
fully determines the signature, the content can be crafted to produce a
particular signature. It is relatively easy to calculate two different
blocks with the same blocksum if the checksum_seed is known.

librsync could be modified to detect when two blocks had the same
blocksum in a signature, and permute the checksum seed in a
deterministic way to try repeatedly until a signature without collisions
is found. This would give repeatable signatures and protect against case
1), but not case 3). It would also allow crafted files that force m/2
recalculations of the signature.

I think I've just realised what you were getting at; if the
checksum_seed is based on something like the whole file md4sum, it
becomes repeatable, but unpredictable. You can't manipulate individual
blocks without it affecting every other blocksum, but the signature for
the same file is always the same. Nice :-)

This would fit in nicely with adding a whole-file checksum to the
signature... generate the wholefile md4sum, and use that as the starting
seed for the individual blocksums. The wholefile sum becomes the seed.
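
A two-pass sketch of that idea (not librsync's API; MD4 via
hashlib.new('md4') depends on the OpenSSL build, the "seeding" here is
simply prefixing the whole-file digest to each block, and the block and
strongsum sizes are just the librsync defaults mentioned later in this
thread):

  import hashlib

  def signature(path, block_len=2048, sum_len=8):
      # pass 1: whole-file md4sum, which doubles as the seed
      whole = hashlib.new("md4")
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(65536), b""):
              whole.update(chunk)
      seed = whole.digest()

      # pass 2: seeded, truncated blocksums
      blocksums = []
      with open(path, "rb") as f:
          for block in iter(lambda: f.read(block_len), b""):
              blocksums.append(hashlib.new("md4", seed + block).digest()[:sum_len])
      return seed, blocksums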

This doesn't allow for permuting the seed if it happens to result in
blocksum clashes. However, given that the probability of this happening
is astronomically small (when using the wholefile sum seed) and will be
caught by a whole-file checksum mismatch, do we care? It is far more
likely to get false block matches when calculating the delta (because
sums are calculated at every byte boundary, not just at block
boundaries).

More thoughts on using the wholefile sum as the seed; at first I thought
this would still be vulnerable to case 3). Using any single block as
the original file and trying to find any block that matches means you
still have birthday algorithm numbers of blocks to check (ie 2^(n/2)).
However, each block comparison requires the recalculation of the
target blocksum using the original blocksum as the seed, resulting
in non-birthday algorithm number of checksum calculations (ie, 2^n).

I've Cc'd this to the rsync list because they might get some ideas from
it.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed

2004-04-04 Thread Donovan Baarda
G'day again,

Just revisiting an old thread after some more thought. Eran and I were
discussing the vulnerability of librsync and rsync to deliberate attempts
to craft blocks with matching signatures but different content. It turns
out it's disturbingly easy. Here's a bit of context;

From: Donovan Baarda [EMAIL PROTECTED]
 From: Eran Tromer [EMAIL PROTECTED]
  Donovan Baarda wrote:
 [...]
   I thought that without using some sort of known vulnerability the
   formula would be;
 avg number of random attempts = number of possible sum values / 2
 
  Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
  Roughly speaking, after looking at N blocks you have N*(N-1)/2 unordered
  pairs of observed blocks, and each pair gives a collision with
  probability 2^-64, so you need N=~2^32.
 [...]

 Doh! Of course... it was exactly the same paradox that showed how high the
 probability of a collision was in rsync for large files. You are not
 searching for a collision with a particular block, just between any two
 blocks.

I'm not entirely sure about rsync, but apparently such an attack will
make it slow down, but it will not corrupt files. Librsync on the other
hand will (currently) silently corrupt files. Before librsync users
panic; how much do you really care that a maliciously crafted file is
corrupted? In rsync this can be used as a performance degradation
attack. However, librsync by default uses a pseudo-random checksum_seed,
which makes such an attack nearly impossible (however... the original
discussion started around using a fixed checksum_seed in batch mode).

I've done a bit more thinking on this, and there are four ways crafted
blocks can be used;

1) two crafted blocks in the original file

2) two crafted blocks in the target file

3) a crafted pair of target and original files with matching
block(s)

4) a block in the target crafted to match a block in the original

For case 1), it is possible to detect collisions at signature
generation time. Using a random checksum_seed makes it nearly
impossible. Combining the two, selecting a new random_seed when a
collision is detected, would make this pretty much impossible. Note that
a random_seed can be used for batch mode and librsync, but the seed
would need to be included in the signature.

For case 2), the crafted blocks would not match the original, and hence
would be treated as miss data that is directly copied. Currently librsync and
rsync don't do any sort of target self referencing matches, so this
cannot affect librsync or rsync. Note that there has been discussion on
adding this to librsync, and it is supported by the vdelta format.
When/if such a thing was added, similar conditions to 1) apply to the
target matching algorithm. Any such addition should probably use
xdelta style matching anyway, which compares the data, not a
strong_sum, making it immune to crafted data.

For case 3), crafting a pair of files is made nearly impossible by using
the protections for case 1). It may be possible to craft a file that has
a reduced set of possible signatures for even a random_seed, but this
would require vulnerabilities in the strong_sum and/or random seed
generator. Barring such a vulnerability, a random_seed randomises the
signature for the original, reducing this to case 4).

Case 4) is the nasty one. The protection against case 1) ensures that
the signature is randomised for a given original, but once you have
the signature, you know the random_seed, and all elements of randomness
to the rest of the algorithm are gone. However, this is not a true
birthday paradox situation; you are not trying to find any two blocks
that match, but a block that matches any block in the original. This
is nowhere near as easy. The number of attempts required is;

n ~= 2^(b'-m)

where:

  n is the number of attempts required to find a collision
  m is defined such that there are 2^m blocks in the signature
  b' is the number of useful bits in the signature (excluding rollsum)

Interestingly, the dynamic blocksum heuristic now in rsync is

  B = sqrt(s)
  b = 10 + 2*log2(s) - log2(B)

Where:

  B is the block size
  s is the file size
  32 bits of 'b' is the rollsum

which means, assuming the 32 bits of rollsum are useless and hence should
not be included in b';

  m = log2(s)/2
  b' = -22 + 2*log2(s) - log2(s)/2 = -22 + (3/2)*log2(s)

so the number of attempts is

  n = 2^(-22 + log2(s))

so the larger the file, the harder it is to compromise, because b'
grows faster than m. Note that rsync has a lower limit on b' of 16,
which corresponds to a filesize of 32M. This equates to n = 2^3
attempts, which is rather low... in the realm of on-the-fly calculation
time.

For librsync which uses a fixed b'=64, and default B=2048, this becomes;

  n = 2^(64 - log2(s/(2^11))) = 2^(75 - log2(s))

Which is 2^43 for a 4G file... not easy at all.
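
Those estimates are easy to reproduce; here is a sketch of the arithmetic
above only (it ignores rsync's lower limit of 16 on b', and the function
names are hypothetical):

  from math import log2

  def attempts_rsync(file_len):
      b_prime = -22 + 1.5 * log2(file_len)   # useful strongsum bits (rollsum excluded)
      m = log2(file_len) / 2                 # log2 of the number of blocks
      return 2 ** (b_prime - m)

  def attempts_librsync(file_len, block_len=2048, b_prime=64):
      m = log2(file_len / block_len)
      return 2 ** (b_prime - m)

  print(log2(attempts_rsync(32 * 2 ** 20)))     # -> 3.0, the 2^3 figure for a 32M file
  print(log2(attempts_librsync(4 * 2 ** 30)))   # -> 43.0, the 2^43 figure for a 4G file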

For rsync, a malicious sender could easily craft blocks that match the
signature but have different content. Big deal. A malicious sender
doesn't have

Re: sort of like a diff patch file

2004-03-28 Thread Donovan Baarda
G'day,

On Mon, 2004-03-29 at 13:37, Steve W. Ingram wrote:
 Hi there,
 
 I was wondering if there was anyway to use rsync to effectively
 create a 'diff' file?

is this a FAQ yet?

A) rdiff.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: sort of like a diff patch file

2004-03-28 Thread Donovan Baarda
On Mon, 2004-03-29 at 15:31, Greger Cronquist wrote:
 For just creating diffs, xdelta is even better (in that it creates 
 smaller diffs very quickly)

xdelta requires that you have local access to the two files you want to
diff... librsync's rdiff allows you to calculate a small signature which
you diff against, allowing you to calculate diffs without local access
to the original you are diffing against. This can be handy for many
applications.

We've lost track of the original application being enquired about... but certainly
xdelta should be better if you have local access to both files.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: MD4 checksum_seed

2004-03-16 Thread Donovan Baarda
G'day,

From: Eran Tromer [EMAIL PROTECTED]
 Donovan Baarda wrote:
[...]
  I thought that without using some sort of known vulnerability the
  formula would be;
avg number of random attempts = number of possible sum values / 2

 Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
 Roughly speaking, after looking at N blocks you have N*(N-1)/2 unordered
 pairs of observed blocks, and each pair gives a collision with
 probability 2^-64, so you need N=~2^32.
[...]

Doh! Of course... it was exactly the same paradox that showed how high the
probability of a collision was in rsync for large files. You are not
searching for a collision with a particular block, just between any two
blocks.

  I have seen several variants for the rollsum;
 
  rsync: uses a basic fletcher.

 Except that strictly speaking, Fletcher uses one's-complement arithmetic.
[...]

Ahh, someone who actually knows :-) I wasn't sure of the exact definitions
and names.

 Quoth RFC 1950:
 That 65521 is prime is important to avoid a possible large class of
 two-byte errors that leave the check unchanged.
 Not much of a difference in adversarial setting, though.

And not much of a difference I think for rsync in general... I wonder just
how large the possible large class is?

  librsync's offset probably does nothing, but I have a feeling it might
  help a bit for blocksizes less than 256 bytes, because it at least helps
  use more of the checksum domain for the s1 part. Without it you don't
  end up using the high bits of s1 at all for small blocks.

 It doesn't matter. You just shift the small range of possible values
 -- instead of using a small range of small numbers you'll use a small
 range of large numbers.
[...]

Yeah... that makes sense...

  It might be possible to craft a mapping like xdelta's but using a
  special sequence calculated for best effect, much like the polynomials
  in crc32 and crc64.

 Possibly, for natural or random data (as opposed to adversarial). But
 in such contexts, random choice is often close to optimal with high
 probability.

I'll take your word for it :-). Actually I think the polynomials in crc are
crafted for common line errors. I'm not sure you can do much crafting
without making certain assumptions about the types of errors. In our case I
don't think we can make assumptions about the types of errors.

  Have you looked at the librsync rollsum.c implementation in CVS? It
  replaced rs_calc_weak_sum and should be a fair bit faster.

 Uhm. Usually loop unrolling is best left for the compiler, and it's
 counterproductive with some modern processors. Is rollsum.c faster than
 the old implementation with -O3?

I can't remember... there were other cleanups associated with changing to
using rollsum that may have had an impact.

The manual unrolling was taken straight from the adler32 implementation in
zlib. I come from an embedded system background where the C compilers are
often pretty basic, and manual unrolling makes a big difference.

I've been thinking of putting together tests for librsync using the check
framework so that we can easily test and time things like this.


Donovan Baarda    http://minkirri.apana.org.au/~abo/




Re: MD4 checksum_seed

2004-03-15 Thread Donovan Baarda
On Tue, 2004-03-16 at 10:44, Eran Tromer wrote:
 Hi,
 
 On 2004/03/15 03:49, Donovan Baarda wrote:
[...]
 Just to be sure, I wrote a quick-and-dirty collision search code for the
 rsync and librsync hashes. I used a generic collision-finding algorithm
 (namely Gabriel Nivasch's multistack variant of the generalized Rho
 method) to search for a truncated MD4 collision among the blocks whose
 Fletcher sum is zero. It took 13ms to find a collision for the default
 parameters of rsync version 2.6.0, with either checksum_seed=0
 (disabled) or checksum_seed=32761 (the value fixed for batch mode). With
 a 4 byte strong hash, which is the most you're likely to see with
 Schultz's code, the above took 1.4 seconds. For librsync with default
 parameters (8-byte strong hash), finding a collision took a couple of
 hours. Here are examples of colliding pairs for each of the above case:
   http://dl.tromer.org/misc/rsync-collisions.tgz
[...]

with an 8-byte hash that means you tested approximately 2^64/2 crafted
Fletcher-busting blocks to find a collision, yeah? Did that really take
only a couple of hours? Man, computers are heaps faster now!

By my calculations, even processing them at one block every
micro-second, that would have taken over 272 years. Are you sure? You
must have got _very_ lucky, or there are some serious flaws in md4.
Perhaps there is a serious problem with how we are truncating the md4?

I just checked your librsync sample blocks... you are right. The first 8
bytes of md4sum are the same, the rollsums both evaluate to 0, and yet
the data is different.

How the hell did you do that?

I glanced at Gabriel Nivasch's stuff but I can't see how it could have
shortcut it so much... 

  The 2^(k/2) attempts assumes random attempts... how does this compare
  to crafted to bust fletcher attempts?
[...]

Just noticed you use 2^(k/2), not (2^k)/2... this must be part of the
reason :-)

I thought that without using some sort of known vulnerability the
formula would be;

  avg number of random attempts = number of possible sum values / 2

Which for a 64 bit sum would be (2^64)/2.

I guess this assumes that the sum function when applied iteratively to
itself the way Gabriel Nivasch suggests will walk through the entire
sumspace before repeating. Does the reduced search space assume some
sort of distribution of repeat loops in the sum algorithm? Wouldn't
this be highly dependent on the checksum algorithm used?

  Does checksum_seed really need to be random for the checksum2 to be
  random? I know md4 is considered vulnerable compared to md5, but does it
  have a poor distribution for a fixed seed? If it does, would it make
  sense to switch to md5 rather than randomise the seed?
 
 checksum_seed must be random for the checksum2 to be random -- simply,
 because there is no other source of randomness in the hash calculation,

I was actually thinking pseudo-random, as in evenly distributed, and
with no exploitable pattern, i.e. still requiring an exhaustive search
with no known shortcuts.

 and if everything is deterministic then you can find collisions in
 advance and put them in the data. This doesn't have anything to do with
 the cryptographic strength of MD4. But if the attacks of [1] can be
 adapted to our scenario then things will become even *worse* than the
 above, and this is a good reason to switch to MD5.
[...]

Hmmm. yeah. The deterministic nature of a fixed seed is a bit of a
problem.

  You can still use the same formula, just don't count checksum1 in the
  checksum size part of it. Or, depending on how much you think it's
  worth, you could count it as <32 bits worth of checksum, not the full 32
  bits.
 
 In adversarial setting the Fletcher checksum is worthless. So to obtain
 the intended 2^-10 failure probability, rsync should increase its strong
 checksum by 4 bytes.
 
 That's for rsync in normal mode, where checksum_seed is randomized. For
 fixed checksum_seed (i.e., rsync in batch mode and librsync) the failure
 probability in adversarial setting is 1, as demonstrated above.

scary.

 A couple of related notes:
 
 librsync really ought to include a whole-file hash, like rsync. Maybe
 also an option of retransmitting the file in case of failure (where
 possible). Either way, failure must be reliably detected. Beside
 increasing reliability, in many cases it would allow use of smaller
 block checksums. You can't just entrust integrity checking to a
 subsequent invocation of md5sum or such; re-reading the files is
 expensive and not always possible.

Yeah. I've been thinking the same for some time. Because librsync
separates the signature/delta/patch operations, retransmission is an
application level decision, but librsync should certainly warn you when
it fails.

 The use of CHAR_OFFSET is a pure waste of CPU cycles. It merely amounts
 to adding constants to s1 and s2 (blocklen*CHAR_OFFSET and
 blocklen*(blocklen+1)/2*CHAR_OFFSET, respectively). These depend only on
 the block length, which you know

Re: MD4 checksum_seed

2004-03-14 Thread Donovan Baarda
On Wed, 2004-03-10 at 16:43, Eran Tromer wrote:
[...]
 Note that, above, block hash collisions are very easy to find if you
 know checksum_seed. The rolling Fletcher checksum1 is trivially
 defeated. To defeat the k-bit truncated MD4 checksum2, just keep
 generating random blocks having the same checksum1 until you find two with
 the same checksum2; by the birthday paradox it will take about 2^(k/2)
 attempts, where usually k=16 or k=24 with J.W. Schultz's code.

I haven't really thought much about carefully crafted data to bust
things. I didn't think the Fletcher checksum was that bad. Although I
can see that it would be fairly trivial to craft data for collisions,
wouldn't it be harder to craft data for a Fletcher collision _and_
attempt to produce a checksum2 collision? The 2^(k/2) attempts assumes
random attempts... how does this compare to crafted-to-bust-Fletcher
attempts?

 Another note is that Donovan Baarda's formula for the probability of
 retransmission (and thus J.W. Schultz's code) assumes that the hashes
 are random. This is a reasonable assumption for the truncated MD4
 checksum2 when checksum_seed is random, but it's a pretty rotten

Does checksum_seed really need to be random for the checksum2 to be
random? I know md4 is considered vulnerable compared to md5, but does it
have a poor distribution for a fixed seed? If it does, would it make
sense to switch to md5 rather than randomise the seed?

 assumption for the Fletcher checksum1. For the purpose of evaluating the
 probability of retransmission, rsync uses s2length bytes of good hash
 plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really
 apply.

You can still use the same formula, just don't count checksum1 in the
checksum size part of it. Or, depending on how much you think it's
worth, you could count it as <32 bits worth of checksum, not the full 32
bits.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: [patch] Add `--link-by-hash' option.

2004-02-09 Thread Donovan Baarda
On Tue, 2004-02-10 at 07:48, Jason M. Felice wrote:
 This patch adds the --link-by-hash=DIR option, which hard links received
 files in a link farm arranged by MD4 file hash.  The result is that the system
 will only store one copy of the unique contents of each file, regardless of
 the file's name.

Does this mean it also automatically detects renames?

 Anyone have an example of an MD4 collision so I can test that case? :)

How do you recover from that case?

 Patch Summary:
 
 -1   +1Makefile.in
 -0   +304  hashlink.c (new)
 -1   +21   options.c
 -0   +6proto.h
 -5   +21   receiver.c
 -0   +6rsync.c
 -0   +7rsync.h

If this does everything I think it does, then it's a surprisingly small
set of changes for what it does.
 
 __
-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: File rename detection?

2004-01-28 Thread Donovan Baarda
On Thu, 2004-01-29 at 06:27, jw schultz wrote:
 On Wed, Jan 28, 2004 at 09:06:52PM +0300, ???  wrote:
  Hello!
  
   As I found, rsync does not detect file renaming. If I just copy my
   backup.0.tgz (many Mbytes in size, having its md5) to backup.1.tgz
   (which will be equal in size and md5) it will be the same file
   in fact...
  
   Rsync will delete the old file (backup.0.tgz) on the receiving side and
   download the new one (backup.1.tgz). I do not think there is any
   difficulty in detecting this situation and following the natural way:
   just rename the file on the receiving side.
  
  Am I right?
 
 Reliably detecting renamed files in a manner that rsync can
 act on is very difficult.

I believe rsync currently sends whole-file md4sums along with the
signature. At least when using the -c option, it _could_ use the md4sums
to identify identical but moved files.

Actually implementing it is another story :-)

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: how rsync works

2003-11-26 Thread Donovan Baarda
On Tue, 2003-11-25 at 06:04, jw schultz wrote:
 On Mon, Nov 24, 2003 at 09:14:56AM -0600, John Van Essen wrote:
[...]
 I had thought so too but now i'm less sure.
 
 I put it up on October 26th -- for discussion. 
 In the next 14 days there were hits from 107 unique IPs, 73
 of which were in the first 4 days.  Then it settled down to a
 slow, declining, trickle that could have been naught but
 robots.  But no one said anything online or off.  So on the
 8th of November i took it down.  Now over two weeks after i
 took it down and four since putting it up you are the first
 person to even acknowledge it.

I had a look at it and thought it was good...

 The intent was to, if there is interest, get feedback to
 polish the document for placement on the web site.  Or to
 let someone who is perhaps a better writer take it over.
 Not to self-publish the thing.  Perhaps i had too high of
 expectations but i expected at least one or two would say
 something, even if it was just yuck or its a start
 before knowledge of it subsided into archives.

I didn't comment at the time because none of the above applied to me,
and I was slack :-(

 I'll put it back up if i hear more than just a whimper.

At the time I read it I thought; good, something I can point people at
when they ask me. I think it was a good resource.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: Synching text files.

2003-10-22 Thread Donovan Baarda
On Thu, 2003-10-23 at 05:39, Kurt Roeckx wrote:
 On Wed, Oct 22, 2003 at 11:46:22AM -0700, jw schultz wrote:
  On Wed, Oct 22, 2003 at 08:31:12PM +0200, Kurt Roeckx wrote:
   Hi,
   
   I'm trying to sync text files where lines get inserted and
   deleted.  From what I understand, rsync uses blocks to compare
   and so can't find much data that matches.
   
   Would it be possible to make it work so it finds blocks that get
   inserted or deleted?
  
  It already does.
 
 Does it only do it on block basis?  Or will it try to find an
 offset inside the block?

It will find matching blocks at arbitrary byte offsets.

Think of the original file as a sequence of fixed sized blocks.
Inserting or deleting a single byte breaks that block so it no longer
matches, but rsync will match all the blocks before and after that
non-matching block.
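
A minimal sketch of that matching idea (this is not rsync's actual
implementation: it ignores the cheap rolling weak sum that makes the
byte-by-byte slide affordable, and the block size, hash, and find_matches
name are arbitrary choices):

  import hashlib

  def find_matches(old_data, new_data, block_len=700):
      # hash every block of the old file
      table = {}
      for i in range(0, len(old_data), block_len):
          table.setdefault(hashlib.md5(old_data[i:i + block_len]).digest(), i)

      # slide a window over the new file one byte at a time
      matches, pos = [], 0
      while pos + block_len <= len(new_data):
          h = hashlib.md5(new_data[pos:pos + block_len]).digest()
          if h in table:
              matches.append((pos, table[h]))   # (offset in new, offset in old)
              pos += block_len                  # jump past the matched block
          else:
              pos += 1                          # slide one byte and try again
      return matches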

Note that one thing that can break rsync for text files is MSDOS CR/LF
vs Unix LF line termination. This effectively makes a change every line,
and unless you have lines longer than the block size you are using,
rsync will not be able to find a single match.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Pysync 2.24 release, was Re: rsync on OpenVMS

2003-10-17 Thread Donovan Baarda
On Tue, 2003-10-14 at 11:01, Donovan Baarda wrote:
 On Mon, 2003-10-13 at 13:00, John E. Malmberg wrote:
  jw schultz wrote:
   On Sun, Oct 12, 2003 at 12:38:40AM -0400, John E. Malmberg wrote:
[...]
  I have not heard of unison.  I have heard that pysync was successful in 
  a limited test on OpenVMS.  As near as I can tell though the librsync it 
  is based on is a bit out of date.
[...]
 Something possibly worth trying on it is psyco... it compiles python to
 native code on the fly using a simple import psyco. Pure python is a
 bit slow compared to native C implementations, but psyco could help
 close the gap a bit.

Following up on this... I tried using psyco with python2.2 and it cut
the pysync tests on my machine from 21secs down to 14secs... that's a
33% speedup. In the past I'd tried using pyrex to speed things up with
no success. psyco not only gives a better boost, but is much easier to
use.

 I haven't touched pysync for a while, but it should still work with the
 latest librsync as the API hasn't changed. If there are any problems,
 please let me know. I believe rdiff-backup also has a python wrapper for
 librsync that might be more advanced than the one in pysync.
 
 I have plans for both pysync and librsync, but I haven't worked on them
 much lately. I find I am mostly motivated by feedback from others when
 funding is not available :-)

This little bit of interest motivated me to have a look at it again, and
I've just released version 2.24. From its NEWS:

Updates between release 2.16 and 2.24
-
 
  * Added TODO and NEWS files.
 
  * Changed to use psyco if available, giving a 33% speedup.
   
  * Updated to use librsync 0.9.6.
   
  * Changed to using a faster md4sum implementation based on the
  librsync implementation, modified to use the RSA API.
   
  * Added rollin/rollout support to historical adler32.py.
   
  * Minor cleanups to rollsum code.
   
  * Minor tweaks to handling of block fragment matching.


-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: rsync on OpenVMS

2003-10-13 Thread Donovan Baarda
On Mon, 2003-10-13 at 13:00, John E. Malmberg wrote:
 jw schultz wrote:
  On Sun, Oct 12, 2003 at 12:38:40AM -0400, John E. Malmberg wrote:
[...]
  I do not know but if OpenVMS support is a problem for rsync
  proper you might wish to look at pysync or unison which
  might meet your immediate needs.
 
 I have not heard of unison.  I have heard that pysync was successful in 
 a limited test on OpenVMS.  As near as I can tell though the librsync it 

cool... someone playing with pysync :-)

 is based on is a bit out of date.

pysync includes a pure-python implementation that will run on anything
that can run python, as well as a basic wrapper for librsync. It also
includes extension modules for md4sum and rollsum that speed it up a
fair bit.

Something possibly worth trying on it is psyco... it compiles python to
native code on the fly using a simple import psyco. Pure python is a
bit slow compared to native C implementations, but psyco could help
close the gap a bit.

librsync is still under active development by myself and others, and
recently had a new release. 

I haven't touched pysync for a while, but it should still work with the
latest librsync as the API hasn't changed. If there are any problems,
please let me know. I believe rdiff-backup also has a python wrapper for
librsync that might be more advanced than the one in pysync.

I have plans for both pysync and librsync, but I haven't worked on them
much lately. I find I am mostly motivated by feedback from others when
funding is not available :-)

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: [librsync-devel] Re: state of the rsync nation?(revisited 6/2003 from 11/2000)

2003-08-03 Thread Donovan Baarda
On Mon, 2003-08-04 at 11:05, Martin Langhoff wrote:
 Donovan Baarda wrote:
 
  For the record, librsync version 0.9.6 is _almost_ ready in CVS. A bug
 
 has been detected but not isolated yet for large files (2G+). If it's
 not squashed soon I'm tempted to release anyway with a KNOWN_BUGS that
 reports this.

Last night I found and squashed a bug that caused the problem. I'm
waiting on feedback from the user that reported it to confirm it is now
fixed.

 Donovan,
 
 I am just looking around for hints on compiling librsync under cygwin 
 (gcc), and happened to find this post of your promising 0.9.6, which 
 runs under cygwin... 
 http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&threadm=b7jecs%242sdr%241%40FreeBSD.csie.NCTU.edu.tw&rnum=2&prev=/groups%3Fq%3Dcygwin%2Blibrsync%26hl%3Den%26lr%3D%26ie%3DUTF-8%26sa%3DN%26tab%3Dwg

I know that post was ages ago, but due to work commitments I never got a
chance to release it. Every time I got a small chance to look at it
again, I kept finding little niggly things that really should have been
fixed before release.

The CVS is only intermittently updated, so I only bother tagging on
release, which still hasn't happened yet. The CVS HEAD is the current
release candidate. I don't commit unless it passes a make distcheck so
it should never be broken. 

 Are the sources at least tagged, so I can get them from CVS? What tags?

Your timing is impeccable. Just last night I finalised the MSVC updates
and fixed the last known bug. That means there is nothing outstanding
for the release... this should be 0.9.6, but I'm giving it a day or so
to get feedback before I tag and release.

There is a delay on SF between developer and anon CVS, so it is possible
the final checkins are not yet visible. In the meantime, there is a
tarball available at;

http://minkirri.apana.org.au/~abo/projects/librsync/librsync-0.9.6.tar.gz

This compiles and builds on win32 using cygwin and/or MSVC.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: Large files and symlinks

2003-07-30 Thread Donovan Baarda
On Thu, 2003-07-31 at 06:53, jw schultz wrote:
[...]
 In many cases invoking --partial is worse than not.  If you
 are rsyncing a 4GB file and transfer is interrupted after
 500MB has been synced you get a 500MB file which now has
 less in common with the source than the 4GB file did.  

A more useful behaviour for --partial would be to concatenate the
partial download to the end of the old basis, rather than replace
it... this leaves you with a much more useful partial result to resume
from.

Of course this behaviour could be _very_ confusing to people... :-)

-- 
Donovan Baarda [EMAIL PROTECTED]



Re: Large files and symlinks

2003-07-30 Thread Donovan Baarda
On Thu, 2003-07-31 at 10:01, jw schultz wrote:
 On Thu, Jul 31, 2003 at 09:22:51AM +1000, Donovan Baarda wrote:
  On Thu, 2003-07-31 at 06:53, jw schultz wrote:
  [...]
   In many cases invoking --partial is worse than not.  If you
   are rsyncing a 4GB file and transfer is interrupted after
   500MB has been synced you get a 500MB file which now has
   less in common with the source than the 4GB file did.  
  
   A more useful behaviour for --partial would be to concatenate the
  partial download to the end of the old basis, rather than replace
  it... this leaves you with a much more useful partial result to resume
  from.
  
  Of course this behaviour could be _very_ confusing to people... :-)
 
 Interesting idea.  I don't know that it would be all that confusing.
 
 You'd have to truncate the basis to the length of the source
 to prevent it growing with each failure.  Even appending
 3.5GB to a 4GB file once is problematic.

The simplest solution is to write the partial download over the
beginning of the old file, leaving the end part as it was.

This way you are making the sensible assumption that most of the matches
from the start of the file match the partial download, and the remainder
will match the rest when you resume. A match locality heuristic?

I suspect this will be less confusing for end users too... the partial
download will have its size unchanged, and when you look at the data
you will be able to see that it synchronised up to point xxx. It will
look like a partial in-place update.

 If we were to append to the existing file it might make
 sense to append only those portions that were updates.
 That would require keeping track of the offset+length of
 each change block.  Yuck, that is much more work than it is
 worth.

Yeah... very overkill.

 One idea that i think has real merit would be to combine
 some kind of change-rate score with an evaluation of
 comparative sizes of the tempfile and the original file to
 decide if replacing the original or leaving it would be more
 efficient.  If there was no data-reuse then replacement
 would be in order.  If there was a high rate of reuse it
 wouldn't.  If the reuse was middling you would consider the
 comparative sizes.  The formula would probably be pretty
 simple.  If someone comes up with a patch that does that i'd
 be willing to entertain it.

I'm not convinced this would be worth the effort... I'm sure in many
cases the beginning of the file is where most of the changes are, so
throwing away the end on the basis of poor matches at the start is a bad
idea.

-- 
Donovan Baarda [EMAIL PROTECTED]



Re: Rolling Checksum Algorithms

2003-07-20 Thread Donovan Baarda
On Sun, Jul 20, 2003 at 01:06:11PM +0100, [EMAIL PROTECTED] wrote:
 Hi,
 
 Where can I get good pointers on the rolling checksum algorithm used in rsync? 
 I need an 8-bit or 12-bit rolling checksum too.  Any place where rolling
 checksum algorithms are discussed?

You can have a look at the rsync white paper. 

You can also look at pysync for sample implementations in Python;

http://freshmeat.net/projects/pysync/

I think I've posted some comments to this list about various variations on
the rolling checksum. The rsync whitepaper credits it as being a
variation on the Adler checksum, but I think it is closer to something I've
seen called a Fletcher checksum.

Pysync has a Python implementation of a rolling adler32 checksum, and has
comments about the variations in rsync, librsync, and xdelta.
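
For example, a minimal sketch of the rolling property itself (plain
Fletcher form, no CHAR_OFFSET, and not the exact rsync/librsync code)
looks like;

#include <stdint.h>

/* s1 is the sum of the bytes in the window, s2 the sum of the running
 * s1 values; both wrap mod 2^32. */
typedef struct { uint32_t s1, s2, n; } rollsum_t;

static void rollsum_init(rollsum_t *r, const unsigned char *buf, uint32_t n)
{
    uint32_t i;
    r->s1 = r->s2 = 0;
    r->n = n;
    for (i = 0; i < n; i++) {
        r->s1 += buf[i];
        r->s2 += r->s1;
    }
}

/* Slide the window one byte: drop 'out', add 'in'. */
static void rollsum_rotate(rollsum_t *r, unsigned char out, unsigned char in)
{
    r->s1 += in - out;
    r->s2 += r->s1 - r->n * out;
}

/* Same packing as rsync's get_checksum1: s1 low, s2 high. */
static uint32_t rollsum_digest(const rollsum_t *r)
{
    return (r->s1 & 0xffff) | (r->s2 << 16);
}

For an 8-bit or 12-bit checksum you could just mask the digest down to
that many bits; the rolling update itself doesn't change.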

-- 

Donovan Baardahttp://minkirri.apana.org.au/~abo/



Re: New wildmatch code in CVS

2003-07-09 Thread Donovan Baarda
Quoting Wayne Davison [EMAIL PROTECTED]:

 If you've been watching CVS, you may have noticed that I checked in
 some
 new files named wildmatch.c and wildmatch.h.  This code implements the
 shell-style wildcard matching with rsync's extension that ** matches
[...]
 build farm.  One thing I discovered is that the various fnmatch()
 calls
 all seem to handle the character-class boundary cases in conflicting
 ways (things like []-]).  So, one benefit for rsync of switching
 from
 fnmatch to wildmatch will be to make the callers of this function
 behave
 consistently on all platforms (which primarily affects the exclude
 code).
[...]

This might explain why Python implements its own fnmatch.py using regexes.

 Anyone have any concerns or comments on switching over to this new
 code?

Only one concern, a few questions, and a maybe suggestion;

The concern: 

Why the name wildmatch? It seems a bit too arbitrary to me. I have used the 
name efnmatch (extended fnmatch) for it in my Python implementations. The 
name wildmatch is too generic, whereas efnmatch clearly indicates it is an 
extension to the standard fnmatch. A silly concern I know, but it will make my 
life easier when I start making Python extension modules out of your code to 
use in mine :-)

Some Questions:

How did you implement it (I know, I should just look in CVS, but while I'm 
typing...)? Does it use regexes or a modified implementation of fnmatch? How 
does it compare performance-wise with a regex based implementation?

The reason I'm curious is that Python, for whatever reason, implements fnmatch 
in Python using regexes rather than as a C extension (possibly to avoid 
the fnmatch variations you identified). I'm wondering if it would be worth 
re-implementing fnmatch (and efnmatch) as C extension modules.

The maybe suggestion:

I found that by implementing efnmatch using regexes, it was painless to add the 
ability to use regexes in include/exclude lists. This meant include/exclude 
lists could be built using either efnmatch wildcards or regexes, as they would 
all be converted, compiled, and matched as regexes anyway.
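
As a rough illustration of that approach (a hypothetical sketch, not the
wildmatch or Python code; the names are made up, and pattern character
classes and some regex corner cases are glossed over);

#include <regex.h>
#include <string.h>

static void append(char *out, size_t outlen, size_t *o, const char *s)
{
    while (*s && *o + 1 < outlen)
        out[(*o)++] = *s++;
    out[*o] = '\0';
}

/* Translate an efnmatch-style pattern to an ERE: '**' crosses '/',
 * '*' and '?' do not.  Other regex specials are backslash-escaped,
 * which is fine with glibc/BSD regex. */
static void pattern_to_regex(const char *pat, char *out, size_t outlen)
{
    size_t o = 0;
    char esc[3] = { '\\', 0, 0 };
    char lit[2] = { 0, 0 };

    append(out, outlen, &o, "^");
    for (; *pat; pat++) {
        if (pat[0] == '*' && pat[1] == '*') {
            append(out, outlen, &o, ".*");
            pat++;
        } else if (*pat == '*') {
            append(out, outlen, &o, "[^/]*");
        } else if (*pat == '?') {
            append(out, outlen, &o, "[^/]");
        } else if (strchr(".^$+(){}[]|\\", *pat)) {
            esc[1] = *pat;
            append(out, outlen, &o, esc);
        } else {
            lit[0] = *pat;
            append(out, outlen, &o, lit);
        }
    }
    append(out, outlen, &o, "$");
}

/* Convert, compile and match in one go, like the Python fnmatch trick. */
int efnmatch_re(const char *pat, const char *path)
{
    char re[1024];
    regex_t preg;
    int ok;

    pattern_to_regex(pat, re, sizeof(re));
    if (regcomp(&preg, re, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&preg, path, 0, NULL, 0) == 0);
    regfree(&preg);
    return ok;
}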

I don't know how regex matching compares to fnmatch matching performance-wise. 
I'm also aware that people have expressed concerns about linking in/against 
largish regex libs. However, if the option of using regexes for 
include/excludes is ever going to happen, then it might be an idea to use them 
for this.

Personally, I feel the efnmatch functionality is flexible enough to never 
require regexes, but I've seen a few enquiries in the past...

ABO


Re: [librsync-devel] Re: state of the rsync nation? (revisited6/2003 from 11/2000)

2003-06-11 Thread Donovan Baarda
On Wed, 2003-06-11 at 16:13, Martin Pool wrote:
 On 11 Jun 2003, Donovan Baarda [EMAIL PROTECTED] wrote:
  On Wed, 2003-06-11 at 13:59, Martin Pool wrote:
[...]
   On 11 Jun 2003, Donovan Baarda [EMAIL PROTECTED] wrote:
  I forget if I saw this in Tridge's thesis, but I definitely noticed that
  librsync uses a modified zlib to make feeding data to the compressor and
  throwing away the compressed output more efficient. I have implemented
  this in pysync too, though I don't use a modified zlib... I just throw
  the compressed output away.
 
 Yes, I remember that, but that's not rzip.

Just had a look at rzip... interesting... I haven't digested its
impact on things like xdelta yet though... hence the rambling
explorations in this email.

 By the way the gzip hack is an example of a place where I think a bit
 of extra compression doesn't justify cluttering up the code.  I think
 I'd rather just compress the whole stream with plain gzip and be done.

I originally had the same thoughts... but then testing on real-world
data showed me a 5~10% improvement in compression when using it, so I
included it in pysync. Priming the compressor with match data does
make a difference when you have a lot of match data interspersed with
your miss data.

However, I agree that it is probably much less efficient than using a
compression scheme designed around the match encoding rsync already
uses.

 See http://samba.org/~tridge/phd_thesis.pdf pg 86ff
 
 rzip is about using block search algorithms to find widely-separated
 identical blocks in a file.  (I won't go into detail because tridge's
 explanation is quite clear.)
 
 I am pretty sure you could encode rzip into VCDIFF.  I am not sure if
 VCDIFF will permit an encoding as efficient as you might get from a
 format natively designed for rzip, but perhaps it will be good enough
 that using a standard format is a win anyhow.  Perhaps building a
 VCDIFF and then using bzip/gzip/lzop across the top would be
 acceptable.

I'm sure you could encode rzip into vcdiff, and you would lose very
little, and possibly gain a bit (if you add byte-extending matches
backwards, matches are never byte aligned, and vcdiff can efficiently
encode any short or long offsets _and_ lengths).

 In fact rzip has more in common with xdelta than rsync, since it works
 entirely locally and can find blocks of any length. 

Yeah, I noticed it was very similar in some ways... though xdelta is a
bit more brute force in that it builds a massive hash table for the
whole file using a small block size. rzip is a very nice way to minimise
the size of the hash table without sacrificing too much. 

Also, Tridge doesn't have byte-extending matches backwards, which xdelta
does. rzip can't extend matches backwards because the exponent-based
offset encoding can't support arbitrary byte-aligned long offsets.

Right now I'm thinking the big innovation rzip provides is not the
exponent based offset coding, but a neat way of minimising the hash
table.

 rzip's advantage compared to gzip/bzip2 is that it can use compression
 windows of unlimited size, as compared to a maximum of 900kB for
 bzip2.  Holding an entire multi-100MB file in memory and compressing
 it in a single window is feasible on commodity hardware.

I just realised you already said what I typed above :-)

  The self referencing compression idea is neat but would be a...
  challenge to implement. For it to be effective, the self-referenced
  matches would need to be non-block aligned like xdelta, which tends to
  suggest using xdelta to do the self-reference matches on top of rsync
  for the block aligned remote matches. Fortunately xdelta and rsync have
  heaps on common, so implementing both in one library would be easy (see
  pysync for an example).

After looking at rzip, I'm convinced xdelta needs to use rzip's
expiring hash table approach to minimise the hash table... xdelta is
evil on gigabyte-size files.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: questions about librsync

2003-06-11 Thread Donovan Baarda
On Thu, 2003-06-12 at 04:12, Jeff Frost wrote:
 I'm not sure if this is the appropriate forum for questions regarding 
 librsync, but couldn't find any others.
 
 I'm trying to get librsync working properly on Solaris 2.7 and 2.8 Sparc 
 servers.  The problem is that while librsync appears to compile cleanly, make 
 check fails the sources.test.  Does anyone have any insight as to why this 
 might be?  Might I need a specific version of some other library?
 
 Any help will be much appreciated.

For starters, the best place to discuss librsync is on the
librsync-devel or librsync-users list on SourceForge;

http://sourceforge.net/mail/?group_id=56125

Next, make sure you are using the CVS version of librsync from  the
librsync project on SourceForge.

If you are still having problems, I'd be interested to know... we are
trying to get librsync ready for a 0.9.6 release.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: state of the rsync nation? (revisited 6/2003 from 11/2000)

2003-06-10 Thread Donovan Baarda
On Tue, 2003-06-10 at 14:25, Martin Pool wrote:
 On  8 Jun 2003, Donovan Baarda [EMAIL PROTECTED] wrote:
 
  The next big thing in delta calculation is probably going to be the
  vcdiff encoding format, which should allow a common delta format for
  various applications and supports self-referencing delta's, which
  makes it capable of compression. According to the xdelta project this
  has already been implemented, and I'm keen to see Josh's code, as it
  could be used as the basis for a cleanup/replacement of at least the
  patch component of librsync.
 
 Do you have a link for this?  Josh plays his cards pretty close to his
 chest.  The XDelta page seems to be even more inactive than librsync
 :-/

yeah, the xdelta page on SourceForge has a few announcements about
vcdiff support that are fairly recent, but everything else hasn't been
touched for ages. It looks like Josh is only using SF as a static
front-page where he posts announcements and all the real work is
happening somewhere else...

Unfortunately I don't know where that somewhere else is... hence my
eagerness to actually see the code.

The vcdiff standard is available as RFC3284, and Josh is listed as one
of the authors. 

I also had some correspondence with Josh ages ago where he talked about
how self-referencing deltas can directly do compression of the miss
data without using things like zlib and by default gives you the
benefits of rsync's context compression without the overheads (rsync
runs a decompressor _and_ a compressor on the receiving end just to
regenerate the compressed hit context data).
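
A toy example of what self-referencing buys you, in rough vcdiff terms
(illustrative encoding only);

    target data:  "abcabcabcabc"             (12 bytes)

    delta:        ADD  3, "abc"              ; emit three literal bytes
                  COPY 9, addr 0 (target)    ; copy 9 bytes starting at
                                             ; target offset 0, overlapping
                                             ; the data being produced

ie the repeated data is encoded against the part of the new file that has
already been reconstructed, LZ77-style, so the delta itself acts as a
compressor for the miss data.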


-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: [librsync-devel] Re: state of the rsync nation? (revisited6/2003 from 11/2000)

2003-06-10 Thread Donovan Baarda
On Wed, 2003-06-11 at 13:59, Martin Pool wrote:
 On 11 Jun 2003, Donovan Baarda [EMAIL PROTECTED] wrote:
 
  The vcdiff standard is available as RFC3284, and Josh is listed as one
  of the authors. 
 
 Yes, I've just been reading that.
 
 I seem to remember that it was around as an Internet-Draft when I
 started, but it didn't seem clear that it would become standard so I
 didn't use it.

I'm not sure if this is the same one... I vaguely recall something like
this too, but I think it was an attempt to add delta support to http and
had the significant flaw of not supporting rsync's
delta-from-signature. It may have come out of the early xdelta http
proxy project. IMHO rproxy's http extensions for delta support were
better because they were more general.

There was also another thing I saw which was a compact delta
representation spec that I think librsync uses (perhaps it was you who
had some discussion about it in the old librsync TODO?), and may have
influenced the vcdiff RFC, but AFAIK was never official in any way.

  I also had some correspondence with Josh ages ago where he talked about
  how self-referencing deltas can directly do compression of the miss
  data without using things like zlib and by default gives you the
  benefits of rsync's context compression without the overheads (rsync
  runs a decompressor _and_ a compressor on the receiving end just to
  regenerate the compressed hit context data).
 
 Something possibly similar is mentioned in tridge's thesis.  I was
 talking to him a while ago and (iirc) he thought it would be good to
 try it again, since it does well with the large amounts of memory and
 CPU time that are available on modern machines.

I forget if I saw this in Tridge's thesis, but I definitely noticed that
librsync uses a modified zlib to make feeding data to the compressor and
throwing away the compressed output more efficient. I have implemented
this in pysync too, though I don't use a modified zlib... I just throw
the compressed output away.
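
In stock-zlib terms the trick looks roughly like this (a sketch only,
not the pysync or librsync code);

#include <zlib.h>

/* Feed already-known (matched) data through the compressor and throw
 * the output away, so its history window is primed before the literal
 * miss data arrives.  The caller set the stream up with deflateInit(). */
static int prime_deflate(z_stream *strm, unsigned char *buf, unsigned int len)
{
    unsigned char scratch[4096];    /* compressed output, discarded */
    int rc;

    strm->next_in = buf;
    strm->avail_in = len;
    do {
        strm->next_out = scratch;
        strm->avail_out = sizeof(scratch);
        rc = deflate(strm, Z_NO_FLUSH);
        if (rc == Z_BUF_ERROR)
            break;                  /* nothing pending, we're done */
        if (rc != Z_OK)
            return rc;
    } while (strm->avail_in > 0 || strm->avail_out == 0);
    return Z_OK;
}

The receiving end then has to run its own compressor over the same match
data to rebuild an identical dictionary, which is exactly the overhead
mentioned above.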

The self referencing compression idea is neat but would be a...
challenge to implement. For it to be effective, the self-referenced
matches would need to be non-block aligned like xdelta, which tends to
suggest using xdelta to do the self-reference matches on top of rsync
for the block aligned remote matches. Fortunately xdelta and rsync have
heaps in common, so implementing both in one library would be easy (see
pysync for an example).

If I didn't have paid work I would be prototyping it in pysync right
now. If anyone wanted to fund something like this I could make myself
available :-)

 I strongly agree with what you said a while ago about code simplicity
 being more valuable than squeezing out every last bit.

Yeah, my big complaint about librsync at the moment is it is messy. Just
cleaning up the code alone will be a big improvement. I would guess that
at least 30% of the code could be trimmed away, leaving a cleaner and
more extensible core, and because messy leads to inefficient, it
would be faster too.

-- 
Donovan Baarda [EMAIL PROTECTED]
http://minkirri.apana.org.au/~abo/



Re: state of the rsync nation? (revisited 6/2003 from 11/2000)

2003-06-07 Thread Donovan Baarda
 as a test bed for rsync 3 type development
(superlifter?).

For the future I can see continued support of the existing rsync code. I
would also like to see librsync adopt vcdiff as its delta format, and
get a major cleanup, possibly by re-using some xdelta code. There are
many common elements to the xdelta and rsync algorithms, and I see no
reason why a single library couldn't support both (as pysync does). It
would be nice if librsync and/or xdelta could become _the_ delta
library.

-- 

Donovan Baardahttp://minkirri.apana.org.au/~abo/




Re: MD4 checksum fix

2003-04-05 Thread Donovan Baarda
On Sun, 2003-04-06 at 10:06, [EMAIL PROTECTED] wrote:
   I agree, they should be done together.  I don't have my original
   patch but I can reimplement it with the correct remote_version
   dependence and send it in the next couple of days (by Thursday
   evening).  My intent is the minimal set of changes, rather than
   changing the internals.
  
  That is a workable timeframe, thanks.
 
 Two days late, but here is the patch against 2.5.6.  This implements the
 following changes:
 
  - for protocol version >= 27, mdfour_tail() is called when the block size
    (including checksum_seed) is a multiple of 64.  Previously it was not
    called, giving the wrong MD4 checksum.
 
  - for protocol version >= 27, a 64 bit bit counter is used in mdfour.c as
    required by the RFC.  Previously only a 32 bit bit counter was used,
    causing incorrect MD4 file checksums for file sizes >= 512MB - 4.

It looks like it will work OK, but it's kinda ugly in that it starts
embedding version stuff into the mdfour implementation. Still... it's
better than the nothing I've produced :-)

-- 

Donovan Baardahttp://minkirri.apana.org.au/~abo/




Re: MD4 checksum fix

2003-04-05 Thread Donovan Baarda
On Sun, 2003-04-06 at 16:29, [EMAIL PROTECTED] wrote:
  It looks like it will work OK, but it's kinda ugly in that starts
  embedding version stuff into the mdfour implementation. Still... its
  better than the nothing I've produced :-)
 
 Yes, it's certainly not elegant.  An alternative would be to
 have two different sets of MD4 routines and then have more
 logic in checksum.c to call the correct routines.  That makes
 for more changes to checksum.c but avoids a library depending
 upon a global variable.

I don't think you need two whole implementations of MD4 routines, just a
replacement _tail function and replacement _result function to use it.
Note that truncating the bit count to 32 bits can be done in the _tail
function.

It might even turn out easier to just write a _result replacement that
munges the mdfour struct (ie, truncates the byte count and empties the
tail buffer) before calling the correct _result implementation.

These can be put into the checksum.c with the appropriate protocol
version logic. That way the mdfour.c implementation becomes a fixed
stand-alone md4 implementation (that can be replaced with faster/cleaner
standard implementations further down the track).
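
Very roughly, with made-up names and struct layout (this is an
illustration of the idea only, not the real rsync/librsync mdfour code);

#include <stdint.h>

struct mdfour {
    uint32_t A, B, C, D;
    uint64_t totalN;        /* total bytes fed in */
};

void mdfour_result(struct mdfour *md, unsigned char *out); /* correct finalizer */

static void put32(unsigned char *p, uint32_t v)
{
    p[0] = v; p[1] = v >> 8; p[2] = v >> 16; p[3] = v >> 24;
}

/* emulate the pre-protocol-27 finalization */
void mdfour_result_broken(struct mdfour *md, unsigned char *out)
{
    if ((md->totalN & 63) == 0) {
        /* the old code skipped the final padding block entirely when
         * the length was a multiple of 64: just dump the state words */
        put32(out + 0, md->A);
        put32(out + 4, md->B);
        put32(out + 8, md->C);
        put32(out + 12, md->D);
        return;
    }
    /* the old code only kept a 32 bit *bit* counter, so wrap the
     * length at 2^32 bits (512MB) before finalizing properly */
    md->totalN &= (1ULL << 29) - 1;
    mdfour_result(md, out);
}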

 Or we could add a new function in mdfour.c, eg: mdfour_broken(),
 that gets called once at startup if remote_version < 27.  This
 would avoid the need for md4 to pull in remote_version.

This still puts protocol version specific code into mdfour.c, making it
harder further down the track to drop in a replacement implementation or
link against a standard md4 library implementation.

-- 

Donovan Baardahttp://minkirri.apana.org.au/~abo/




Re: MD4 checksum fix

2003-03-31 Thread Donovan Baarda
On Tue, 2003-04-01 at 09:35, jw schultz wrote:
 On Tue, Jan 28, 2003 at 11:22:14PM -0800, Craig Barratt wrote:
  And I have several things I would like to work on and submit:
  
   - Fix the MD4 block and file checksums to comply with the rfc
 (currently MD4 is wrong for blocks of size 64*n, or files
 longer than 512MB).
[...]
  But before I work on these I would like to make sure there is interest
  in including them.
 
 I've not heard from you on the adaptive checksum length patch.
 I shall be committing it shortly subject to objections or
 further discussion.

It looked OK to me from an "implementing what we discussed" point of
view... I noticed you used the "test square for each bit" algorithm
instead of the Newton-Raphson implementation for the square root. It's
certainly more complex and I don't know if it is actually faster, but it
does avoid divide ops.

I don't know enough about the rest of the code to comment on whether it
will break the protocol though.

 I would like to see the MD4 checksums fixed as well.  We are
 very close to the upper limit on protocol versions for
 deployed versions of rsync.  Therefore, i would like to
 minimize protocol increments for a while.  In any other
 circumstance i wouldn't suggest doing so but i think it
 would be a good idea to integrate these two fixes in one
 protocol bump.
 
 If you, or someone else, has the fix for the MD4 sums handy
 i would be glad to coordinate in implementing both sets of
 patches at about the same time.  If you want to send it to
 me, that would be fine.  I can also hold off for a short
 while for the MD4 sum patch to get some testing.

librsync includes a fixed and optimized md4sum implementation in CVS on
sourceforge. It should be compatible with the rsync implementation, as
that's where it originally came from and the API has not (yet) changed,
despite the implementation being fairly overhauled.

However, I also have another even more refined (20% less code, and
faster) md4sum implementation based on this, but the API has changed to
conform with the RSA implementation for inclusion in libmd. I hope to
change librsync to use this API after releasing 0.9.6. This
implementation is available here;

http://minkirri.apana.org.au/~abo/projects/libmd/

If you were going to start changing the md4sum code, I would encourage
migrating to the RSA API as it _seems_ to be more widely used (certainly
in BSD land) and would allow linking against the libmd library where it
is available.

Unfortunately, backwards compatibility probably means you will need to
include a broken implementation as well as the fixed implementation. The
best way I can think to do this is to provide an additional broken
MD4Final (or rs_mdfour_result using the old librsync API) that
implements the old behaviour. All the brokenness was in the finalizing
of the md4sum so this single alternative MDFinalBroken routine should
be all you need.

If people ask really nice and are not in a hurry, I could probably whip
such a function up to supplement the libmd API implementation. Someone
else would have to do the smarts of protocol version juggling and
supporting the changed API.

I'm not particularly interested in supporting the old rsync/librsync
mdfour API.

-- 

Donovan Baardahttp://minkirri.apana.org.au/~abo/




Re: MD4 checksum fix

2003-03-31 Thread Donovan Baarda
On Tue, 2003-04-01 at 13:34, jw schultz wrote:
 On Tue, Apr 01, 2003 at 12:41:39PM +1000, Donovan Baarda wrote:
  On Tue, 2003-04-01 at 09:35, jw schultz wrote:
[...]
 Comparative complexity is a matter of perspective.  I think
 it is a bit more deterministic.  It has been a few years

I was talking in simple LOC what-the-hell-does-this-do terms. As this is
not heavily exercised (once per file), it doesn't really matter.

  I'm not particularly interested in supporting the old rsync/librsync
  mdfour API.
 
 I can understand that.  I do think the API change and the md4
 bug fix should be done separately.  The API change is really
 internal, it is the bug fix that affects the protocol.

I'm not 100% sure how close the current mdfour librsync API is to the
rsync API. If they are the same, you should just be able to drop it in
as a replacement, and grab the broken implementation of rs_mdfour_tail
to make a new rs_mdfour_broken_result

Hmmm... if someone else was going to do all the rest of the integration
work, I _could_ do this bit. I'd rather do it for the RSA API, but it
wouldn't be too much work to change this bit over later on.
 
 All i'm offering here is to coordinate with someone fixing
 the bug or to take the fix (if it can be agreed upon) from
 someone who can encapsulate it into a patch.  And that just

I don't really have time to do this myself, and don't have the
familiarity with the wider rsync code to do it. I've been focusing on
librsync (and will continue to do so).

-- 

Donovan Baardahttp://minkirri.apana.org.au/~abo/




Re: [RFC] dynamic checksum size

2003-03-23 Thread Donovan Baarda
On Mon, Mar 24, 2003 at 12:54:26AM +1100, Donovan Baarda wrote:
 On Sun, Mar 23, 2003 at 03:46:34AM -0800, jw schultz wrote:
  On Sun, Mar 23, 2003 at 05:45:47PM +1100, Donovan Baarda wrote:
   On Sun, 2003-03-23 at 07:40, jw schultz wrote:
[...]
 The block_size heuristic is pretty arbitrary, but the blocksum_size
 calculation is not, and calculates the minimum blocksum size to achieve a
 particular probability of success for given file and block sizes. You can
 replace the block_size heuristic with anything, and the second part will
 calculate the required blocksum size.
 
  I do have a variant that scales the length divisor for block
  size from 1024 to 16538 as the length progresses from 1MB to
  128MB.  This in conjunction with your sum2 length formula
  produces this:
 
 That sounds pretty arbitrary to me... I'd prefer something like having it
 grow at the square-root of filesize... so 1K for 1M, 8K for 64M, 16K for
 256M, 32K for 1G, etc.

A thought occurred to me after writing this; a viable blocksize heuristic is
just a fixed block size. This makes the signature size almost proportional
to the filesize, except for the growth in blocksum size.

I don't necessarily advocate it though. I think increasing the blocksize is a
good idea as files grow because file and signature size also contribute to
CPU load.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--


Re: [RFC] dynamic checksum size

2003-03-23 Thread Donovan Baarda
On Mon, 2003-03-24 at 03:36, jw schultz wrote:
 On Mon, Mar 24, 2003 at 12:54:26AM +1100, Donovan Baarda wrote:
  On Sun, Mar 23, 2003 at 03:46:34AM -0800, jw schultz wrote:
[...]
 CHUNK_SIZE is used in mapping the file into memory.  I'm
 not absolutely sure (haven't spelunked that part of the code) what
 would happen if block size were to exceed it.  I suspect it
 would be OK because i don't recall hearing anyone complain
 as a result of using the --block-size option with a huge
 value.

Did the old code limit it to 16K even after people manually set it to
something larger?

 The problem with growing the block size indefinitely is that
 you must balance the cost of the checksums with the cost of
 sending large blocks for small changes.

Yeah, but you also need to keep it in perspective; if you are syncing a
4G file, do you really care much that it transmitted a 64K block instead
of a 16K block? And that's assuming that you could actually have sent a
16K block instead.

The ideal block size to minimise data transfer is based more on the
nature of the changes than the file size. However, given no further info
than the size of the file, I would guess it's a fair heuristic to assume
a large file will have large changes, which means larger blocks don't
lose you much.

 How many 260GB files do you have?   On the other hand, with
 the fixed block length we see a diminishing advantage.
 We also take a hit with the hashing, match lookups and
 memory footprint.

I think this is more of a concern than the cost of sending slightly
larger blocks.
 
   That sounds pretty arbitrary to me... I'd prefer something like having it
  grow at the square-root of filesize... so 1K for 1M, 8K for 64M, 16K for
  256M, 32K for 1G, etc.
 
 I most strongly lean toward a bounded square-root of file
 size.  It produces a nice smooth curve on resource
 consumption diminishing load/MB as the files grow.  I just

...until you hit the boundary. I dislike boundaries and step functions
because they break those nice smooth curves :-)

 hadn't a decent fast integer sqrt function at hand.  I don't
 really care to start bringing in the math lib just for this
 one calculation.  I've cobbled together one that is OK.
[...]

In case you didn't notice the count non-zero shifts is an approximate
integer log2 function. There are integer approximations for sqrt that
you can use, including Newton-Raphson;

long long isqrt(long long R) {
  long long x, x1;

  if (R < 2)
    return R;             /* avoid dividing by zero below */
  x = 1024;               /* starting guess close to what we expect */
  x = (x + R/x) >> 1;     /* one step puts us at or above the true root */
  do {
    x1 = x;
    x = (x1 + R/x1) >> 1;
  } while (x < x1);       /* stop when the estimate stops shrinking */
  return x1;
}

 Clearly we are getting into the realms of sci-fi here.  Even
 with the 16KB block length limit files in the 1-16GB range
 should be manageable on almost any system that is likely
 to have them.

If we must have an upper limit on block size, I'd prefer it to be
continuous up to the 4G mark. This means an upper limit of 64K.

 The s2length calculation is using the algorithm you provided:
 
   #define BLOCKSUM_EXP 10

Might be worth putting a comment here;

/* The following implements the function;
  
  blocksum_bits = BLOCKSUM_EXP + 2*log2(len) - log2(block_len)

which gives a probability of rsync algorithm corrupting data and falling
back to whole file transfer of 1/2^BLOCKSUM_EXP */

This is something future generations might want to tweak or perhaps
grok :-)

 b = BLOCKSUM_EXP;
 l = len;
 while (l >>= 1) {
 b += 2;
 }
 c = block_len;
 while (c >>= 1 && b) {
 b--;
 }
 s2length = (b + 1 - 32 + 7) / 8;
 s2length = MAX(s2length, 2);

Probably want to add;

  s2length = MIN(s2length, MD4SUM_SIZE);

just to be on the safe side :-)

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--



Re: [RFC] dynamic checksum size

2003-03-23 Thread Donovan Baarda
On Mon, 2003-03-24 at 16:33, jw schultz wrote:
 On Mon, Mar 24, 2003 at 10:56:42AM +1100, Donovan Baarda wrote:
  On Mon, 2003-03-24 at 03:36, jw schultz wrote:
[..]
   hadn't a decent fast integer sqrt function at hand.  I don't
   really care to start bringing in the math lib just for this
   one calculation.  I've cobbled together one that is OK.
  [...]
  
  In case you didn't notice the count non-zero shifts is an approximate
  integer log2 function. There are integer approximations for sqrt that
  you can use, including Newton-Raphson;

Actually, Newton-Raphson is not really an approximation (when compared
to count non-zero shifts), but you can turn it into one by stopping
iteration when the answer is close enough.

  long long isqrt(long long R) {
    long long x, x1;
  
    if (R < 2)
      return R;             /* avoid dividing by zero below */
    x = 1024;               /* starting guess close to what we expect */
    x = (x + R/x) >> 1;     /* one step puts us at or above the true root */
    do {
      x1 = x;
      x = (x1 + R/x1) >> 1;
    } while (x < x1);       /* stop when the estimate stops shrinking */
    return x1;
  }
 
 I'm more concerned with cycle count and cache effects than
 loop iterations.  Division, at least historically, is an
 expensive operation compared to multiplication.

The above converges surprisingly quickly... for arbitrary numbers up to
2^36 it typically converges within 10 iterations. For numbers under 2^20
it seems to converge within 5 iterations. Also, this is only done once
per file to determine the block-size to use.

If you are really concerned about divisions, there is another algorithm
that uses only multiplies, and does one iteration per bit in the
solution... for a 64bit integer it is 32 iterations to find the
solution. It's nice for implementing in assembler on micros without a
hardware division, but it's a bit messier to implement in C.
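
For reference, the usual shift-and-subtract formulation of that
bit-per-iteration approach looks something like this (a generic sketch,
not tuned code);

#include <stdint.h>

/* Integer square root, one result bit per iteration, no divisions. */
uint32_t isqrt_bits(uint64_t n)
{
    uint64_t rem = 0, root = 0;
    int i;

    for (i = 0; i < 32; i++) {
        root <<= 1;
        rem = (rem << 2) | (n >> 62);   /* pull in the next two input bits */
        n <<= 2;
        root++;
        if (root <= rem) {
            rem -= root;
            root++;
        } else {
            root--;
        }
    }
    return (uint32_t)(root >> 1);
}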

  If we must have an upper limit on block size, I'd prefer it to be
  continuous up to the 4G mark. This means an upper limit of 64K.
 
 I'm ambivalent.  We don't seem to have a problem with
 huge block sizes.  One possibility would be to allow the user to
 set a ceiling by adding an option (--max-block-size)
 
 My inclination at the moment (may change my mind) would be
 to have no upper bound and see if anyone else has a problem.
 If problems appear we could either impose a limit or add a
 --max-block-size.

I think --max-block-size would be a waste of time... people can always
specify a particular block size if they are unhappy with the heuristic.

I suspect that no-limit will prove to work well.

 On the whole i'd say we are zeroing in on something good.
 I'll see about regenerating the patches.

Yeah... I think I'll have to implement this heuristic in pysync now :-)

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--



Re: [RFC] dynamic checksum size

2003-03-22 Thread Donovan Baarda
On Sun, 2003-03-23 at 07:40, jw schultz wrote:
[...]
 The two attached patches implement per-file dynamic checksum
 lengths.

Nice one.
 
 Here is a quick table showing the file sizes at which
 transition occurs, the sum2 length and the jump in size of
 the block sum table transmitted:
 
    filesize   sum2len   transition
    < 125MB    2
      125MB    3         8KB
        1GB    4         32KB
        4GB    5         128KB
       16GB    6         523KB
       64GB    7         2MB

My reading of the patch doesn't quite match the above. The block
size heuristic is 1/10000 of the filesize, limited between some minimum
and maximum. Ignoring the limits, this means the number of blocks is
kept at a constant 10000, leaving you stuck at three bytes.

 This heuristic is a stab in the dark.  It may be that it is
 too aggressive in the rate of increase or insufficiently
 aggressive at starting the increase.  It may be that a 8
 bits of sum2 in conjunction with the 32 bit adler is enough
 for smaller files.  We may also wish to revisit the
 heuristic for block sizes that fixes the block count at 1000
 for files between 700KB and 16MB, or coordinate them so that
 the transitions would be easier.  Both heuristics can be
 changed at will without breaking compatibility.

I posted some more maths on this a while ago to the old rproxy list to
someone inquiring about how this affects librsync. In summary, a
conservative simplification of the relationship between block size,
blocksum size, and file size is;

  p = (s^2)/(B.(2^b))

where:
  
  p is the probablility of a false match (ie, corrupt result)
  s is the file size.
  B is the block size.
  b is the number of bits in the block checksum (blocksum size)

Note that blocksum size is the total blocksum including the rollsum. If
you plug in an acceptable probability of failure, you get a direct
relationship between file size, block size, and blocksum size. Assuming
a 1/2^n probability of failure is OK, you get;

B = (s^2)/(2^(b-n))

b = n + 2.log2(s) - log2(B)

Note that assuming a fixed blocksum size, the block size must grow with
the square of the file size. Assuming a fixed block size, the blocksum
must grow with the log of the filesize.
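
Just plugging numbers into that: for n=10, a 1GB file (s = 2^30) and the
current 16K block size cap (B = 2^14),

  b = 10 + 2*30 - 14 = 56 bits

Taking off the 32 bit rollsum leaves 24 bits, ie a 3 byte sum2. The
current fixed 2 byte sum2 (48 bits total) gives p = (2^60)/(2^14 * 2^48)
= 1/4 for the same file.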

I think it makes sense to grow the block size linearly with file size
(which keeps a constant number of blocks), and then have blocksum size
grow as necessary to make up the difference by default. It might also be
good to have the ability to specify either blocksum size or block size
and have the other estimated using the above equations. Of course you
will want to round up the blocksum size to the nearest byte.

Using a heuristic that aims for 2^m number of blocks and plugging this
in to the above equations gives;

B = s/(2^m)
b = n + log2(s) + m   (Note: 32bits of this is the rollsum)

Assuming 8K number of blocks (m=13), and an acceptable error rate of
less than 1 in 1024 file transfers falling back to whole file (n=10),
you get;

file size   block size   blocksum size
8M          1K           46 (sum2len=2)
32M         4K           48 (sum2len=2)
128M        16K          50 (sum2len=3)
512M        64K          52 (sum2len=3)
2G          256K         54 (sum2len=3)
8G          1M           56 (sum2len=3)
32G         4M           58 (sum2len=4)
128G        16M          60 (sum2len=4)

As you can see from this, your heuristic is very conservative for large
files, but optimistic for smaller files. Might I suggest the following
code fragments;

#define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
#define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */

if (block_size) {
  block_len = block_size;
} else {
  block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */
  block_len = MIN(block_len, BLOCK_SIZE);
  block_len = MAX(block_len, (CHUNK_SIZE / 2));
}

size_t  b = BLOCKSUM_EXP;
size_t  l = len;
while (l >>= 1) {
b+=2;
}
size_t  c = block_len;
while (c >>= 1) {
  b--;
}
sum.s2length = (b+1-32+7)/8; /* add a bit, subtract rollsum, round up */


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--



Re: [Apt-rpm] I: [PATCH] 0.5.4cnc9: rsync method support

2003-02-21 Thread Donovan Baarda
On Fri, 2003-02-21 at 18:42, Sviatoslav Sviridov wrote:
[...]
 And other thing I want to discuss: now rsync designed as standalone program,
 but it will be good if library with protocol implementation created, and build
 rsync program based on this library.

Hmmm, sounds like http://librsync.sourceforge.net/.

The current rsync implementation is tightly coupled and difficult to
split out. There have been various attempts to do this by starting from
scratch. librsync is probably the most advanced rsync algorithm
implementation, and there have been various experiments with alternative
protocols. Someone even announced a perl implementation that is rsync
compatible (impressive)!

However, librsync lags behind rsync in various ways... the most
important of which is active development.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync vs. rcp

2003-02-19 Thread Donovan Baarda
On Thu, 2003-02-20 at 05:55, va_public  wrote:
 I got used to rsync's -v --progress option so much that I used it 
 instead of rcp even to simply copy files across the network. I dont 
 like software that doesnt talk to me! :-) I like the percentage bar 
 that --progress gives!
 
 To my surprise, I found that, when dealing with 1GB+ files, rsync is 
 4-5 _times_ slower than rcp.

RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently
large block size. See the following;

http://www.mail-archive.com/rsync@lists.samba.org/msg05219.html

This probably needs to be documented somewhere newbies can find it.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync vs. rcp

2003-02-19 Thread Donovan Baarda
On Thu, 2003-02-20 at 08:55, James Knowles wrote:
  RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently
  large block size. 
 
 According to the archives, block size doesn't fix anything. At any rate, I'm 
 highly disappointed that rsync is relying on statistical good fortune.

increasing block size does increase the probability that the rsync
algorithm will succeed.
 
 We've used rsync extensively in our company for moving information around. 
 With tens of gigabytes of data at stake, it's very disappointing. Now we 
 have to dump rsync, redo all of our backups, and possibly write our own 
 tool. I don't want to futz around with something that I can no longer trust.

If you read the thread, rsync the tool is reliable, just the rsync
algorithm does not work (as currently implemented). The rsync tool will
fall back to a whole file transfer if the rsync algorithm fails, as
indicated by a full md4sum of the whole file. This does mean rsync the
tool will end up transferring the file twice when this occurs; once
using the rsync algorithm, and again sending the whole file.

As for rsync relying on statistical good fortune, that should have been
clear from the beginning. However, because of the whole file checksum
used to verify the final transfer, there is a 1/2^128 chance of it
getting it wrong per file. That is less than 1/10^36 chance, or 1 in
so-many-billions-of-billions-of-billions-that-its-not-worth-worrying-about.

So if you spend from now until eternity transferring every file that
ever existed again and again... you might get one file wrong, but not
very likely.

I think there is a higher probability of getting corruption doing a 'cp'
directly onto the same HDD (cosmic rays etc).

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync vs. rcp

2003-02-19 Thread Donovan Baarda
On Thu, 2003-02-20 at 08:53, va_public  wrote:
 --- In [EMAIL PROTECTED], Donovan Baarda [EMAIL PROTECTED] wrote:
  On Thu, 2003-02-20 at 05:55, va_public  wrote:
  
  RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a 
 sufficiently
  large block size. See the following;
  
  http://www.mail-archive.com/rsync@l.../msg05219.html
 
 OK. I read the thread. Pretty interesting design discussion on rsync 
 internals.
 
 Has any of it been implemented in 2.5.6? When is it planned to 
 implement it?
 
 So, bottom line, you are telling me, dont use rsync for my Oracle 
 database backups?
 
 Thats a bummer. Oracle database files are pretty sparse and, at a 
 block level, very few things change, so I was hoping that 
 using 'rsync' instead of 'rcp' to do my daily backups would be 
 IMMENSELY faster.

If you notice, the critical thing in the formula is the number of
_different_ bytes not really the whole file size. If you know less
than 100M of the file is different, then use the block size recommended
for a 100M file.

For oracle database files, you probably also want to use a block size
that is a multiple of whatever internal unit size oracle uses.

 As per the final messages on the thread, just increasing the csum 
 from 2 to 4 bytes seemed to miraculously solve the problem, right? So 
 why isnt this patch being included?

because it breaks backwards compatibility with older rsync versions...
it needs a protocol version change which is a bit trickier to introduce
without causing problems.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync vs. rcp

2003-02-19 Thread Donovan Baarda
On Thu, 2003-02-20 at 11:36, Craig Barratt wrote:
  RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently
  large block size. See the following;
  
  http://www.mail-archive.com/rsync@lists.samba.org/msg05219.html
 
 Let's be careful here.  Rsync *does* work on 1GB+ files.  What you
 probably meant to say was that rsync's first pass might fail on
 files that have 1GB+ of changes.  But the second pass (which uses
 the full MD4 block checksum) will work correctly.

Yeah, I was referring to the rsync algorithm as implemented in rsync,
not the rsync tool as a whole. Even then, it arguably does not work, in
the sense that it ends up transferring much more data (up to 2x) than a
plain scp does. It does end up correctly transferring the file though.
 
 So a more correct statement is that Rsync might work *slowly* on
 files with 1GB+ of changes because two passes are required.
 
 BTW, rsync already has an adaptive block size unless you set it
 on the command-line.  The block size is roughly
 
 min(16384, max(700, file_size/10000))
 
 ie: file_size/10000, but no less than 700 and no more than 16384.

I wasn't aware that it had this. Was it there at the time of the
original discussion (Oct 2002)? The people involved in the discussion
then didn't seem to know this.

However, it's not really adequate. A 16K block size only really works
for files up to about 500M. Still... that's a lot better than I thought
it was at the time.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync vs. rcp

2003-02-19 Thread Donovan Baarda
On Thu, 2003-02-20 at 13:20, jw schultz wrote:
 On Thu, Feb 20, 2003 at 01:03:16PM +1100, Donovan Baarda wrote:
  On Thu, 2003-02-20 at 11:36, Craig Barratt wrote:
RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently
large block size. See the following;
[...]
  However, it's not really adequate. A 16K block size only really works
  for files up to about 500M. Still... that's a lot better than I thought
  it was at the time.
 
 I'd be very open to increasing the upper limit on the dynamic
 block size but it should be done in balance with increasing
 (dynamically per-file) the block csum-length.

increasing the block size is kind of a hack to compensate for not being
able to increase the csum-length. If you can scale the csum-length, then
you don't need to increase the block size to ensure the rsync algorithm
works (but you still probably want to do it to keep your signature size
and processing overheads down).

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync 1tb+ each day

2003-02-05 Thread Donovan Baarda
On Thu, 2003-02-06 at 10:01, Eric Whiting wrote:
 Kenny Gorman wrote:
[...]
 I tested with a small 256M datafile. rsync -av is showing me about 200kBytes of
 changes in the datafile between each snapshot. (about 1/1000th of the file has
 actually changed between the hot backups) Rsync reports the speedup as somewhere
 between 800 and 1000 just as I would expect. This speedup is a number that
 accounts only for bytes transferred (or not transferred) -- not real time. When
 I time the rsync -av runs and compare to a rsync -W the real run times are
 approximately the same. (similar results if I rm the file on the dest and rsync
 -av). block checksum and rewrite overhead right? 
[...]
 Similar speedup numbers for some 1G database files -- but the real time is 2x
 when the destination is there and mostly correct. For this type of data it is
 'faster' from a time standpoint to remove the destination file before running
 rsync. Not what I would expect. 

You are probably seeing 2x because you are running rsync on a 1G file
without a sufficiently large block size. When you do this rsync actually
transfers the file twice; once using the rsync algorithm, and again
sending the whole file because the rsync transfer corrupted the file (as
indicated by the final checksum). There is a 99.7% chance the rsync
transfer will fail for a 1G file and the default block size.

This is because rsync doesn't use sufficiently large signature checksums
for large files, which results in corruption of the transferred file.
Fortunately rsync detects this using a final checksum, and falls back to
a whole file transfer.

To benefit (95%+ chance of success) from using rsync on a 1GB file, you
need to use a block size of 75kB or greater. For details, see;

http://www.mail-archive.com/rsync@lists.samba.org/msg05219.html
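
(That figure is just the p = (s^2)/(B.(2^b)) approximation with b = 48
bits, ie the 32 bit rollsum plus the default 2 byte sum2: for s = 1GB =
2^30 and B = 75kB ~ 2^16.2, p = 2^60/(2^16.2 * 2^48) ~ 2^-4.2, or about
a 5% chance of the first pass failing.)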

In any case, you might find that setting larger block sizes improves
performance when you have a fast network; large block sizes reduce the
CPU overhead at the expense of reduced traffic savings. You might find a
sweet spot somewhere that is faster than -W for your particular
situation. 

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: Proposal that we now create two branches - 2_5 and head

2003-01-29 Thread Donovan Baarda
On Wed, 2003-01-29 at 18:22, Craig Barratt wrote:
  I have several patches that I'm planning to check in soon (I'm waiting
  to see if we have any post-release tweaking to and/or branching to do).
  This list is off the top of my head, but I think it is complete:
 
 And I have several things I would like to work on and submit:
 
  - Fix the MD4 block and file checksums to comply with the rfc
(currently MD4 is wrong for blocks of size 64*n, or files
longer than 512MB).

I'm interested in contributing to this effort too; I have helped fix and
optimise the md4sum implementation for librsync. Once the issue of
protocol backwards compatibility is resolved, this code could be dropped
as is into rsync (up to 20% speed increase over current implementation).

However, I also want to change to using the RSA implementation API, and
submit this code upstream to libmd (it has an optimised md5sum, but only
the RSA md4sum). That way all the projects that used the RSA
implementation could benefit from a more optimised md4sum (and can
contribute bug fixes etc).

  - Adaptive first pass checksum lengths: use 3 or more bytes of the MD4
block checksum for big files (instead of 2).  This is to avoid almost
certain first pass failures on very large files.  (The block-size is
already adaptive, increasing up to 16K for large files.)
[...]
 But before I work on these I would like to make sure there is interest
 in including them.

IMHO adaptive checksum sizes is critical. People are starting to rsync
partition images and the maths shows rsync degenerates really badly as
file sizes get that big.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




RE: Proposal that we now create two branches - 2_5 and head

2003-01-29 Thread Donovan Baarda
On Thu, 2003-01-30 at 07:40, Green, Paul wrote:
 jw schultz [mailto:[EMAIL PROTECTED]] wrote:
 
 [general discussion of forthcoming patches removed]
 
  All well and good.  But the question before this thread is
  are the changes big and disruptive enough to make a second
  branch for the event of a security or other critical bug.
 
 Agreed.
[...]

After reading the arguments, I support the "delay the branch till it
absolutely must happen" approach... ie don't branch until a bugfix needs
to go into a stable version and HEAD is way too unstable to be
released with the fix as the new stable.

 Quite true.  But I'd like to make the point that I think it is worth making
 the decision to split now.  Having two branches will change attitudes.  And
 I think with as large a community of users as rsync clearly has, it is worth
 changing attitudes.  Having a production branch will remind us that we have
[...]

Actually, a bigger attitude issue for me is having a separate
rsync-devel and rsync-user lists. I have almost unsubscribed many times
because of the numerous newbie user questions; I'm only interested in
the devel stuff. I'm sure there are many users who have unsubscribed
because of the numerous long technical posts.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync protocol description

2002-12-18 Thread Donovan Baarda
On Wed, 2002-12-18 at 19:03, Christoph Bartelmus wrote:
 Donovan Baarda wrote:
   For that I'd be very much interested in a description of the protocol
   that rsync talks on port 873. Is such a description available somewhere?
   I don't want to have to deduce the protocol from the source code.
  
  The protocol changes as rsync is updated, and is very evolved
 
 Does that mean that rsync does not maintain downwards compatibility? My
 former examination did not indicate this. The startup protocol includes
 the exchange of protocol versions in use. I guess there's a reason for
 it. Otherwise I could imagine that this could be a steady cause of
 trouble.

rsync does maintain backwards compatibility... that is potentially part
of the problem. The new versions check the protocol version and will use
old protocols if required. This means rsync has to support all the old
various broken forms of the protocol. 

A small example of this is the md4sum code has a bug that makes md4sums
incorrect for large files. This does not break the file transfers in any
way because it is consistently incorrect. However, if someone fixed
this, it would mean rsync would have to include two md4sum
implementations; the correct one, and the broken one for backwards
compatibility.

if you were to implement your own rsync client, you would either have to
reject older versions of the protocol, or implement them all. In any
case, you will be constantly playing catchup with rsync. 

  re-implementing and tracking it is not really viable/desirable.
  
   What I have found so far are the description of the algorithms (rsync
   technical report and Andrew Tridgell's PhD thesis). Have I overlooked
   something obvious?
  
  http://freshmeat.net/projects/pysync/
 
 The ftp-server minkirri.apana.org.au gives me a 421 Service not
 available, remote server has closed connection.

Hmmm, that's weird. The logs show nothing wrong, and 8 transfers today
and no down time. However, when I looked I did notice that I'd screwed
up the alternative http access yesterday when I fiddled with my apache
configs.

If ftp is causing the problems, try;

http://minkirri.apana.org.au/pub/python/pysync/

It should be working OK (now :-).

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync protocol description

2002-12-17 Thread Donovan Baarda
On Wed, 2002-12-18 at 02:11, Christoph Bartelmus wrote:
 Hello,
 
 I'm currently evaluating the possibility of implementing a rsync client
 in a project for my company. The platform used is currently not
 supported and implementing the client from scratch currently seems to be
 the most feasible approach.

You do _not_ want to do this, unless you are prepared to make your
client implementation incompatible with the real rsync.

If you need real rsync compatibility, then your best bet is to port
rsync to the platform and submit your patches upstream. This is your
best chance of achieving and maintaining rsync compatibility.

If you are prepared to forgo rsync compatibility, you should look at
librsync. There are also various prototype implementations like
superlifter, pysync, etc.

 For that I'd be very much interested in a description of the protocol
 that rsync talks on port 873. Is such a description available somewhere?
 I don't want to have to deduce the protocol from the source code.

The protocol changes as rsync is updated, and is very evolved
re-implementing and tracking it is not really viable/desirable.

 What I have found so far are the description of the algorithms (rsync
 technical report and Andrew Tridgell's PhD thesis). Have I overlooked
 something obvious?

http://freshmeat.net/projects/pysync/

is a Python implementation of the algorithm and various variations like
xdelta in ~400 lines (mostly documentation)... can be useful for
understanding the algorithm or prototyping stuff.

ABO




Re: rsync to 2000/NT servers?

2002-12-17 Thread Donovan Baarda
On Tue, 2002-12-17 at 21:13, John Morgan Salomon wrote:
 
 
 Dr. Poo wrote:
 
  Now, can you think of a way to sync the win 2000 OS? (the WHOLE flippin' 
  system) so that if it were to go down one could restore the full installation 
  (bootstraps, bootloader, ect!!?) by means of the rsync'ed backup.
  please? thank you. ;-)
 
 Yeah.  Symantec Ghost.

Seriously, Ghost is probably the only tool that can do this well at the
moment. If you have used NTFS, there is bugger all that can read it, and
win2000/NT itself can't even read all of the partition it is running on.

I read a review of Ghost that complained that backups/restores actually
boot you into DOS to run. This is because win2000/NT locks certain OS
files when running. Ghost _has_ to boot into DOS and then use its own
NTFS read/write code to access the partition.

In theory you could build a similar utility from Linux, but the Linux
NTFS read/write code is a bit dodgy. I've even had older versions of
Ghost complain because it couldn't always read the NTFS partition
(something to do with NTFS internally... sometimes it does/doesn't use
some sort of weird compaction or something).

The best non-Ghost alternative is to just rsync the whole damn
partition.

ABO




Re: FTP at rsync.samba.org down?

2002-11-25 Thread Donovan Baarda
On Mon, Nov 25, 2002 at 11:44:10AM +0100, Michael Schmidt wrote:
 
 Hello,
 
 just being fascinated by rsync I wanted to look at the distribution files 
 at ftp://rsync.samba.org, but it was not possible to get a ftp connection 
 to that address.  Is the ftp service down there?
 
 Thanks in advance for any soon helpful hint.

If you want an easy way to see how rsync works, have a look at pysync...
a Python implementation in under 400 lines (mostly documentation). It also
includes implementations of related algorithms like xdelta.

http://www.freshmeat.net/projects/pysync


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: weak checksum question

2002-11-12 Thread Donovan Baarda
On Mon, Nov 11, 2002 at 08:06:40PM -0500, Jeff Abrahamson wrote:
 The weak checksum in checksum.c (see snippet below) differs
 substantially from the one discussed in Andrew Tridgell's doctoral
 thesis on rsync and elsewhere that I've been able to find. I didn't
 find discussion of the change in the mailing list archives. Well, so
 I'm curious what the benefit of the change is.
 
 (I'm using a rolling checksum in another app, thus the reason for this
 nitty inspection.)
 
 Can anyone provide a pointer to help me understand why this changed?

There are quite a few variations of the rolling checksum. All the ones I know
of are variants of a Fletcher checksum. I believe the rsync
documentation describes a straight Fletcher checksum, but also shows how it
can be efficiently rolled.

I believe rsync uses a straight Fletcher checksum. librsync adds an offset
to each byte before it is fed into the Fletcher checksum. xdelta uses a
random lookup table before each byte is fed into the Fletcher checksum.
Pysync includes an Adler32 implementation in Python, which is the same as a
Fletcher except that s1 and s2 are mod'ed by the largest 16 bit prime each
time. pysync also includes a C implementation of the librsync version,
implemented using a very fast unrolled macro idea stolen from the adler32
implementation in zlib. The C implementation in pysync is probably the
fastest one currently around.

I'm unconvinced the extra effort of Adler32's mod-prime, librsync's
offset, or xdelta's random lookup actually buys you anything. I have a
feeling that they all end up with pretty much the same probability of a
checksum clash occurring. This would suggest that a pure Fletcher checksum
is the best because it has the lowest overhead, but librsync's offset
doesn't hurt much anyway.

 In checksum.c:
 
 uint32 get_checksum1(char *buf1,int len)
 {
 int i;
 uint32 s1, s2;
 schar *buf = (schar *)buf1;
 
 s1 = s2 = 0;
 for (i = 0; i < (len-4); i+=4) {
 s2 += 4*(s1 + buf[i]) + 3*buf[i+1] + 2*buf[i+2] + buf[i+3] + 
   10*CHAR_OFFSET;
 s1 += (buf[i+0] + buf[i+1] + buf[i+2] + buf[i+3] + 4*CHAR_OFFSET); 
 }
 for (; i < len; i++) {
 s1 += (buf[i]+CHAR_OFFSET); s2 += s1;
 }
 return (s1 & 0xffff) + (s2 << 16);
 }

This elaborate confusion is an attempt to optimise the Fletcher-with-offset
variant by processing the input buffer 4 bytes at a time.
Unfortunately, this doesn't buy you much because all those multiplies end up
costing more than an unravelled series of additions. 

I posted a patch somewhere that replaced this implementation with the
rollsum.c code taken from pysync... If you want to see that implementation
have a look at pysync on Freshmeat.

 What I understood from documentation:
 
 /* Rolling checksum from Tridgell's PhD thesis. */
 int get_weak_checksum(signed char *buf, int len)
 {
 int i, s;
 int s1, s2;
 int len4;
 
 s1 = s2 = 0;
 len4 = len / 4;
 for(i = 0; i < len - 4; i += 4) {
 s = (buf[i] << 24) + (buf[i+1] << 16)
 + (buf[i+2] << 8) + (buf[i+3]);
 s1 += s;
 s2 += (len4 - 4 * i) * s;
 }
 return (s2 << 16) + (s1 & 0xffff);
 }

I think you got the documentation wrong... although maybe I remember it
wrong :-) I thought the rsync documentation described this;

{
  s1=s2=0;
  for (i=0; i < len; i++) {
s1+=buf[i];
s2+=s1;
  }
  return (s2 << 16) | (s1 & 0xffff);
} 

The Fletcher with offset variant is exactly the same, it just replaces
s1+=buf[i] with s1+=buf[i]+OFFSET. The xdelta version replaces it with
s1+=LOOKUP[buf[i]], where LOOKUP is a constant array of 256 random integers.
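
For the record, the thing that makes all these variants usable in rsync is
that they can be rolled; given s1 and s2 for one window, you can get the sums
for the next window from just the outgoing and incoming bytes. Here is a rough
Python sketch of that (my own illustration, not pysync's rollsum.c; the
CHAR_OFFSET of 31 is librsync's RS_CHAR_OFFSET, and plain Fletcher is just
CHAR_OFFSET=0):

CHAR_OFFSET = 31

def weak_checksum(data):
    s1 = s2 = 0
    for byte in bytearray(data):
        s1 += byte + CHAR_OFFSET
        s2 += s1
    return (s1 & 0xffff) | ((s2 & 0xffff) << 16), s1, s2

def roll(s1, s2, length, outgoing, incoming):
    # drop the byte leaving the window, add the byte entering it
    s1 += incoming - outgoing
    s2 += s1 - length * (outgoing + CHAR_OFFSET)
    return (s1 & 0xffff) | ((s2 & 0xffff) << 16), s1, s2

# sanity check: rolling must agree with recomputing from scratch
data = bytearray(b"the quick brown fox jumps over the lazy dog")
L = 8
sum_, s1, s2 = weak_checksum(data[0:L])
for i in range(1, len(data) - L + 1):
    sum_, s1, s2 = roll(s1, s2, L, data[i - 1], data[i + L - 1])
    assert sum_ == weak_checksum(data[i:i + L])[0]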

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: offline rsync

2002-11-07 Thread Donovan Baarda
On Thu, Nov 07, 2002 at 08:09:42PM +0100, Franco Bagnoli wrote:
 On Thu, 7 Nov 2002, Donovan Baarda wrote:
 
  rdiff, part of librsync. There is also pysync, but it's much slower (but
  easier to understand/modify). 
 
 I know I'm sloppy, but the documentation with rdiff is rather poor. Is 
 there a lengthier explanation somewhere? 

There is a manpage with rdiff that explains how to use it. On Debian systems
just install the rdiff deb. Be aware though that the 0.9.5 version of
rdiff/librsync has a bug that results in larger than necessary deltas. The
CVS for the librsync project on SF has a much better version in it.

 Moreover, since the synchronization stuff I'm looking for will end into a 
 perl script, has anybody developed an equivalent of pysync in perl? I've 
 found nothing on cpan. 

I think someone has written a perl interface to librsync... have a search of
the rproxy project list archive on SF (the librsync project started as part
of the rproxy project).


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: offline rsync

2002-11-06 Thread Donovan Baarda
On Wed, Nov 06, 2002 at 10:14:28AM -0600, Franco Bagnoli wrote:
 hello, I'm new to this list. 
 
 here is my question: 
 
 I would like to synchronize two computers (say the home one and the job
 one) using zip drives or similar (cdroms, etc), since modem lines are
 quite slow and expensive (in Italy). 
 
 I though I could  produce the signature of files on home computer, store
 it on a zip, go to job, run rsync to copy the missing or
 altered files on the disk (possibly in zipped form) and produce the new
 signature file, and repeat it once at home. 
 
 I think this could also be seen as a backup system (on cdrom or
 similia).
 
 Is this feasible with rsync? Is there a better approach? 

rdiff, part of librsync. There is also pysync, but it's much slower (but
easier to understand/modify). 

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Optimizations and other questions for rsync

2002-10-17 Thread Donovan Baarda
On Wed, Oct 16, 2002 at 04:56:52PM -0700, jw schultz wrote:
 On Wed, Oct 16, 2002 at 05:00:02PM -0400, Farid Shenassa wrote:
  1. is there any computational or disk IO difference between the rsync client
  and server (the one that does just the checksum on the block, vs the one
  that does rolling checksum).  Given that I do not have as much cpu on the
  VOS machine, I would like the more expensive side to run on the Windows
  system.  So I need to figure out who should run the daemon and who should
  push vs. pull.
 
 Once the connection is established it doesn't matter whether
 you push or pull.  The difference is who sends and who
 receives.  The receiver bears the brunt of the work.
[...]

Correction, it is the sender who bears the brunt of the work. It receives
the signature, then calculates and sends the delta. Calculating the delta is
the CPU intensive task. This is the reason many archive sites don't like
running rsync servers.


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Problem with checksum failing on large files

2002-10-14 Thread Donovan Baarda

On Mon, Oct 14, 2002 at 12:36:36AM -0700, Craig Barratt wrote:
 craig> My theory is that this is expected behavior given the check sum size.
[...]
 But it just occurred to me that checksum collisions is only part of
 the story.  Things are really a lot worse.
 
 Let's assume that the block checksums are unique (so we don't get
 tripped up from this first problem).  Let's also assume that the old
 file is completely different to the new one, ie: no blocks at any
 offset really match.  So rsync will compare the checksum at every
 byte offset in the file looking for any match.  If there are nBlocks
 blocks, each check has an nBlocks / 2^48 chance of a false match.
 
 Since this test is repeated at every byte offset, the probability
 that the file has no false matches is:
 
 p = (1 - nBlocks / (2^48)) ^ fSize
 
 where fSize is the size of the file (more precisely the exponent should
 be (fSize - 700)).
 
 Now for some numbers for 700 byte blocks:
 
   - 100MB file (104857600 bytes).  nBlocks = 149797.  p = 0.945.
 
   - 500MB file (524288000 bytes).  nBlocks = 748983.  p = 0.248.
 
   - 1000MB file (1048576000 bytes).  nBlocks = 1497966.  p = 0.003.
 
 So, on average, if you have a random new 1GB file and a random old
 1GB file, and you rsync then, the 1st phase will fail 99.7% of the
 time.
 
 Someone could test this theory: generate two random 500MB files and
 rsync them.  Try it a few times.  I claim that on average the first
 pass will fail around 75% of the time.
 
 Things get a lot better when the files are very similar.  For each
 block that matches, rsync skips the whole block (eg, 700 bytes)
 before it starts looking for matching checksums.  So for a file
 that is identical it only does nBlock checks, not fSize checks
 (700 times fewer).


Wow... scary. This can be generalised to;

   p = 1 - (1 - c/(2^b))^s
   
where:

  p is the probability of a false match
  c is the number of blocks in the signature
  b is the number of bits in the checksum
  s is the number of different bytes

In conclusion, a blocksize of 700 with the current 48bit signature blocksum
has an unacceptable failure rate (>5%) for any file larger than 100M, unless
the file being synced is almost identical.

Increasing the blocksize will help, with the following minimum sizes being
recommended for a <5% failure rate;

file    block
 100M      1K
 200M      3K
 400M     12K
 800M     48K
   1G     75K
   2G    300K
   4G    1.2M

Note that the required block size is growing faster than the file size is,
so the number of blocks in the signature is shrinking as the file grows. We
absolutely need to increase the signature checksum size as the filesize
increases.

 If my new hypothesis is correct we definitely need to increase the size
 of the first-pass checksum for files bigger than maybe 50MB.

Does the first pass signature block checksum really only use 2 bytes of the
md4sum? That seems pretty damn small to me. For 100M~1G you need at least
56bits, for 1G~10G you need 64bits. If you go above 10G you need more than
64bits, but you should probably increase the block size as well/instead.
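
Just to make those numbers reproducible, here's a rough Python sketch of the
same estimate (my own illustration, not anything from rsync; bits=48 assumes
the 32 bit rolling sum plus the 2 byte truncated md4sum, and the minimum
block sizes it prints land within rounding of the table above):

import math

def false_match_prob(filesize, blocksize, bits=48):
    nblocks = filesize // blocksize
    # ~filesize byte offsets each get checked against nblocks block sums
    return 1.0 - (1.0 - float(nblocks) / 2 ** bits) ** filesize

def min_blocksize(filesize, bits=48, target=0.05):
    # invert p ~= 1 - exp(-nblocks * filesize / 2^bits) for the block size
    return int(filesize ** 2 / (2 ** bits * -math.log(1.0 - target)))

for mbytes in (100, 200, 400, 800, 1000, 2000, 4000):
    size = mbytes * 2 ** 20
    print(mbytes, false_match_prob(size, 700), min_blocksize(size))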

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Problem with checksum failing on large files

2002-10-14 Thread Donovan Baarda

On Mon, Oct 14, 2002 at 06:22:36AM -0700, jw schultz wrote:
 On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote:
[...]
  Does the first pass signature block checksum really only use 2 bytes of the
  md4sum? That seems pretty damn small to me. For 100M~1G you need at least
  56bits, for 1G~10G you need 64bits. If you go above 10G you need more than
  64bits, but you should probably increase the block size as well/instead.
 
 It is worth remembering that increasing the block size with
 a fixed checksum size increases the likelihood of two
 unequal blocks having the same checksums.

I haven't done the maths, but I think the difference this makes is
negligible, and is far outweighed by the fact that a larger block size
means fewer blocks.

 I think we want both the block and checksum sizes to
 increase with file size.  Just increasing block size gains
 diminishing returns but just increasing checksum size will
 cause a non-linear increase in bandwidth requirement.
 Increasing both in tandem is appropriate.  Larger files call
 for larger blocks and larger blocks deserve larger
 checksums.
 
 I do think we want a ceiling on block size unless we layer
 the algorithm.  The idea of transmitting 300K because a 4K
 block in a 2GB DB file was modified is unsettling.

I think that command-line overrides are critical. Just as you can force a
blocksize, you should be able to force a sigsumsize. However the defaults
should be reasonable.

BTW, the suggestions of using -c earlier in this thread ... from my reading
of the man page, -c will only help if the files happen to be identical... 

And the reason rsync does a second pass that succeeds is it transmits the
whole file if a final checksum of the file doesn't match after using the
rsync algorithm. 

Someone else suggested using a locality limit to minimize the candidate
blocks, and hence minimize the chance of errors... I think a locality
heuristic might be good for speeding up finding a match, but a locality
limit is no good because it sacrifices the chance of finding distant
matches. part of rsync's benefits is it can handle relocated data, and a
locality limit breaks that.

 Note for rsync2 or superlifter:
   We may want to layer the algorithm so that large
   files get a first pass with large blocks but
   modified blocks are accomplished with a second pass
   using smaller blocks.
   ie. 2GB file is checked with 500KB blocks and a
   500KB block that changed is checked with 700B
   so rsyncing the file would be almost like rsyncing
   a directory with -c.

Layered algorithms with variable block sizes are tricky. At first glance it
seems like a good idea, but when you go to implement it, it gets all messy.
The big problem is that miss data is not block aligned, and to handle
relocation of data, you need to pretty much transmit a small-blocksize
signature of the whole file... by the time you do that you might as well
have just used a smaller blocksize.

Every time I think of layer/variable block tricks I think of another way
that it might work, but every time I pursue them, they hit problems...

Another one that just sprang to mind; combining it with a client-side delta
technique;

1) client sets blocksize to something large.
2) client requests signature from server for unknown parts of target file
  (ie the whole file on the first pass, large unknown blocks on subsequent passes)
3) server sends requested signature.
4) client rolls through the basis, searching for matching data, and populates
   a sparse new file with the hits.
5) client reduces blocksize (by half? by quarter?).
6) while blocksize is larger than some minimum, goto 2)
7) client requests remaining missing blocks and completes sparse new
file.

This could work... it could also have serious problems when you go to
implement it...
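
In rough Python pseudocode (purely hypothetical; none of these helpers exist
in rsync or librsync, they just stand in for the client/server exchanges and
the rolling-checksum search) the loop above would look something like:

def multipass_delta(get_signature, find_matches, get_literals,
                    filesize, min_blocksize=700):
    blocksize = max(min_blocksize, filesize // 16)  # 1) start large (factor is arbitrary)
    unknown = [(0, filesize)]                       #    the whole file is unknown at first
    hits = []
    while unknown and blocksize >= min_blocksize:   # 6) loop while blocks are big enough
        sig = get_signature(unknown, blocksize)     # 2) and 3) request/receive signature
        found, unknown = find_matches(sig, unknown) # 4) roll through basis, keep the hits
        hits.extend(found)
        blocksize //= 2                             # 5) reduce the blocksize
    literals = get_literals(unknown)                # 7) fetch whatever never matched
    return hits, literals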

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Problem with checksum failing on large files

2002-10-14 Thread Donovan Baarda

On Mon, Oct 14, 2002 at 04:50:27PM -0700, jw schultz wrote:
 On Tue, Oct 15, 2002 at 02:25:00AM +1000, Donovan Baarda wrote:
  On Mon, Oct 14, 2002 at 06:22:36AM -0700, jw schultz wrote:
   On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote:
  [...]
Does the first pass signature block checksum really only use 2 bytes of the
md4sum? That seems pretty damn small to me. For 100M~1G you need at least
56bits, for 1G~10G you need 64bits. If you go above 10G you need more than
64bits, but you should probably increase the block size as well/instead.
   
   It is worth remembering that increasing the block size with
   a fixed checksum size increases the likelihood of two
   unequal blocks having the same checksums.
  
  I haven't done the maths, but I think the difference this makes is
  negiligable, and is far outweighed by the fact that a larger block size
  means less blocks.
 
 We've just seen one face of the checksum undersize.  This
 problem is after all because we have unequal blocks that
 have the same (truncated) checksums.  That is with 700 bytes
 being compressed to a 4 byte checksum.  Increasing the block
 size without increasing the checksum size increases the
 chance of unequal blocks having the same checksum.

Yeah, but I think when you do the maths, the variation this causes tapers
to nothing as the block size gets significantly larger than the
checksum. All the maths we have been doing so far uses the simplifying
assumption that the block size is infinitely large compared to the checksum,
so the probability of a checksum clash is based entirely on the checksum
size.

But I haven't done the maths, so feel free to go through it and prove
otherwise :-)

what the heck... here it is;

  p = 1/2^b - 1/2^n

where:
  p is the probability that any two blocks have the same checksum, but are
  actually different.
  b is the number of bits in the checksum
  n is the number of bits in a block.
  
The first term is the probability any two random blocks have the same
checksum. The second term is the probability that any two random blocks are
actually the same. As you can see, as b -> n, p -> 0, but as n gets
significantly larger than b, it has less and less effect. We are talking
relative probability changes of the order of 1/2^1000.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Problem with checksum failing on large files

2002-10-12 Thread Donovan Baarda
On Sat, Oct 12, 2002 at 11:13:50AM -0700, Derek Simkowiak wrote:
  My theory is that this is expected behavior given the check sum size.
 
  Craig,
   Excellent analysis!

I was a bit concerned about his maths at first, but I did it myself from
scratch using a different approach and got the same figures...

for those who are interested, an OK approximation for the maths turns out to
be;

   p = (c^2)/2^(b+1)

where:
   p is probability of collision
   c is number of blocks
   b is number of bits in checksum.

This holds provided c is significantly less than 2^b. As c approaches 2^b, p
becomes too large using this approximation (when c=2^b, p should be 1).
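
A quick Python check of that approximation against the usual 1 - exp(...)
form (my own figures; b=48 as in Craig's analysis, and the c values are
roughly the block counts for 100MB, 1GB and 30-odd GB of 700 byte blocks):

import math

def collision_approx(c, b):
    return float(c) ** 2 / 2 ** (b + 1)     # the approximation above

def collision_birthday(c, b):
    # 1 - exp(-c*(c-1)/2^(b+1)); stays below 1 even when the approximation
    # above starts to overshoot
    return -math.expm1(-float(c) * (c - 1) / 2 ** (b + 1))

for c in (150000, 1500000, 50000000):
    print(c, collision_approx(c, 48), collision_birthday(c, 48))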

   Assuming your hypothesis is correct, I like the adaptive checksum
 idea.  But how much extra processor overhead is there with a larger
 checksum bit size?  Is it worth the extra code and testing to use an
 adaptive algorithm?

There is no extra processor overhead, because the full md4sums are currently
calculated anyway, before they are truncated down to minimise the signature
size.

   I'd be more inclined to say "This ain't the 90's anymore", realize
 that overall filesizes have increased (MP3, MS-Office, CD-R .iso, and DV)
 and that people are moving from dialup to DSL/Cable, and then make either
 the default (a) initial checksum size, or (b) block size, a bit larger.

librsync has options to specify the strong sum size, though I think it
currently just uses the full md4sum size. pysync also uses the full md4sum
size.

I'm not sure what the best approach is, but there should be a relationship
between block size, signature size, and file size.


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Problem with checksum failing on large files

2002-10-12 Thread Donovan Baarda
On Sat, Oct 12, 2002 at 07:29:36PM -0700, jw schultz wrote:
 On Sat, Oct 12, 2002 at 11:13:50AM -0700, Derek Simkowiak wrote:
   My theory is that this is expected behavior given the check sum size.
  
   Craig,
  Excellent analysis!
  
  Assuming your hypothesis is correct, I like the adaptive checksum
  idea.  But how much extra processor overhead is there with a larger
  checksum bit size?  Is it worth the extra code and testing to use an
  adaptive algorithm?
  
  I'd be more inclined to say "This ain't the 90's anymore", realize
  that overall filesizes have increased (MP3, MS-Office, CD-R .iso, and DV)
  and that people are moving from dialup to DSL/Cable, and then make either
  the default (a) initial checksum size, or (b) block size, a bit larger.
 
 I lean toward making the block-size adaptive.  Perhaps
 something on the order of 700 < (filesize / 15000) < 8KB
 Maybe both checksum size and block-size should be adaptive
 so that both track the file size so large files have larger
 but fewer checksums (compared to current defaults) and small
 files retain their current advantages.  

The ideal checksum size is totally dependent on the number of blocks. If you
scale your blocksize so that the number of blocks is kept low enough, you
can stick with the fixed checksum size.
 
 Any change like this will require a protocol bump so should
 include the MD4 sum corrections as well.

does it really need a protocol bump? The protocol already supports adaptive
block sizes, so provided the default block size increases at a suitable rate
based on the file size, that's all that is needed.

It would be nice to also force a larger checksum size, but that probably
would require a protocol change... (I'm more familiar with librsync than
rsync so I could be wrong).
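
ie something as dumb as this sketch (the 64K block cap is an arbitrary number
of mine, not anything rsync uses) keeps the block count, and hence the
collision odds for a fixed-size checksum, roughly constant as files grow;

def adaptive_blocksize(filesize, max_blocks=65536, minimum=700):
    # grow the block size with the file so the signature never holds
    # more than max_blocks block checksums
    return max(minimum, filesize // max_blocks)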

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Problem with checksum failing on large files

2002-10-11 Thread Donovan Baarda
On Fri, Oct 11, 2002 at 03:26:45PM -0700, Terry Reed wrote:
 
 
  -Original Message-
  From: Derek Simkowiak [mailto:dereks;itsite.com] 
  Sent: Friday, October 11, 2002 1:51 PM
  To: Terry Reed
  Cc: '[EMAIL PROTECTED]'
  Subject: Re: Problem with checksum failing on large files
  
  
   I'm having a problem with large files being rsync'd twice 
  because of 
   the checksum failing.
  
  I think this was reported recently.
  
  Please try using the -c option (always checksum) 
  and see if the makes the problem go away.
  
  This is a high priority bug for me (although I have not 
  yet experienced it).
  
  
  --Derek
 
 
 Using -c helps for the smallest file (900 MB), but has no effect on the
 larger files (e.g. 2.7 GB).  Most of my files are between 1.5 GB & 3 GB.

I wonder if this is related to the rsync md4sum not producing correct
md4sums for files larger than 512M? 

The existing rsync md4sum implementation does not produce md4sums the same
as the RSA implementation for files larger than 512M... but I thought it
was consistent with itself, so this didn't affect anything.

I posted a fixed md4sum implementation, but it would have required a new
protocol version number and various hacks for backwards compatibility with
the old implementation before it could be used in rsync.

Has someone rolled in a fixed md4sum implementation to rsync without
checking to make it backwards compatible? If so, then ensuring you have the
same version of rsync at both ends would avoid the problem.

If not, then I dunno :-)

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: MD4 bug in rsync for lengths = 64 * n

2002-09-03 Thread Donovan Baarda

On Mon, Sep 02, 2002 at 12:44:48AM -0700, Craig Barratt wrote:
[...]
 I can submit patches if required for the md4 code as tweaked/fixed for
  librsync. The fixed code is faster as well as correct :-)
 
 Sure, that would be great.  Otherwise, I would be happy to recreate
 and test a patch.

I have attached my latest versions of the librsync mdfour code. This code is
not yet available anywhere except here, but I will be posting it soon as a 
patch on the librsync project page at SF.

This is the final refinement of this code I plan to do before migrating the
API to match the RSA implementation. It includes bugfixes for sums larger
than 512MB, and produces identical results to the RSA implementation. It is
much faster than the RSA implementation or the previous samba/rsync-derived
implementation, and is much cleaner than the optimisations previously
submitted to librsync.

I'm not exactly sure how much faster it is, but rough tests using librsync
show reductions of more than 20% in signature generation times, though I
suspect some of that would be because I have also optimised the rollsums.

If the rsync guys want to take advantage of the rollsum optimisations I have
encapsulated them into a neat standalone rollsum.[ch] implementation. Let
me know if you want them.

BTW, I have also merged my librsync delta refactor changes with the latest
SF librsync CVS and will be submitting a smaller cleaner patch. Are we still
using the rproxy-devel list for librsync, or is there another list
somewhere?

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--


/*= -*- c-basic-offset: 4; indent-tabs-mode: nil; -*-
 *
 * librsync -- the library for network deltas
 * $Id$
 * 
 * Copyright (C) 2000, 2001 by Martin Pool [EMAIL PROTECTED]
 * 
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public License
 * as published by the Free Software Foundation; either version 2.1 of
 * the License, or (at your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser General Public
 * License along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#include "types.h"

struct rs_mdfour {
    int             A, B, C, D;
#if HAVE_UINT64
    uint64_t        totalN;
#else
    uint32_t        totalN_hi, totalN_lo;
#endif
    int             tail_len;
    unsigned char   tail[64];
};


/*= -*- c-basic-offset: 4; indent-tabs-mode: nil; -*-
 *
 * librsync -- the library for network deltas
 * $Id: mdfour.c 1.3 Thu, 27 Jun 2002 15:48:38 +1000 abo $
 * 
 * Copyright (C) 2000, 2001 by Martin Pool [EMAIL PROTECTED]
 * Copyright (C) 1997-1999 by Andrew Tridgell
 * 
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * (at your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser General Public License for more details.
 * 
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

/* MD4 message digest algorithm.
 *
 * TODO: Perhaps use the MD4 routine from OpenSSL if it's installed.
 * It's probably not worth the trouble.
 *
 * This was originally written by Andrew Tridgell for use in Samba. 
 * It was then modified by;
 * 
 * 2002-06-xx: Robert Weber [EMAIL PROTECTED] 
 *   optimisations and fixed 512M support.
 * 
 * 2002-06-27: Donovan Baarda [EMAIL PROTECTED]
 *   further optimisations and cleanups.
 */

#include <config.h>

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#include "rsync.h"
#include "trace.h"
#include "types.h"
#include "mdfour.h"


#define F(X,Y,Z) (((X)&(Y)) | ((~(X))&(Z)))
#define G(X,Y,Z) (((X)&(Y)) | ((X)&(Z)) | ((Y)&(Z)))
#define H(X,Y,Z) ((X)^(Y)^(Z))
#define lshift(x,s) (((x)<<(s)) | ((x)>>(32-(s))))

#define ROUND1(a,b,c,d,k,s) a = lshift(a + F(b,c,d) + X[k], s)
#define ROUND2(a,b,c,d,k,s) a = lshift(a + G(b,c,d) + X[k] + 0x5A827999,s)
#define ROUND3(a,b,c,d,k,s) a = lshift(a + H(b,c,d) + X[k] + 0x6ED9EBA1,s)

/** padding data used for finalising */
static unsigned char PADDING[64] = {
0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Re: MD4 bug in rsync for lengths = 64 * n

2002-09-03 Thread Donovan Baarda

Quoting Dave Dykstra [EMAIL PROTECTED]:

 Donovan or Craig, can you please post it as a patch to the current
 rsync
 source (including an increase in the protocol version and checks to do
 the
 old behavior when talking to an old version client).

I'm not the person to do this, as I don't know the rsync code and protocol 
version stuff that well. I've been working on librsync instead. (handballs to 
Craig :-).

I'm not sure what the best way of implementing the old behavior for talking to 
old versions would be. Perhaps implementing a broken finalise in a
separate mdfour-old.c would be best. Then the mdfour.[ch] can be common
for librsync, rsync, and perhaps samba without including old broken cruft just 
for rsync. 

 Probably we wouldn't want to incorporate it until a major release
 anyway
 (2.6.0) but if it isn't developed soon and put into the patches
 directory
 it's likely to be forgotten.

I've put the patches against librsync up on the librsync SF page. After I 
migrate it and librsync to the RSA implementation API I'll be submitting it to 
libmd. It would be nice if we had standard md4sum code that every project could 
use/debug/fix.

--
ABO
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: MD4 bug in rsync for lengths = 64 * n

2002-09-01 Thread Donovan Baarda

On Thu, Aug 29, 2002 at 03:17:34PM -0500, Dave Dykstra wrote:
 On Sun, Aug 04, 2002 at 01:19:43PM -0700, Craig Barratt wrote:
[...]
  However, as I'm sure is well-known,
  the Adler crc32 and MD4 computed by librsync don't match those in
  rsync 2.5.5.
 
 I do not recall hearing anybody mention that before.

I thought that the librsync and rsync delta etc were so radically different
that you could never get them talking... I assume that this is all going
towards making librsync and rsync talk, which would be nice.

  After swapping the crc32 endianess, changing RS_CHAR_OFFSET from 31 to
  0, and adding rsync's checksum_seed to librsync's MD4 they now agree,
  except in one case: when the total data length (including the 4 byte
  checksum_seed) is a multiple of 64, the MD4 checksums in librsync and
  rsync don't agree.  After reviewing the code in librsync, rsync and the
  original RSA implementation, I believe the bug is in rsync.  It doesn't
  call mdfour_tail() when the last fragment is empty.  Unfortunately
  this happens in the particularly common case of 700 + 4 = 64 * 11.
  The same bug occurs in both the block MD4s and the entire-file MD4.

This is the first detailed description of the problem I've seen. I've heard
it mentioned several times before, and thought that the md4 code in librsync
was the same as in rsync. I've looked and tweaked the md4 code in librsync
and could never see the bug so I thought it was a myth. I also thought that
samba used this code... I wonder what variant it is using :-)

  The bug is benign in the sense that it is on both sides so rsync works
  correctly.  But it is possible (I am certainly not a crypto expert) that
  missing the trailing block (that includes the bit length) substantially
  weakens the computed MD4.
  
  The fix is easy: a couple of > checks should be >=.  I can send
  diffs if you want.  But of course this can't be rolled in unless it
  is coupled with a bump in the protocol version.  
 
 Another bump in the protocol version is no problem.  Please submit a patch.

I can submit patches if required for the md4 code as tweaked/fixed for
librsync. The fixed code is faster as well as correct :-)

  I saw some earlier
  email about fixing MD4 to handle files >= 512MB (I presume this
  relates to the 64-bit bit count in the final block).  Perhaps this
  change can be made at the same time?
 
 Could you please post a reference to that email?  It isn't familiar to me
 and I didn't find it through google.  There have been other problems we've
 been seeing with with the end of large files and zlib compression, though.
 I wonder if it can somehow be related.

It may not have been on the rsync list, but on the librsync list... Please
note that there are several variants of the md4 patch floating around. I've
been meaning to separate the latest md4 patch from my bigger librsync delta
refactor patch for some time.

One of my goals for librsync is to make it use the same md4 API as the RSA
code, as this seems to be more widely used and would allow drop-in use of
the RSA implementation for verification. I'm planning to make the API
changes then submit this implementation to libmd. 

Before the API change I'll release a final bugfix/optimize version using the
old API.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: new rsync release needed soon?

2002-07-31 Thread Donovan Baarda

On Wed, Jul 31, 2002 at 12:29:31PM -0600, Robert Weber wrote:
 On the subject of needed patches, I just recently completed a patch for
 librsync that fixed the mdfour code to have uint_64 or 2 uint_32's for
 size.  Without this, the checksums on files > 512Megs are incorrect.  The
 code should drop into rsync without a hitch.  The same goes for mdfour in
 samba, because that is where librsync got the code from anyway.
 

Actually... I further refined and cleaned this patch and it's in my librsync
delta refactor patch. I believe this patch also offers a significant
performance increase... I can extract out just the mdfour changes into a
separate patch and submit it; just tell me where you want it.


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: new rsync release needed soon?

2002-07-31 Thread Donovan Baarda

On Wed, Jul 31, 2002 at 04:28:55PM -0700, Wayne Davison wrote:
 On Wed, 31 Jul 2002, Robert Weber wrote:
  On the subject of needed patches, I just recently completed a patch for
  librsync that fixed the mdfour code to have uint_64 or 2 uint_32's for
   size.  Without this, the checksums on files > 512Megs are incorrect.
 
 In order to interoperate with older versions of rsync, wouldn't we need
 to continue to generate the incorrect checksums on all but the newest
 (freshly bumped up) protocol number?
 
 ..wayne..

I'm not sure... it probably needs more analysis.

rsync primarily uses mdfour sums as the big checksums on blocks in the
signature. I think it highly unlikely that blocksizes of greater than 512M
are used, so this change wouldn't affect that.

However, it looks like rsync also uses mdfour for its whole file
checksums when using the -c option. In this case, any file larger than 512M
would end up with a different whole file sum, causing it to do a full
signature/delta/patch when it is not needed. However, the resulting delta
would be very small, so this might be tolerable.

The biggest worry is that I think rsync also does whole file sums for any
signature/delta/patch to double (triple?) check the final result. If this is
the case, then yes, fixing mdfour will break old rsync'ing of files greater
than 512M.

If this is really an issue, I can cripple the patch with one line so that
it conforms with the old mdfour code. This would give you the performance
benefits, and a single line change to uncripple it further down the track.
Alternatively I could add a modified finalise function to give crippled
sums, allowing the application software to pick what kind of sum it wants.

Someone on the librsync lists mentioned a while back that some rsync or
samba person had identified a bug in the mdfour finalise routines and had a
fix for it. I'm not sure if this 512M limit was it. I'd be interested to
know if/how this affects samba.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: timestamp on symlink

2002-07-28 Thread Donovan Baarda

On Mon, Jul 29, 2002 at 12:40:56PM +1000, Martin Pool wrote:
 On 28 Jul 2002, Michael Wang [EMAIL PROTECTED] wrote:
  rsync does not sync the timestamp on symlink (Solaris 8).
  
  It is probably due to the limitation of the Unix implementation
  of symlink, but I would like to know why rsync/Unix does not
  do this, and what we can do about it. Is the conclusion that
  rsync syncs everything except the timestamp on symlink?
 
 I don't think it is possible to set the time of a symbolic link.  A
 quick experiment on Linux seems to confirm that:
 
 !891 12:36 ~% python2.2
 Python 2.2.1 (#1, May  3 2002, 23:19:03) 
 [GCC 2.95.4 20011002 (Debian prerelease)] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import os
 >>> os.symlink('nothere', '/tmp/mylink')
 >>> os.utime('/tmp/mylink', None)
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 OSError: [Errno 2] No such file or directory: '/tmp/mylink'

This is because most of python's os.xxx methods de-reference symlinks. You
get this error because 'nothere' doesn't exist. The correct way to get time
info on symlinks is to use os.lstat(), which doesn't de-reference links.

Python 2.1.3 (#1, Apr 20 2002, 10:14:34) 
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import os
>>> os.symlink('nothere','mylink')
>>> os.lstat('mylink')
(41471, 64782L, 10L, 1, 1000, 1000, 7L, 1027913015, 1027913005, 1027913005)
 

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Block size optimization - let rsync find the optimal blocksize by itself.

2002-07-02 Thread Donovan Baarda

On Tue, Jul 02, 2002 at 11:06:30AM +0200, Olivier Lachambre wrote:
 At 09:19 01/07/2002 +1000, you wrote:
 
 [...]
 This relies on optimal block size being related for a set of files. I'm not
 sure that this heuristic actually applies, and I don't know how much benefit
 this would buy for all the complexity it would add.
 
 
 I think that many clients do not care about multiplying complexity by 2 or
 5, even if the speedup rate is only multiplied by 1.1.

The only problem is, multiplying complexity by 2 or 5 multiplies the bugs by
2 or 5. Are those clients still excited by the 1.1x speedup now?

The other problem with increasing complexity is the setting effect that
complexity causes. The more complex code becomes, the harder it gets to
modify, so that 1.1x speedup that introduced 5x complexity now makes it
impossible to implement a much better 1.5x speedup.

In my experience, complex solutions usually have worse performance than
simple ones anyway.

Occam's Razor rules :-)

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Block size optimization - let rsync find the optimal blocksize by itself.

2002-06-30 Thread Donovan Baarda

On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
[...]
   Well, the first comment: during my work, I wanted to verify that the
 theorical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
 PHd Thesis was actually the good one, and when doing the tests on randomly
 generated  modified files I discovered that the size sqrt(78*n/Q) is the
 actual optimal block size, I tried to understand this by reading all the
 thesis, then quite a lot of documentation about rsync but I just can't
 figure out why the theorical & experimental optimal block sizes so much
 don't match. I _really_ don't think it's coming from my tests, there must be
 somewhat else. Maybe the rsync developpers have just changed some part of
 the algorithm. And also, even without using data compression during the
 sync, rsync is always more efficient as it should be theorically, actually
 between 1.5 and 2 times more efficient. Nobody will complain about that but
 I'd be happy if someone would be nice enough to explain me this thing.

I believe that the compression option turns on compression of transferred
data, including file lists, instruction streams etc. Even with compression
turned off, the miss data in the delta is still compressed. 

Another thing to be aware of is when rsync compresses miss data, it also
compresses all the hit data to prime the compressor with context, and throws
away the compressed output for the hit data.

Perhaps this context compression is affecting the optimal block size? 

   Now the auto-optimization algorithm when updating many files at a time.
 Let's consider a set of files to be updated. We will consider only the files
 which have been changed since the last update (e.g. we can find the other
 ones by sending a MD5 sum for each file and trying to match it). We sync the
 first file, but the client keeps the old local version and can find how many
 differences between the two files there is and then guess the optimal block
 size. We assume that the percentage of differences between the files is a
 bit the same in the same set of files. So we use for the second file the
 optimal size found for the first file. Then for the third file we use the
 (arithmetic or geometric?) average of the first two files and so on... Once
 we have synced a certain number of files (10? 100?) we always use the same
 size which is supposed to be the best one.
   Sorry I'm too long, hope you'll understand everything,
 Olivier

This relies on optimal block size being related for a set of files. I'm not
sure that this heuristic actually applies, and I don't know how much benefit
this would buy for all the complexity it would add.

 P.S. I am not a programmer of any kind so don't wait for me to write any
 line of C (I know I'm a bad boy).

It might be an interesting experiment to write a wrapper for rsync that
accumulated stats on all transfers it was used for and learned the optimal
block size to use. This would not require any changes to rsync, and could be
written in something like Python.
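
Something like this very rough sketch would be enough to start collecting
numbers (nothing like it exists; --block-size and --stats are real rsync
options, but the stats parsing is only illustrative since the exact wording
varies between rsync versions):

import subprocess

def run_rsync(src, dest, blocksize):
    # shell out to rsync with a forced block size and record its --stats
    out = subprocess.check_output(
        ["rsync", "-a", "--stats", "--block-size=%d" % blocksize, src, dest])
    totals = {}
    for line in out.decode("utf-8", "replace").splitlines():
        if line.startswith("Total bytes"):
            label, value = line.rsplit(":", 1)
            totals[label.strip()] = int(value.split()[0].replace(",", ""))
    return totals

# log the stats for your normal transfers, then tweak the block size and
# compare, e.g. print(run_rsync("src/", "host:dest/", 2048))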


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Release 3 of rzync new-protocol test

2002-06-21 Thread Donovan Baarda

On Fri, Jun 21, 2002 at 03:46:39AM -0700, Wayne Davison wrote:
 The count of transferred bytes in the latest protocol is now below what
 rsync sends for many commands -- both a start-from-scratch update or a
 fully-up-to-date update are usually smaller, for instance.  This is
 mainly because my file-list data is smaller, but it's also because I
 reduced the protocol overhead quite a bit.  Transferred bytes for
 partially-changed files are still bigger than rsync because librsync
 creates unusually large delta sizes (though there's a patch that makes
 it work much better, it's still not as good as rsync).

I believe that the remaining difference is that rsync does context compression
using zlib. I believe librsync does no compression at all yet. Even if you
zlib compress librsync's deltas, they will still be bigger than rsync's
because of the context it uses... it compresses the whole file, hits and
misses, but only sends the compressed output for the misses. This means the
compressor is primed with data from the hits.
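
The idea is easy to sketch in Python with zlib (my own illustration, not
rsync's actual implementation); both hits and misses go through the
compressor so its history stays primed, but only the miss output is kept:

import zlib

def context_compress(tokens):
    # tokens is a sequence of ("hit", data) / ("miss", data) pairs
    comp = zlib.compressobj()
    sent = []
    for kind, data in tokens:
        out = comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)
        if kind == "miss":
            sent.append(out)    # only the compressed miss data is transmitted
        # hit output is thrown away; it only primed the dictionary, and the
        # receiver primes its own compressor with the matched basis data
    return b"".join(sent)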

I think that the best solution for this is to do what xdelta is planning to
do... toss zlib and include target references as well as source references
in the delta instruction stream; do the compression yourself. 

One way to do this is to implement xdelta-style non-block-aligned matches
against the target, building a rollsum hash-tree as you go through it, and
run it alongside the rsync block match algorithm. However, this might not
work well in practice...

 In my speed testing, one test was sending around 8.5 meg of data on a
 local system, and while rsync took only .5 seconds, my rzync app took
 around 2 seconds.  A quick gprof run reveals that 98% of the runtime is
 being spent in 2 librsync routines, so it looks like librsync needs to
 be optimized a bit.
 
 One potential next steps might include optimizing rsync to make the
 transferred file-list size a little smaller (e.g. making the transfer of
 the size attribute only as long as needed to store the number would
 save ~4-5 bytes per file entry on typical files).
 
 It looks like work needs to be done on making librsync more efficient.

I'm going to get onto this after this weekend. I know what needs to be
done... I just need the time to do it.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Strong encryption

2002-06-05 Thread Donovan Baarda

On Wed, Jun 05, 2002 at 03:33:23AM -0700, 'jw schultz' wrote:
 On Wed, Jun 05, 2002 at 12:21:18PM +0200, C.Zimmermann wrote:
   
   If you want them stored on the destination encrypted you
  
  Yes, that's it. The owner of the source files will be sure that no one
  else can read his files on the destination host.
  
  I thought, rsync only looks at the modification date of a file and
  decides whether to backup this file or not. In this case, the backup
  could be stored  encrypted.
 
 Rsync can handle encrypted files just fine.  It just treats
 them as ordinary binary files.  If the owner of the files
 encrypts them on the source they will be encrypted on the
 destination.
 
 As you have said rsync normally just looks at the
 modification date for deciding whether to update the
 destination (unless you use the -c option)  But, unless the
 -w option is used rsync will use some rather clever (nod to
 rsync developers) methods to transfer only the changed the
 parts of changed files.  It is this feature that gives rsync
 its speed.  My comments below are to advise you that that
 clever feature is nullified by encrypted files.  In fact for
 encrypted files rsync would be sub-optimal.  If most/all of
 the changed files are encrypted i would use the -w option.

Perhaps he wants a gpg-rsyncable patch to gpg?

Seriously, such a thing would be possible. After the long thread in which I
made a dick of myself discussing how gzip-rsyncable couldn't possibly work,
only to have someone explain how it does, I now know how it could be done;
encrypt the data in chunks, where the chunk boundaries are determined by the
same heuristic gzip-rsyncable uses.

I think I could even whip up a working Python wrapper around gpg in a day or
so that could do the job... but I'm too busy looking for paid work right
now to do it, so I'll leave it as an exercise for others :-)
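
For what it's worth, the chunking half is only a few lines of Python. This
sketch is just my loose reading of the gzip-rsyncable heuristic (a rolling
sum over the last WINDOW bytes of plaintext, cut a chunk when it hits a magic
value); the window and minimum chunk size are arbitrary numbers of mine, and
each chunk would then be encrypted separately:

WINDOW = 4096
MAGIC_MASK = 4095    # cut when the window sum is 0 mod 4096

def content_chunks(data):
    chunks, start, rollsum = [], 0, 0
    window = bytearray()
    for i, byte in enumerate(bytearray(data)):
        rollsum += byte
        window.append(byte)
        if len(window) > WINDOW:
            rollsum -= window.pop(0)
        if (rollsum & MAGIC_MASK) == 0 and i + 1 - start >= WINDOW:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

Because the boundaries depend only on the local plaintext, an insert or
delete early in the file shifts the chunks but doesn't change the later ones,
so their separately encrypted output still matches (assuming the per-chunk
encryption is deterministic).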

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Compressed backup

2002-06-04 Thread Donovan Baarda

On Tue, Jun 04, 2002 at 05:43:17PM +1000, Kevin Easton wrote:
[...]
 If you'll indulge me, I'll just restate the problem (as I see it, anyway) 
 before chiming in with my idea...
[snip big description of why gzip-rsyncable actually does work]

Thanks for the description of how gzip-rsyncable actually works. I should
learn to do some more research before shooting my mouth off. I must have
sounded pretty clueless... the heuristic reset idea is brilliant.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Compressed backup

2002-06-01 Thread Donovan Baarda

On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
 On Fri, May 31, 2002 at 11:45:43AM +1000, Donovan Baarda wrote:
  On Thu, May 30, 2002 at 03:35:05PM -0700, jw schultz wrote:
[...]
  I would guess that the number of changes meeting this criteria would be
  almost non-existant. I suspect that the gzip-rsyncable patch does nearly
  nothing except produce worse compression. It _might_ slightly increase the
  rsyncability up to the point where the first change in the uncompresstext
  occurs, but the chance of it re-syncing after that point would be extremely
  low.
 
 Actually many file modifications do just fine.  The key
 being to recognize that any plaintext modification will
 alter the compresstext from that point to the end.
 Most content modifications alter the blocks nearest the end
 of the file.  Think about how you edit text and Word processor
 documents.

So it is not possible for rsync to get any matches on a gzip-rsyncable
compressed file after the first modification. Does the gzip-rsyncable patch
actually improve the rsyncability of compressed files at all? AFAICT, files
compressed normally should be pretty much rsyncable up to the same point.
Resetting the compression every 4K probably does allow you to rsync closer up
to that point, but only because the resets make the compression less
efficient... ie any savings from matching closer to the modification are
lost because of the overall larger file.

[...]
 This trend will also affect several other aspects of systems
 and network administration.  We are rapidly approaching a
 day when most application files stored in home directories
 and shared work areas will be compressed.  This means that
 that those areas will not benifit from network or filesystem
 compression.  And our so-called 200GB tape drives will
 barely exceed 1:1 compression and only hold 100GB of these
 types of files.  I expect non-application files to remain
 uncompressed for the forseeable future but we should
 recognize that the character of the data stored is changing
 in ways that disrupt the assumptions many of our tools are
 built upon.

I think that the increased use of compressed files is going to require that
rsync-like tools become compression aware, and be smart enough to
decompress/recompress files when syncing them. I see no way around it, other
than throwing heaps of bandwidth at the problem :-). Needless to say this
will make the load on servers even worse. However server side signature
caching and client side delta calculation would probably end up making the
load on servers even lower than it currently is.

[...]
  I don't think it is possible to come up with a scheme where the reset
  windows could re-sync after a change and then stay sync'ed until the next
  change, unless you dynamiclly alter the compression at sync time... you may
  as well rsync the decompressed files.
 
 The only way to do it is to make a content-aware compressor
 that compresses large chunks and then pads the compresstext
 to an aligned offset.  That would be too much waste to be a
 good compression system.

even this wouldn't do it... the large chunks would have to be split on
identical boundaries over unchanged uncompressedtext in the basis and the
target. The only way this could be achieved would be if the target was
compressed using resets on boundarys determined by analysing the changes and
boundaries used when the basis was compressed. If the end that has the
target file has that degree of intimate knowege about the other end's basis
file, then you can toss the whole rsync algorithm and revert to some sort of
compressed xdelta.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: Compressed backup

2002-06-01 Thread Donovan Baarda

On Sat, Jun 01, 2002 at 04:57:15AM -0700, jw schultz wrote:
 On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote:
  On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
[...]
 When i said content-aware compressor  what i meant was
 that the compressor would actually analize the plaintext to
 find semantically identifiable blocks.  For example, a large
 HOWTO could be broken up by the level-2 headings  This would
 be largely (not always) consistant across plaintext changes
 without requiring any awareness of file history.  Have rsync
 be compression aware.  When rsync hits a gziped file it
 could treat that file as multiple streams in series where it
 would restart the checksumming each time the compression
 table is reset.

That's actually pretty clever... It provides a neat way around the XML
meta-data at the front changing problem... just have the XML compression
reset at suitable boundaries in the XML file...

You wouldn't need rsync to be compression aware. Provided the adjacent
unchanged segments between compression resets were larger than the rsync
blocksize, you would only miss block fragments at the beginning and end of
each matching sequence (as rsync already does). However, by making rsync
gzip-reset aware, you could do interesting things with variable block sizes,
where the file itself specifies the rsync block boundaries. Hmmm, more
thought is needed on how this would integrate with the rolling checksum
though... I suspect you could toss the rolling checksum and just look for
matching blocks as defined by the resets, because it's not going to match at
an arbitrary byte boundary anyway.

 I can't see this actually happening but it could work where
 the compression is done by the application that creates the
 file.  If, and only if, that were to be done so that there
 were enough to be worthwhile then rsync could be made
 compression aware in this way but that would require a
 protocol change.

Putting content-aware compression resets at appropriate points will make
files (a bit) rsyncable with rsync as it already stands. Making rsync
compression-reset aware would only improve things a little bit. However, the
resets _must_ be content-aware; they must occur on likely-to-match
boundaries, not at every 4K of compressedtext as the gzip-rsyncable patch
currently does.

 Your idea of having rsync actually do the checksums on the
 plaintext of compressed files might have some merit in
 future.  It would mean essentially that we would zcat the
 source twice and the destination would be ungzipped, merged
 and then regzipped.  Gastly as far as CPU goes but would
 help save us network bandwidth which is growing at a lower
 rate.  The questions are, what is the mean offset of first
 change as a proportion of file size and are enough files
 gzipped to merit the effort?

This is what I believe xdelta already does for compressed files. The deltas
are computed for the uncompressed data on files that are identified as
compressed.

I don't see why you would zcat the source twice... 

The destination would ungzip the basis file, calculate and send a sig. The
source would zcat the target, feeding it through the rolling checksum,
looking for matches and sending the delta to the destination. The
destination would receive and apply the delta, then gzip the result.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: Compressed backup

2002-06-01 Thread Donovan Baarda

On Sat, Jun 01, 2002 at 05:18:42PM -0700, jw schultz wrote:
 On Sat, Jun 01, 2002 at 11:46:37PM +1000, Donovan Baarda wrote:
  On Sat, Jun 01, 2002 at 04:57:15AM -0700, jw schultz wrote:
   On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote:
On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
[...]
  Putting content-aware compression resets at appropriate points will make
  files (a bit) rsyncable with rsync as it already stands. Making rsync
  compression-reset aware would only improve things a little bit. However, the
  resets _must_ be content-aware; they must occur on likely-to-match
  boundaries, not at every 4K of compressed text as the gzip-rsyncable patch
  currently does.
 
 chunk: section of file gzipped without reset.
 block: section of file that dest has checksummed.

Good terminology :-) Let me add:

basis : the oldfile on the destination that is being updated to the target.
target: the newfile on the source.

 The buy-back of the rolling checksums is to allow subsequent
 blocks to match at different offsets when a length change
 occurs mid-file.
 
 Without a gzip reset aware variable blocksizes rsync
 wouldn't gain anything from a gzip reset because the reset

yes it would...

 would occur in the middle of a block.  The rolling checksums
 would be droppable, however because matching blocks within a
 gzip chunk would not be relocatable within that chunk.

yes it would, provided the block size is smaller than the compressed size of
a sequence of unchanged adjacent chunks. The first and last block fragments
of the matching sequence of chunks would be lost, but all the blocks in the
middle would match. The rolling checksum would re-sync at the first complete
matching block.

Provided the context-sensitive location of compression resets ensures they
occur at the same points in the target and basis around unchanged chunks,
rsync will re-sync and find matches after any change, as long as the
matching sequence of chunks is larger than the blocksize.

 For any gain rsync blocks need to have block-aligned offsets
 within the gzip chunks.

There are two things block-aligned offsets in gzip chunks buy you: slightly
better sigs for the basis, and slightly fewer missed block fragments in the
delta. The reason you get better sigs is that any block with a reset in the
middle has a useless checksum in the sig unless the chunks on either side
of the reset are both unchanged. The reason you get a slightly better delta
is that you avoid missing the block fragments at the beginning and end of a
matching sequence of chunks.

Both of these are related and negligible provided the block size is small
compared to the size of matching sequences of chunks. These problems already
apply to rsync on uncompressed files, and are the reason xdelta can produce
more optimal deltas. Attempting to align compressor resets with block
boundaries is akin to tweaking block sizes on uncompressed data; you can
improve things a little, but the effort vs return is only really worth it
for very special cases. If you align resets to blocks at the expense of
locating them contextually, you will just make things worse.

 The first reason the destination zcats twice is that we might lack
 the disk space to store the uncompressed files.  The only
 reason to go to this much work for gzipped files is because
 there are many large ones.  Therefore, we dare not leave
 them around uncompressed even when rsync works on one
 directory at a time.  The other reason is that the scanning
 and generation processes should not alter the trees.

I don't think you can efficiently get away with just using zcat when merging
deltas. The reason is the basis has to be seekable... because rsync can find
and use matches in files that have been re-ordered, you sometimes have to
seek backwards. The only way I can see to 'seek' without a full
decompression is to restart decompression every time you seek backwards. You
can sort of improve this by preserving the decompressor state at various
points so you can start seeking from there... I see a seekable-zcat class
in my mind already :-)
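
Something like this sketch, say (stdlib zlib, single-member gzip files
assumed; a fancier version would checkpoint the decompressor state every so
often so backward seeks don't always go back to byte zero):

import zlib

class SeekableZcat:
    # Read-only 'seekable' view of a gzip file. Backward seeks just restart
    # decompression from the start; forward seeks decompress and discard.

    CHUNK = 64 * 1024

    def __init__(self, path):
        self.path = path
        self._restart()

    def _restart(self):
        self._raw = open(self.path, "rb")
        self._decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip wrapping
        self._pos = 0    # offset of the next byte read() will return
        self._buf = b""

    def read(self, n):
        while len(self._buf) < n:
            raw = self._raw.read(self.CHUNK)
            if not raw:
                break
            self._buf += self._decomp.decompress(raw)
        out, self._buf = self._buf[:n], self._buf[n:]
        self._pos += len(out)
        return out

    def seek(self, offset):
        if offset < self._pos:
            self._raw.close()
            self._restart()
        while self._pos < offset:
            if not self.read(min(offset - self._pos, self.CHUNK)):
                break    # seek past end of file

If I remember right, Python's own gzip module emulates seeks in much the
same way, so a quick prototype could probably just lean on that.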

conclusion:

I don't think gzip-rsyncable helps much at all, certainly not enough to
warrant including it in the official gzip (figures that demonstrate
otherwise are welcome).

Context-sensitive location of compression resets by applications that try to
place resets at the same points around unchanged chunks will result in
rsyncable compressed files (compressed-XML'ers take note: a compression
reset after the meta-data section and at significant points in the XML
could make a huge difference).
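
With zlib the 'compression reset' is just a full flush, so an application
can do something like this sketch (plain zlib stream rather than the gzip
wrapper, and the section boundaries are whatever is meaningful for the
format, e.g. right after the XML meta-data):

import zlib

def compress_sections(sections):
    # Deflate each logical section, with a full flush between sections so
    # later sections don't reference earlier data. Unchanged sections then
    # compress to identical bytes, which is what makes the output rsyncable.
    comp = zlib.compressobj()
    out = []
    for section in sections:
        out.append(comp.compress(section))
        out.append(comp.flush(zlib.Z_FULL_FLUSH))  # the compression reset
    out.append(comp.flush())                       # finish the stream
    return b"".join(out)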

Application-specific rsync-alike updates of compressed data with contextual
location of compression resets will get better results if they toss the
rsync rolling checksum and synchronise using sigs for 'chunks' rather than
rsync's 'blocks'.
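
For example, assuming both ends can split their plaintext into the same
'chunks' at the reset boundaries, chunk-level matching is little more than
a dictionary lookup (md5 here is just a stand-in for whatever strong sum
you prefer):

import hashlib

def chunk_sig(chunks):
    # Signature of the basis: one strong sum per chunk (a chunk being the
    # plaintext between two compression resets).
    return {hashlib.md5(c).hexdigest(): i for i, c in enumerate(chunks)}

def chunk_delta(sig, new_chunks):
    # Delta for the target: a basis-chunk reference where the sums match,
    # literal data where they don't.
    delta = []
    for c in new_chunks:
        key = hashlib.md5(c).hexdigest()
        delta.append(("copy", sig[key]) if key in sig else ("literal", c))
    return delta

def chunk_patch(basis_chunks, delta):
    # Rebuild the target from the basis chunks plus the delta.
    return b"".join(basis_chunks[arg] if op == "copy" else arg
                    for op, arg in delta)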

The effort vs return of accurately locating compression resets contextually
around unchanged 'chunks' depends heavily on the type of data. The best

Re: Improving the rsync protocol (RE: Rsync dies)

2002-05-20 Thread Donovan Baarda

On Mon, May 20, 2002 at 09:35:04PM +1000, Martin Pool wrote:
 On 17 May 2002, Wayne Davison [EMAIL PROTECTED] wrote:
  On Fri, 17 May 2002, Allen, John L. wrote:
[...]
 I've been thinking about this too.  I think the top-level question is
 
   Start from scratch with a new protocol, or try to work within the
   current one?

Tough question... To avoid backwards breakage and yet implement something
significantly better, you would probably have to make two rsyncs in one
executable: the new protocol, and the old one as a compatible fallback for
talking to old versions. After enough time had passed and all old rsync
implementations had been purged, the old code could be dropped, leaving a
nice, clean, small solution.

I tend to think that once a delta-compressing HTTP extension gets mainstream
acceptance, rsync will fade away _unless_ it offers significantly better
performance by avoiding HTTP overheads (which is why FTP lives on, despite
being a bastard protocol from hell).

 algorithm or codebase, or need to evolve the current one.  I think the
 nature of the current protocol is that it will be hard to make really
 fundamental improvements without rewriting it.
[...]

My feelings too.

 I wrote librsync.  There is some documentation and I can add more if
 there's anything undocumented.

I really want to do some work on librsync. I recently did some work writing
a Python swig wrapper for it and identified several areas where it could be
improved. I already have a more modular implementation of the rolling
checksum that is 2-3x faster, which I want to integrate with it.

 I haven't looked at pysync as much as it deserves, but it could be a
 good foundation.

Pysync now includes a Python extension module to librsync. I have also
implemented a Python wrapper around the rolling checksum mentioned above
that makes the Python version run nearly 2x as fast as the pure Python
adler32 version. This will be in the next release.

I don't think Python is viable for a final rsync solution, even using
librsync as an extension module. The Python interpreter is an unnecessary
overhead for a lean-mean-data-transfer-machine. However, Python is _perfect_
as a means of experimenting with protocols and algorithms. Pysync was
written with this in mind, and in ~400 lines of heavily commented Python
implements both the rsync and xdelta algorithms. The next release will add
inverse-rsync (for client-side delta calculation). Note that pysync _does_
implement the context compression used by rsync that is not included in
librsync yet.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: timeout error in rsync-2.5.5

2002-05-03 Thread Donovan Baarda

On Fri, May 03, 2002 at 04:41:17PM -0500, Dave Dykstra wrote:
 You shouldn't need to have such a long timeout.  The timeout is not over
 the whole length of the run, only the time since the last data was
 transferred.  It's a mystery to me why it quits after 66 minutes rather
 than 5 hours, but the real question is why it stops transferring data for
 so long.  Perhaps something went wrong with the network.  I can't connect
 to that server to try it, perhaps it is behind a firewall.

I have seen similar stalls with rsync running under apt-proxy with a
timeout. However, in this case the rsync is just transferring a single file,
so the directory transfer time is nothing and the stall occurs after about
5 mins. With a timeout value of 15 mins, the rsync client just sits there
until the 15 min timeout happens, when it dies.

I suspect something is going wrong with the timeout negotiation between the
client and server. If I do the transfer manually without a timeout it works
fine. I haven't looked into this too hard yet, as I was not sure if it was
something apt-proxy related, or perhaps a kernel 2.4.17 issue, or maybe
something strange in the firewall in the middle, etc.

However, my recent test doing it manually suggests it's probably an rsync
problem.

 
 - Dave Dykstra
 
 On Tue, Apr 16, 2002 at 12:36:03PM -0400, Alberto Accomazzi wrote:
  
  Dear all,
  
  I've been trying to track down a problem with timeouts when pulling data from
  an rsync daemon and I have now run out of any useful ideas.
  The problem manifests itself when I try to transfer a large directory tree
  on a slow client machine.  What happens then is that the client rsync process
  successfully receives the list of files from the server, then begins checking
  the local directory tree, taking its sweet time.  Since I know that the process
  is quite slow, I invoke rsync with a timeout of 5 hours to avoid dropping the
  connection.  However, after a little over 1 hour (usually 66 minutes or so),
  the server process simply gives up.
  
  I have verified the problem under rsync versions 2.3.2, and 2.4.6 and up 
  (including 2.5.5), testing a few different combinations of client/server
  versions (although the client is always a linux box and the server always
  a solaris box).  It looks to me as if something kicks the server out of
  the select() call at line 202 of io.c (read_timeout) despite the timeout
  being correctly set to 18000 seconds.  Can anybody think of what the 
  problem may be?  See all the details below.
  
  Thanks,
  
  -- Alberto
  
  
  
  CLIENT:
  
  [ads@ads-pc ~]$ rsync --version
  rsync  version 2.5.5  protocol version 26
  Copyright (C) 1996-2002 by Andrew Tridgell and others
  http://rsync.samba.org/
  Capabilities: 64-bit files, socketpairs, hard links, symlinks, batchfiles, 
IPv6, 64-bit system inums, 64-bit internal inums
  
  rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
  are welcome to redistribute it under certain conditions.  See the GNU
  General Public Licence for details.
  
  [ads@ads-pc ~]$ rsync -ptv --compress --suffix .old --timeout 18000 -r --delete 
rsync://adsfore.harvard.edu:1873/text-4097/. /mnt/fwhd0/abstracts/phy/text/
  receiving file list ... done
  rsync: read error: Connection reset by peer
  rsync error: error in rsync protocol data stream (code 12) at io.c(162)
  rsync: connection unexpectedly closed (17798963 bytes read so far)
  rsync error: error in rsync protocol data stream (code 12) at io.c(150)
  
  
  SERVER:
  
  adsfore-15: /proj/ads/soft/utils/src/rsync-2.5.5/rsync --version
  rsync  version 2.5.5  protocol version 26
  Copyright (C) 1996-2002 by Andrew Tridgell and others
  http://rsync.samba.org/
  Capabilities: 64-bit files, socketpairs, hard links, symlinks, batchfiles, 
no IPv6, 64-bit system inums, 64-bit internal inums
  
  rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
  are welcome to redistribute it under certain conditions.  See the GNU
  General Public Licence for details.
  
  from the log file:
  
  2002/04/16 08:52:48 [18996] rsyncd version 2.5.5 starting, listening on port 1873
  2002/04/16 09:39:01 [988] rsync on text-4097/. from ads-pc (131.142.43.117)
  2002/04/16 10:51:36 [988] rsync: read error: Connection timed out
  2002/04/16 10:51:36 [988] rsync error: error in rsync protocol data stream (code 
12) at io.c(162)
  
  from a truss:
  
  adsfore-14: truss -d -p 988
  Base time stamp:  1018964639.2848  [ Tue Apr 16 09:43:59 EDT 2002 ]
  poll(0xFFBE4E90, 1, 1800)   (sleeping...)
  4057.4093   poll(0xFFBE4E90, 1, 1800)   = 1
  4057.4098   read(3, 0xFFBE5500, 4)  Err#145 ETIMEDOUT
  4057.4103   time()  = 1018968696
  4057.4106   getpid()= 988 [18996]
  4057.4229   write(4,  2 0 0 2 / 0 4 / 1 6   1.., 66)  = 66
  4057.4345   

Re: rsync md4sum code.

2002-04-28 Thread Donovan Baarda

On Sun, Apr 28, 2002 at 01:06:10PM +1000, Donovan Baarda wrote:
 On Sat, Apr 27, 2002 at 03:32:47PM -0700, Martin Pool wrote:
  On 27 Apr 2002, Donovan Baarda [EMAIL PROTECTED] wrote:
   G'day,
   
   I've been working on a Python interface to librsync and have noticed that it
   uses md4sum code borrowed from Andrew Tridgell and Martin Pool that comes
   via rsync and was originally written for samba.
  
  Tridge recently discovered a bug in that code that probably does not
  weaken the digest, but that may make it incompatible with standard
  MD4.  Basically, tail-extension is not properly carried out for blocks
  that are a multiple of 64 bytes in size.
 
 This would be nearly all blocks, as everyone would be using 2^n sized blocks
 where n > 5. If you meant to say ...that are _not_ a multiple of 64 bytes...,
 then I would dare to suggest fixing this would not hurt anybody, but
 definitely record the effects.
 
  I haven't had a chance yet to check how this affects rsync.  If it
  does, I suppose we should evolve the protocol to fix it.
  
  There's not meant to be anything special about it.  One of my TODO
  items was to replace it with a faster implementation.
 
 I'm not sure how the RSA implementation compares speed-wise, but given it is
 more correct, would there be major objections to replacing the samba md4
 with the RSA one in librsync? I guess I should benchmark and publish
 results...

Ok, preliminary results are in. Because I'm primarily interested in Python
interface stuff at the moment, I chose to make my benchmark by hacking
together swig wrappers for the libmd md4c.c and the librsync mdfour.c, and
whipping up a python test rig. The wrappers should have identical overheads.
I used pre-generated random data for the input.
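
The rig itself was nothing special; in today's Python it would look
something like the sketch below, where digest is whichever wrapped md4
constructor is being measured (for example lambda: hashlib.new("md4"),
where the underlying crypto library still provides MD4):

import os
import time

def bench(digest, nblocks=10000, blocksize=4096):
    # Time `nblocks` digests over pre-generated random blocks.
    blocks = [os.urandom(blocksize) for _ in range(nblocks)]
    start = time.perf_counter()
    for block in blocks:
        h = digest()
        h.update(block)
        h.digest()
    return time.perf_counter() - start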

The results for a Celeron-366 doing 10K md4sums on 4K blocks of random data
are: libmd 3.1 secs, librsync 2.5 secs. Both gave identical checksum results,
so the bug Tridge found didn't rear its head.

Conclusion: the librsync md4 is pretty fast. However, it was also slightly
harder to swig-wrap in isolation from the rest of librsync, as it had a few
tie-ins with the rs_trace stuff.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: [rproxy-devel] Re: rsync md4sum code.

2002-04-28 Thread Donovan Baarda

On Sun, Apr 28, 2002 at 11:48:14PM +1000, Donovan Baarda wrote:
[...]
 Ok, preliminary results are in. Because I'm primarily interested in Python
 interface stuff at the moment, I chose to make my benchmark by hacking
 together swig wrappers for the libmd md4c.c and the librsync mdfour.c, and
 whipping up a python test rig. The wrappers should have identical overheads.
 I used pre-generated random data for the input.
 
 The results for a Celeron-366 doing 10K md4sums on 4K blocks of random data
 are: libmd 3.1 secs, librsync 2.5 secs. Both gave identical checksum results,
 so the bug Tridge found didn't rear its head.

Additional results: a libmd native module based on the existing Python
md5module, instead of the swig wrapper, takes 1.6 secs.

 Conclusion: the librsync md4 is pretty fast. However, it was also slightly
 harder to swig-wrap in isolation from the rest of librsync, as it had a few
 tie-ins with the rs_trace stuff.

Also, the C implementation matters far less than how you interface it with
Python. For the swig wrappers I was using shadow classes, which add an extra
level of indirection. It looks like, for the swig tests, 1.5 secs was
overhead, which makes the librsync md4 look even more impressive...

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




rsync md4sum code.

2002-04-27 Thread Donovan Baarda

G'day,

I've been working on a Python interface to librsync and have noticed that it
uses md4sum code borrowed from Andrew Tridgell and Martin Pool that comes
via rsync and was originally written for samba.

Is there anything special about this code compared to the RSA md4sum code that
can be found with libmd (http://www.penguin.cz/~mhi/libmd/)?

Python uses the RSA md5sum code for its md5 module, and making a Python md4
module was as simple as running sed 's/\(md\)5/\14/i' over the Python interface code.

I'm contemplating some code-cleanup hacking on librsync and was wondering if
the samba-based md4sum code was better than the RSA code, or if it is
worth switching.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync md4sum code.

2002-04-27 Thread Donovan Baarda

On Sat, Apr 27, 2002 at 03:32:47PM -0700, Martin Pool wrote:
 On 27 Apr 2002, Donovan Baarda [EMAIL PROTECTED] wrote:
  G'day,
  
  I've been working on a Python interface to librsync and have noticed that it
  uses md4sum code borrowed from Andrew Tridgell and Martin Pool that comes
  via rsync and was originally written for samba.
 
 Tridge recently discovered a bug in that code that probably does not
 weaken the digest, but that may make it incompatible with standard
 MD4.  Basically, tail-extension is not properly carried out for blocks
 that are a multiple of 64 bytes in size.

This would be nearly all blocks, as everyone would be using 2^n sized blocks
where n > 5. If you meant to say ...that are _not_ a multiple of 64 bytes...,
then I would dare to suggest fixing this would not hurt anybody, but
definitely record the effects.

 I haven't had a chance yet to check how this affects rsync.  If it
 does, I suppose we should evolve the protocol to fix it.
 
 There's not meant to be anything special about it.  One of my TODO
 items was to replace it with a faster implementation.

I'm not sure how the RSA implementation compares speed-wise, but given it is
more correct, would there be major objections to replacing the samba md4
with the RSA one in librsync? I guess I should benchmark and publish
results...

There would be backwards compatibility issues for librsync and rdiff, but
I'm hoping that these can be dealt with simply, hopefully by just bumping the
major version number and documenting the issue. librsync is not as widely
used as rsync itself, so I don't think it would matter that much.

I think the RSA/libmd code is more standard, and it would be wise for
librsync at least to adopt the more widely used code. It is certainly nice
that the md2, md4, and md5 APIs are so interchangeable. The only reason not
to would be licensing issues, but I would guess if Python can include it and
be GPL compatible, then it should be OK.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: [rproxy-devel] rdiff deltas not very good compared to pysync, why?

2002-04-25 Thread Donovan Baarda

On Wed, Apr 24, 2002 at 09:46:09PM -0400, Shirish H. Phatak wrote:
 Hi Martin,
 
   I can definitely help in managing the librsync package, especially 
 since I will be maintaining our local tree anyway. As of now, I don't 
 have the resources to get the Windows stuff tested; however, I can 
 handle the Unix side of things. Maybe Donovan would be interested in 
 managing the Windows effort?

I am also interested in helping take over librsync maintenance. Ordinarily
I'm a pure Linux development person, but in this case I was sponsored by
Accellion to do a Python wrapper that had to work under Windows.

I'm also pretty familiar with how the algo works from my work on Pysync.

   I am fine with either Sourceforge or samba.org. I have used the 
 former quite extensively during Intermezzo development, but most of my 
 usage was CVS access, so I don't think either site would make much of a 
 difference.

I really like a bug-tracker and the extra stuff SF gives you. I also find SF
is a fairly centralised place for it all. However, if samba.org can provide
the same facilities...

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: Patch: update popt/ to 1.5.1

2002-04-21 Thread Donovan Baarda

On Sat, Apr 20, 2002 at 12:23:16PM -0700, Jos Backus wrote:
 This patch updates the files under popt/ to the latest vendor drop. The only
 change is the inclusion of a FreeBSD-specific patch to popt.c. This is needed
 in case somebody decides to build rsync on that platform without using the
 port. I'm not happy about the wording in popt/README.rsync so I may change it.
 
 The patch is available at
 
 http://www.catnook.com/rsync-popt-1.5.1.patch

I have just submitted a patch against librsync that included changes to the
popt included in it for cygwin and MSVC compatibility. Near as I can tell,
the popt I was patching was v1.3.

Do you (or anyone else) know if the 1.5.1 version fixes cygwin and MSVC
compilation? (by MSVC I mean doing configure CC=cl --build=windows32 in
cygwin). Should I be submitting my changes somewhere else upstream?


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: Can rsync update files in place?

2002-04-16 Thread Donovan Baarda

On Tue, Apr 16, 2002 at 09:43:12PM +0200, Paul Slootman wrote:
 On Tue 16 Apr 2002, Martin Pool wrote:
  On 16 Apr 2002, Paul Slootman [EMAIL PROTECTED] wrote:
  
   Is there a way to get rsync to not create a new file while transferring
   and then rename it, but to instead update the existing file in place,
   i.e. simply write those blocks that have been updated and leave the rest
   alone?
  
  That would be pretty hard, though not impossible.
  
  I suspect it would be roughly as hard to put it into rsync as to 
  write a new program from scratch, depending on what features you need.
 
 That's what I've been thinking :-)

If someone wanted to experiment with this kind of thing, <plug>pysync would
be an ideal prototyping tool</plug>.

It's a Python implementation of rsync that I wrote for the express purpose
of being an easy way to demonstrate and experiment with the algorithm. It
includes both rsync and xdelta algorithms, and is about 500 lines in total,
but half of that is comments.

http://freshmeat.net/projects/pysync/

I am currently working on including a swig'ed wrapper for librsync, that
should mean you could use it for real-world (Python) applications. The plan
is to have this ready by the end of this week.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: rsync and debian -- summary of issues

2002-04-12 Thread Donovan Baarda

On Fri, Apr 12, 2002 at 01:28:01PM +1000, Martin Pool wrote:
 On 12 Apr 2002, Brian May [EMAIL PROTECTED] wrote:
 
  I think some more details is required regarding rproxy.
[...]
  AFAIK, it solves all the problems regarding server load discussed in
  rsync, doesn't it???
 
 Why did you think that?
 
 rproxy addresses a rather different problem to rsync: for example, it
 transfers only one file at a time, not whole directories.  No, rproxy
 does not have a magic way to do the delta computation in zero time.
 Compared to rsync, rproxy has the advantage of cleaner code (imho),
 but the disadvantage that no optimization work has been done.

The big problem with rproxy is it's implemented in perl (perl: crypto for
algorithms). librsync on the other hand is a nice piece of work and _should_
be used for a re-implementation of rproxy and/or rsync.

I have recently got sponsorship to add a Python front-end to librsync as an
extension to my pysync code, which I should have done by the end of next week.
After this I will probably do some work on adding rproxy HTTP extensions
into something like Medusa or Twisted.

If rproxy includes the signature in a HEAD response, and supports ranged
requests, delta calculation can be moved to the client with no changes to
rproxy as follows (see the sketch after the list):

1) client sends HEAD request to get upstream signature.
2) client does reverse-delta calculation.
3) client applies reverse-delta using ranged-requests to fetch required
parts from upstream.
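
A rough sketch of steps 1) and 3); the signature header name is purely
illustrative (step 2 is whatever reverse-delta code you already have, e.g.
librsync or pysync):

import http.client

def fetch_signature(host, path):
    # 1) HEAD request; assume the rproxy extension exposes the upstream
    #    signature somehow (a response header is used here for illustration).
    conn = http.client.HTTPConnection(host)
    conn.request("HEAD", path)
    sig = conn.getresponse().getheader("X-Rproxy-Signature")  # illustrative
    conn.close()
    return sig

def fetch_range(host, path, start, length):
    # 3) Ranged GET for the pieces of the upstream file the client is missing.
    conn = http.client.HTTPConnection(host)
    headers = {"Range": "bytes=%d-%d" % (start, start + length - 1)}
    conn.request("GET", path, headers=headers)
    data = conn.getresponse().read()
    conn.close()
    return data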

You touched on this in your page, but not in relation to rproxy. I believe
the client-side delta stuff you mentioned was not using rproxy http
extensions at all, just adding some sort of cgi that returns signatures for
objects on the server.

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Re: Rsync and Java

2001-03-11 Thread Donovan Baarda

On Sun, Mar 11, 2001 at 08:52:25PM -0500, Andre John Mas wrote:
 Martin Pool wrote:
  
  On 11 Mar 2001, Andre John Mas [EMAIL PROTECTED] wrote:
   I am looking at possibly doing an rsync port to Java, depending on
   time constraints, and I would like to know whether anyone has already
   set up such an effort. If there is one already under way, then it
   wouldn't make sense for me to duplicate the effort.
  
  I don't think there is one yet.  You're welcome to do it.  I think
  Java would work quite well for this sort of thing.  There is some
  example code in other languages (e.g. pysync) that may guide you.
 
 I will look at the python code, as suggested, as this should help
 me understand the necessary components. 

A little warning about pysync... it was developed without attempting to be
compatible in any way with existing rsync/xdelta/libhsync etc. It implements
the same algorithms as them, but the data formats take advantage of Python
data structures. It also favors algorithmic clarity over performance
tweaks; for example, it uses a pure adler32 for the rolling checksum instead
of the hybrids used by rsync and xdelta.
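
For the curious, the rolling part of a 'pure' Adler-32 looks roughly like
this (pysync's own adler32.py may differ in the details):

ADLER_MOD = 65521   # largest prime below 2^16, as in Adler-32

class RollingAdler32:
    # Adler-32 over a fixed-size window, with an O(1) update that slides
    # the window one byte instead of re-summing the whole block.

    def __init__(self, window):
        self.n = len(window)
        self.a = (1 + sum(window)) % ADLER_MOD
        self.b = (self.n + sum((self.n - i) * c
                               for i, c in enumerate(window))) % ADLER_MOD

    def roll(self, old, new):
        # Drop byte `old` from the front of the window, append byte `new`.
        self.a = (self.a - old + new) % ADLER_MOD
        self.b = (self.b - self.n * old + self.a - 1) % ADLER_MOD

    def digest(self):
        return (self.b << 16) | self.a

A handy sanity check is that digest() agrees with zlib.adler32() over the
current window contents.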

It also has zero network-transport stuff; it just calculates signatures,
calculates deltas, and applies them. Any network transport stuff is left to
the reader as a simple exercise :-)

This has its advantages; the code is very simple. It is now just over 500
lines, of which nearly 50% are comments, and this includes both rsync and
xdelta style deltas. Actually, the xdelta delta stuff is new... I'll put up a
new release that has it in it right now. I've also done some tweaks that
have made it 33% faster than version 1.2...

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




[Module] pysync 1.2

2001-03-01 Thread Donovan Baarda

   pysync 1.2
   --

A Python implementation of the rsync algorithm.

This is a demonstration implementation of the rsync algorithm in Python.
It is not fast and is not optimised. The primary aim is to provide a
simple example implementation of the algorithm for reference, so code
clarity is more important than performance. Ideas have been liberally
taken from libhsync, xdelta and rsync.

Release 1.2 introduced the new zlib-like API, allowing for incremental
calculation of deltas and application of patches. The comments at the top of
pysync.py explain it all:

# Low level API signature calculation 
sig=calcsig(oldfile) 

# Low level API rsync style incremental delta calc from sig and newdata 
delta=rdeltaobj(sig) 
# or for xdelta style incremental delta calc from oldfile and newdata 
# delta=xdeltaobj(oldfile) 
incdelta=delta.calcdelta(newdata) 
: 
incdelta=delta.flush() 

# Low level API applying incremental delta to oldfile to get newdata 
patch=patchobj(oldfile) 
newdata=patch.calcpatch(incdelta) 
: 

The rdeltaobj.flush() method supports R_SYNC_FLUSH and R_FINISH flush modes
that behave the same as their zlib equivalents. Next on the TODO list is
incremental signature calculation, and further cleanups. Eventually I plan to
create an md4sum module and move the rolling checksum stuff into C code.

The performance has been marginally hurt by this new API. Interestingly, the
Python profiler shows that most of the time is wasted performing string copies
when taking slices from input buffers, not actually doing the rsync. This
suggests that significant performance increases might be achievable by
re-arranging things a bit, rather than moving Python code into C.

I have also added a pysync-test.py script for thorough formal testing of 
pysync. It generates/reuses random test files that make pysync really work 
hard, verifying that it behaves as it should. 

Incidentally, release 1.2 also fixed a rather embarrassing bug introduced in
release 0.9's adler32.py that corrupted the rolling checksums, resulting in
heaps of missed matches. This bug caused seriously bad performance and very
large deltas.

   URL:  http://freshmeat.net/projects/pysync/
  Download:  ftp://minkirri.apana.org.au/pub/python/pysync/pysync-1.2.tar.bz2

   License:  LGPL

  Categories:  Encryption/Encoding

Donovan Baarda ([EMAIL PROTECTED])
http://sourceforge.net/users/abo/

--
ABO: finger [EMAIL PROTECTED] for more information.




Re: Q (feat. req.) rsync: using UDP, multicast

2001-02-14 Thread Donovan Baarda

Quoting Martin Pool [EMAIL PROTECTED]:

 On 14 Feb 2001, "Ph. Marek" [EMAIL PROTECTED] wrote:
  Hello once more,
[...]
  I know that's not easy achievable. But I think it could suffice to say
  "push the changes made from backup/filexyz to current/filexyz on IP
  224.x.x.x:y" and everywhere else "use IP 224.x.x.x:y to receive
 changes".
[...]
  Hmm, thinking about this I think that that could be done using diff
 and
  spooling the difference on multicast.

Sounds like an application for xdelta. If you have copies of both backup and 
current on the "source" machine, xdelta can calculate a more optimal delta than 
rsync. Then you just broadcast the delta.

--
ABO: finger [EMAIL PROTECTED] for more information.




Re: 60 byte file hangs solaris 8 rcp

2001-01-15 Thread Donovan Baarda

On Mon, Jan 15, 2001 at 10:00:31AM -0600, Dave Dykstra wrote:
 We have run across a Solaris 8 system that will hang the TCP connection
 whenever we try to rcp in the attached 60 byte file from any other system.
 The system administrator is not very responsive so I don't know if he has
 contacted Sun about it or not yet, and I don't have easy access to another
 Solaris 8 system.  There must be something special about the data because
 it started out in another portion of a larger file (first noticed by rsync)
 and we were able to narrow it down to this small piece.  I also don't know
 if there is any special networking hardware on this machine, all I know is
 that it is an Enterprise 450 with dual processors.

Wow! That must be the definitive example of "narrowing it down".

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




pysync: Python implementation of rsync algorithm.

2000-12-10 Thread Donovan Baarda

G'day again,

The first version of pysync turned out to be a bit buggy... release early
and release often is my excuse :-)

Actually, I tripped up over an obscure bug in Python's zlib to do with
decompression of sync flushes. The workaround is to decompress in smaller
chunks. This makes the code more complicated, but also makes it a bit less
memory hungry when applying deltas to large files. I also fixed another bug
that would cause patches to fail if the first block wasn't a match.
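
For reference, the workaround boils down to something like this sketch (in
today's Python; the original was 1.5/2.0-era code, and the zlib quirk has
presumably long since been fixed):

import zlib

def decompress_in_chunks(data, chunk_size=4096):
    # Feed the compressed data to the decompressor a little at a time
    # instead of in one call; this also keeps memory use down when
    # applying deltas to large files.
    d = zlib.decompressobj()
    out = []
    for i in range(0, len(data), chunk_size):
        out.append(d.decompress(data[i:i + chunk_size]))
    out.append(d.flush())
    return b"".join(out)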

I've also cleaned up the code a little, and tested it a lot more this time.
I've even tested it on some pretty big files, and I'm pleased that it
doesn't seem to be as slow as I thought it would be.

The new version is at;

ftp://minkirri.apana.org.au/pub/python/pysync-0.6.tar.bz2


-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--




Python implementation of rsync algorithm, was Re: rdiff - does it exist?

2000-12-07 Thread Donovan Baarda

On Wed, Dec 06, 2000 at 11:43:18AM +1100, Donovan Baarda wrote:
G'day,

 Quoting Martin Pool [EMAIL PROTECTED]:
[...]
  I'm changing libhsync to have an API much like zlib/bzlib, in the hope
  that this will make network app integration simpler.
 [...]
 
 I am 90% of the way through an "example implementation" of exactly this in pure 
 python. I should be finished within a week (it's only a couple of hours work, 

Well, I finally got around to finishing this off enough to seek comment. I
have implemented the rsync algorithm in pure Python. It's not fast, but it
works, and is very simple. The whole thing is about 300 lines, and half of
those are explanations and musings in comments.

I've basically implemented the command-line API Martin was talking about... so
you can generate signature files, generate delta files, and apply delta
files. There is no network transport stuff, though experienced Pythonistas
could easily whip something up if they wanted to. The signature and delta
files are just pickled Python objects, so they are not compatible with
anything else.

My bandwidth sucks a bit, so please be gentle. If I start getting hammered
I'll put it up on sourceforge code-snippets or something. For now, it can be
downloaded from;

ftp://minkirri.apana.org.au/pub/python/pysync-0.3.tar.bz2

I must confess that the source to xdelta, libhsync, and rsync all scared me
a bit, which is why I thought I'd write a simple version. Experienced rsync
coders, please don't laugh, but I'd appreciate constructive criticism. In
particular, I'd like to know if I've missed any tricks in the algo'. I'd
like to think that this could form a framework for experimenting with
various tweaks to the algorithm, but I'm probably dreaming :-)

Feel free to forward this to anyone or any other lists you might think would
be interested...

-- 
--
ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--