Re: [librsync-users] MD4 second-preimage attack
On Tue, 2006-02-21 at 14:58 -0800, [EMAIL PROTECTED] wrote:
> Hi, A year ago we discussed the strength of the MD4 hash used by rsync
> and librsync, and one of the points mentioned was that only collision
> attacks are known on MD4. Well, a recent paper by Wang et al [1] shows
> several second-preimage attacks. First, there's an algorithm which,
> given a random message and its MD4 hash, finds another message having
> the same hash with probability 2^-56. Second, if the attacker can also
> alter the original message (and thus its hash) slightly, he can find a
> second message having the same hash with just 2^27 MD4 invocations.
> For the record, here it is in the SF mail archive:
> http://sourceforge.net/mailarchive/forum.php?forum_id=29760&max_rows=25&style=flat&viewmonth=200404&viewday=5
> If I understand correctly, provided we add random seeds for blocksums,
> these weaknesses would only make attack case 4) easier. Doubtless, even
> stronger attacks will soon be found. MD4 (with known seed) is thus
> completely broken, making rsync batch mode and librsync unsafe to use
> in malicious environments. Please do consider phasing out MD4.

Remember we only use a small part of the md4sum for the strong blocksum, and hence are already using less than the full md4sum. We don't really need that many bits of hash for the blocksum. The important thing for the blocksum is speed. I think md4 + random seed is pretty much the best useful-hash-bits per second for blocksums. We can adjust the blocksum size to use more bits to compensate for the fact that ~20% of the bits are insecure if we want.

If someone can provide detailed evidence that some other algorithm provides more useful-hash-bits/sec, taking into account the random seed, I'll be convinced :-)

In any case, the first thing that will be tackled is changing to use the libmd/openssl API for hashes, which will make switching what hash we want to use painless... a configure option?
The fastest hash function with no interesting known attacks is SHA-256, which is still somewhat expensive (though this can be partially addressed by the meta-hash idea discussed on the librsync list last July, Re: more info on 25gig files). SHA-1 may also be OK for a while despite the known collision attacks, and has acceptable speed.

The whole-file checksum is the real protection against attacks, and for this I think we should use something strong, which means enough useful hash bits to protect against attack. Remember we only calculate this once over the whole file, not once for every potential block match. Also, as discussed in detail earlier: to thwart some attacks, rsync batch mode and librsync should be fixed to use a random seed.

[...]

There are a few things I'm proposing to add to the TODO:

1) Add whole-file checksums in the signature and delta. The signature will have the oldfile filesum; the delta will have the oldfile and newfile filesums. Applications will optionally be able to verify the oldfile filesum against the delta before applying patches, and patch will evaluate the newfile filesum and return an error if it doesn't match.

2) Add a blocksum/filesum seed in the signature and delta. This will be used as the seed for both filesums and blocksums. It will default to random when generating signatures, but can optionally be specified on the command line. The delta and patch operations will use the seed from the corresponding signature.

3) Add variable blocksum size in the signature. An optimal blocksum size can be calculated from the filesize, blocksize, and failure probability. The blocksize can be specified on the command line, but will default to sqrt(filesize). The blocksum size will be calculated the same as rsync. Note that for this to work with streams, the truncated md4 blocksums will not be emitted when generating the signature until the oldfile input is completely read (to get the filesize).
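A rough sketch of what a signature carrying these three additions might hold (all field names here are hypothetical illustrations, not librsync's actual on-disk format):

```python
import os
from dataclasses import dataclass, field
from typing import List

@dataclass
class Signature:
    # item 2): per-signature seed, random by default, shared by the
    # blocksums and filesums; delta/patch read it back from here
    seed: bytes = field(default_factory=lambda: os.urandom(8))
    # item 3): block size, defaulting to sqrt(filesize), so for streams
    # it can only be finalised once the whole oldfile has been read
    block_len: int = 0
    # item 1): whole-file checksum of the oldfile, so applications can
    # verify a delta is being applied to the right file
    oldfile_sum: bytes = b""
    # truncated strong blocksums, sized per the rsync heuristic
    blocksums: List[bytes] = field(default_factory=list)

sig = Signature(block_len=256)
print(len(sig.seed), sig.block_len)   # -> 8 256
```

The point of the layout is that everything the delta and patch operations need (seed, block size) travels inside the signature itself, so batch-mode tools don't need a fixed seed.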
This ties in with feature 4).

4) Add blocksum collision detection when generating the signature. This will involve keeping the whole md4 blocksums and checking that the truncated strongsum doesn't match against an existing block that has a different whole md4sum. If a collision is detected, an error will be returned. It is up to the application to do things like retry with another seed (note you can't retry for piped input).

I'm not entirely sure about using the seed for the filesums too. It may be sufficient and convenient to use an unseeded SHA-1 or something, and the delta would not need to include the seed. However, it is a no-overhead addition that should help.

-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/

-- To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
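The collision check in item 4) above can be sketched in a few lines, assuming the signature generator keeps the full sums around while it works (sha256 stands in for the seeded MD4 here, since MD4 is absent from many modern crypto builds):

```python
import hashlib

def make_blocksums(blocks, seed, trunc=8):
    """Return truncated strong sums, raising on a truncated-sum collision
    between blocks whose full sums differ (identical blocks are fine)."""
    seen = {}           # truncated sum -> full sum
    sums = []
    for block in blocks:
        full = hashlib.sha256(seed + block).digest()
        short = full[:trunc]
        if short in seen and seen[short] != full:
            # up to the application to retry with a new seed
            # (not possible for piped input)
            raise ValueError("truncated blocksum collision")
        seen[short] = full
        sums.append(short)
    return sums

print(len(make_blocksums([b"one", b"two", b"one"], seed=b"\x01")))  # -> 3
```

Note that two identical blocks share a full sum, so they pass the check; only a truncated match with differing full sums is flagged.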
Re: Rsync for program loading on embedded platforms
On Wed, 2004-06-16 at 18:01, Greger Cronquist wrote:
> Hi again, This issue solved itself rather annoyingly since we use flash
> memory in our newer devices (which had slipped my mind). So we must
> erase the memory first and then load.

Last time I played with flash, you could erase a block at a time. The flash erase and write times are quite long, and flash has a limited number of write cycles, so avoiding any re-writing is a good idea.

The simple solution here is a flash-block-aligned signature and a block-by-block compare/transfer, so only blocks that have changed are transferred/erased/written. There is no benefit to a byte-aligned rolling window/checksum, unless you do some fancy partial-in-RAM updating tricks before erasing/writing the blocks.

Another more complicated variant that might be useful if your data size is significantly larger than the flash block size would be an rsync-like algorithm that rolls a flash-block-aligned (instead of byte-aligned) window (rsync-block) through the flash. This would only be worth it if the window was at least 8 times the size of the flash blocks.

If you want to experiment with these kinds of algorithms, I thoroughly recommend pysync... it would be easy to implement and test all these variants using it.

-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
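The block-aligned compare/transfer idea can be sketched as follows (a toy helper, not pysync's API; sha256 is just a convenient strong hash for the per-block sums):

```python
import hashlib

def block_sums(image, block_len):
    # per-flash-block checksums of the currently programmed image
    return [hashlib.sha256(image[i:i + block_len]).digest()
            for i in range(0, len(image), block_len)]

def changed_blocks(old_sums, new_image, block_len):
    # only these (index, data) pairs need an erase/write cycle
    changed = []
    for idx, i in enumerate(range(0, len(new_image), block_len)):
        block = new_image[i:i + block_len]
        if idx >= len(old_sums) or hashlib.sha256(block).digest() != old_sums[idx]:
            changed.append((idx, block))
    return changed

old = b"aaaa" + b"bbbb" + b"cccc"
new = b"aaaa" + b"BBBB" + b"cccc"
print(changed_blocks(block_sums(old, 4), new, 4))   # -> [(1, b'BBBB')]
```

In the embedded scenario, only the old image's checksums need to live on the device; the host compares and sends just the changed blocks.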
Re: Rsync for program loading on embedded platforms
G'day,

From: Greger Cronquist [EMAIL PROTECTED]
[...]
> > compiled binaries are often very different for only minor source
> > changes. It would be worth analysing your data to see if there are
> > more [...]
> Is that really true---that the binaries differ that much? Isn't that
> mostly due to relocating the different functions to other areas of the
> binary? Which, I guess, might be hard for in-place rsyncing. I just did
> a quick test with two different binaries using xdelta and rdiff and the
> uncompressed deltas were less than 10% of the original size (xdelta of
> course being the smallest). So I have hopes that some kind of
> diff-related approach (even if it means keeping the old binary on a PC,
> which we do anyway for traceability reasons) might work. Depending, of
> course, on the overwrite and cpu issues.

This depends a lot on the compiler and options used. The more optimized the compiler, the more you often find small source changes resulting in significant binary changes. Stuff written in assembler usually doesn't change much though... unless you change some small macro that is used all over the place.

> > The checksum size you can use is a function of the data size, block
> > size, and your acceptable failure rate. There are threads where this
> > has been analysed and the latest rsync incorporates dynamic checksum
> > and [...]
> Please correct me if I'm wrong, but don't most of these threads deal
> with finding the optimal block size depending on the file size?

No, there are threads where the optimal blocksum size is discussed.

> For an embedded system we might well use small blocks just to be able
> to use, say, a 16 bit checksum that is much faster to compute. Or even
> 8 bit checksums for that matter. If I understand it correctly, rsync
> only ever uses one (well, two if counting both the rolling and the md4
> checksums) checksum implementation, varying the block size it's
> calculated on.
There is a relationship between the block size, file size, and blocksum size that can be used to calculate the probability of failure (ie, encountering two blocks with the same blocksum, corrupting the result). Given an acceptable failure rate (say, 1 in 2^10 file transfers) and file size, you can pick a block size and thus calculate your blocksum size. eg, for a small file:

  file_len      = 64K
  block_len     = sqrt(file_len) = 256 bytes
  blocksum_bits = 10 + 2*log2(file_len) - log2(block_len) = 34 bits

The 32-bit rollsum alone is not enough to give a better than 1 in 2^10 chance of failure... you need at least two more bits of checksum, which you may as well round up to the nearest byte. Also, the rollsum is not a very good checksum, so relying on it to provide a good 32 bits of the checksum might not be wise... particularly if you start taking into account maliciously crafted blocks.

> As of yet, this is more of an academic interest spurred by the annoying
> delays in the compile-download-test loop... :-)

Ahh... I remember them well :-) I distinctly remember one particular system that attempted to minimize the download-test loop by doing block-wise checksums... most of the time it didn't save you much, and rsync would have been a distinct improvement.

-- Donovan Baarda http://minkirri.apana.org.au/~abo/
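The arithmetic above is easy to check in a few lines (the formula exactly as given in this post, with the 1-in-2^10 failure rate supplying the 10):

```python
import math

def blocksum_bits(file_len, block_len, fail_log2=10):
    # bits of strong sum needed for a ~1 in 2^fail_log2 failure rate
    return fail_log2 + 2 * math.log2(file_len) - math.log2(block_len)

file_len = 64 * 1024                       # 64K
block_len = int(math.sqrt(file_len))       # sqrt(file_len) = 256
print(blocksum_bits(file_len, block_len))  # -> 34.0
```

With 32 bits coming from the rollsum, that leaves the 2 extra bits mentioned above, which round up to one byte of strong sum.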
Re: Rsync for program loading on embedded platforms
On Tue, 2004-06-01 at 17:44, Greger Cronquist wrote:
> Hi, Has anyone considered using the rsync algorithm for loading
> programs on embedded platforms? I.e. with low cpu power (say, 16-25 MHz
> ARM7) and low memory requirements (programs are 128 KB, and the rsync
> algorithm must use little ROM and RAM).

I've been doing some work on librsync. As I come from an embedded background, I've always been conscious of embedded requirements as I worked on it.

> Or is it just not worth the effort?

The algorithm sacrifices CPU to minimise data transfer when updating data. For it to be worth it, you must already have similar data that needs updating, and data transfer must be expensive relative to CPU. Often with embedded systems you are loading programs from scratch, so there is nothing to update. Even if you are updating programs, compiled binaries are often very different for only minor source changes. It would be worth analysing your data to see if there are more application-specific ways to minimise updates (like partitioning data into discrete parts that can be updated independently).

> I guess you'd want to at least modify the checksum calculations and use
> shorter and faster checksums (as the data to be transferred is smaller
> than on workstations), and you'd need that file overwrite patch that's
> been floating around.

The checksum size you can use is a function of the data size, block size, and your acceptable failure rate. There are threads where this has been analysed, and the latest rsync incorporates dynamic checksum and block sizing based on that discussion. librsync doesn't have it yet, but it could easily be added.

> I'd be interested to hear of anyone using or contemplating the rsync
> algorithm on embedded systems.

-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]
On Wed, 2004-05-19 at 06:10, Chris Shoemaker wrote:
> On Tue, May 18, 2004 at 10:06:52AM -0400, Alberto Accomazzi wrote:
> > Chris Shoemaker wrote: [...] that the feature is useless, but just
> > caution people that they need to understand the assumptions that this
> > use of rsync is based upon. Also, I would suggest checking out other
> > diff/patch tools such as rdiff-backup or xdelta.
> I looked at these but didn't see how they could help me in my situation
> (same as what you described). Am I missing something?

Perhaps he failed to go low enough :-) rdiff-backup is built on librsync, which has a command line tool, rdiff. rdiff has separate command line options to generate a signature, calculate a delta, and apply a delta. This allows you to build your own batch mode tools. Note that it only works one file at a time; you would have to build your own directory walking tools etc.

xdelta does the same thing, except it doesn't use signatures. It is much more like an efficient generic binary diff/patch tool. This makes it unsuitable for some applications, but its deltas are much more efficient.

-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
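For the record, the three rdiff steps look like this (filenames are placeholders; the signature, delta, and patch subcommands are rdiff's standard interface):

```shell
# on the receiver: summarise the old file into a small signature
rdiff signature old.file old.sig

# on the sender: compute a delta from the signature and the new file
rdiff delta old.sig new.file new.delta

# back on the receiver: apply the delta to the old file
rdiff patch old.file new.delta new.file
```

Only the signature and delta cross the wire, which is what makes a hand-rolled batch mode possible.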
Re: gzip-rsyncable.diff
G'day,

From: Mike McCallister [EMAIL PROTECTED]
> [...]
> http://rsync.samba.org/ftp/unpacked/rsync/patches/gzip-rsyncable.diff
> Since the patch was created a while ago and is not part of the default
> gzip yet, I was wondering if rsync users on this list had any opinions
> on the patch regarding its effectiveness and/or any problems it may
> cause with gzip. Is this patch the way to go, or am I better off trying
> out a different compression algorithm (not gzip related) altogether?

I don't know how cleanly the patch will apply to recent gzips, but it should be OK as gzip itself seems to be pretty stable these days.

I do know how it works. After posting my firm opinion that it was not possible, I was proven wrong. The idea it uses is very clever and highly effective. It will have a negative impact on the compression because it resets the compressor at semi-regular intervals. It will also miss some matches that you would get on uncompressed data, as it requires a slightly longer run of matching data to re-synchronise. However, the compression losses and missed matches are reported to be minimal. I'd say it is worth trying.

-- Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
G'day,

From: Wayne Davison [EMAIL PROTECTED]
> On Thu, Apr 08, 2004 at 03:50:48PM +1000, Donovan Baarda wrote:
> > I think I've just realised what you were getting at; if the
> > checksum_seed is based on something like the whole file md4sum, it
> > becomes repeatable, but unpredictable.
> Not so. Copy the file once, and you'd get all the data you'd need to
> create a new local file using a known-signature attack (as long as the
> input file didn't change, and that's easy to predict for many files).

I think between Eran Tromer and myself we have shown that for an md4 blocksum with 'n' bits and a file with 2^m blocks, you will have to try 2^(n-m) blocks to find a match to a known signature. For librsync's 64 bit strong_sum, even a 4G file will need 2^43 attempts, which is sufficiently hard. Sure, you have all the data you need, but that doesn't make it easy :-) Assuming no md4sum exploits are used...

> I also don't like the doubling of the I/O cost on the sending side, so
> I don't think this is a good way to go.

I agree... a random seed gives as good or better protection without the double parse for the signature. However, it does mean you don't get the same signature for the same data.

Perhaps there are some other ways to make the signature repeatable without requiring a double parse? Is using the md4sum of the first block only as a seed secure enough? I don't think it is. I think any non-random signature seed needs to take into account the whole file for it to be secure; otherwise it reduces to the birthday algorithm for crafting clashes.

-- Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
G'day,

From: Eran Tromer [EMAIL PROTECTED]
> [...]
> > librsync needs a whole file checksum. Without it, it silently fails
> > for cases 1), 3), and 4). librsync could benefit from a random
> > checksum_seed. It would need to be included in the signature. Without
> > it librsync is vulnerable to cases 1) and 3). [snip] rsync shouldn't
> > need a fixed seed for batch modes... just store the seed in the
> > signature. Using a fixed seed makes it vulnerable to 1) and 3).
> I fully agree with your analysis. I'll just note that in many
> situations, case 2 can be elevated to case 3 simply by transferring the
> file twice.

Yeah... did you see my followup post about the possibility of using the whole-file checksum as the checksum_seed for the blocksums? I think it would be a good idea for librsync. It does require a double parse to generate the signature, but is otherwise quite nice.

-- Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
G'day again,

From: Eran Tromer [EMAIL PROTECTED]
> [...]
> > if the checksum_seed is based on something like the whole file
> > md4sum, it becomes repeatable, but unpredictable. You can't
> > manipulate individual blocks without it affecting every other
> > blocksum, but the signature for the same file is always the same.
> > Nice :-)
> Nice indeed, but the cost is enormous: you'll have to read the file
> twice. When syncing a mostly-unchanged file that's larger than the disk
> cache, that means doubling the runtime (and disk load) on the
> receiver's side.

But the vast majority of the load is in the delta calculation on the sender's side.

> Also, it means 'rdiff signature' and equivalents won't work on
> arbitrary streams. Currently the receiver can tee the output of 'rdiff
> patch' into both the output file and an instance of 'rdiff signature',
> in order to save the IO of re-reading the file upon the next sync; this
> won't work anymore. (How about a built-in option for that, BTW?)

True... that is a nice feature... you are slowly turning me off the whole-file checksum as seed :-)

> > More thoughts on using the wholefile sum as the seed; at first I
> > thought this would still be vulnerable to case 3). Using any single
> > block as the original file and trying to find any block that matches
> > means you still have birthday-algorithm numbers of blocks to check
> > (ie 2^(n/2)). However, each block comparison requires the
> > recalculation of the target blocksum using the original blocksum as
> > the seed, resulting in a non-birthday-algorithm number of checksum
> > calculations (ie, 2^n).
> I'm afraid it's still vulnerable to case 3 (a pair of target and
> original files with matching blocks). For simplicity consider
> single-block files. In this case what you've done is simply to replace
> the hash function f(x) = truncate(MD4(x,fixed_seed)) with the hash
> function f'(x) = truncate(MD4(x,MD4(x,fixed_seed)))

Not quite... it's f(x,y) = truncate(MD4(x,MD4(y,fixed_seed))), where x and y are the two blocks to be compared.
This means you have to re-calculate the hash for every compare, not just once for every block.

> which happens to be the same as hashing two copies of x in sequence.
> But the birthday attack doesn't care what's the hash function; it only
> cares about its input and output sizes (which we didn't change) and
> that the function is sufficiently random.

It also assumes that the hash is done once per sample input, not once per compare. Sure, you still only need to try 2^(n/2) blocks, but you need to calculate 2^n hashes, and that's what really hurts.

-- Donovan Baarda http://minkirri.apana.org.au/~abo/
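To make the distinction concrete, here are the two constructions side by side (a sketch only: sha256 truncated to 8 bytes stands in for the seeded MD4 strong sum, and the seed value is arbitrary):

```python
import hashlib

def h(data, seed, trunc=8):
    # truncated seeded strong sum
    return hashlib.sha256(seed + data).digest()[:trunc]

FIXED_SEED = b"\x00\x01\x02\x03"

def f_fixed(x):
    # fixed-seed scheme: each candidate block is hashed once, so a
    # birthday search costs only ~2^(n/2) hash evaluations
    return h(x, FIXED_SEED)

def f_paired(x, y):
    # wholefile-sum-as-seed scheme: x's sum is seeded by y's sum, so
    # every (x, y) compare needs a fresh hash: ~2^n evaluations
    return h(x, h(y, FIXED_SEED))

print(f_fixed(b"block") == f_fixed(b"block"))                # -> True
print(f_paired(b"block", b"a") == f_paired(b"block", b"b"))  # -> False
```

The birthday search against `f_fixed` can sort and compare precomputed sums; against `f_paired` each candidate pair forces a new hash, which is where the 2^(n/2) vs 2^n gap comes from.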
Re: [librsync-devel] librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
On Thu, 2004-04-08 at 12:36, Martin Pool wrote:
> On 5 Apr 2004, Donovan Baarda [EMAIL PROTECTED] wrote:
> > librsync needs a whole file checksum. Without it, it silently fails
> > for cases 1), 3), and 4).
> Yes, a whole-file checksum should be used with it. Presumably something
> stronger than md4 like SHA-1.

md4 is probably good enough for most applications. I know it's not as secure as others, but when you take into account the requirement that the signature match as well, compromising it becomes much more complicated.

> I think the only question is whether this should be done internally in
> librsync, or as a separate process. I can see arguments either way. In
> some cases you might prefer to actually store a signed signature using
> something like GPG.

Yeah... good point. The rdiff program should probably use a whole file md4sum though.

> > librsync could benefit from a random checksum_seed. It would need to
> > be included in the signature. Without it librsync is vulnerable to
> > cases 1) and 3).
> Random with respect to what? I think it would be nice if repeatedly
> summing identical files gave identical signatures. Maybe it can vary
> depending on only the input data...

The problem is that repeatability is what makes it vulnerable. If the content fully determines the signature, the content can be crafted to produce a particular signature. It is relatively easy to calculate two different blocks with the same blocksum if the checksum_seed is known.

librsync could be modified to detect when two blocks had the same blocksum in a signature, and permutate the checksum seed in a deterministic way, retrying until a signature without collisions is found. This would give repeatable signatures and protect against case 1), but not case 3). It would also allow crafted files that force m/2 recalculations of the signature.

I think I've just realised what you were getting at; if the checksum_seed is based on something like the whole file md4sum, it becomes repeatable, but unpredictable.
You can't manipulate individual blocks without it affecting every other blocksum, but the signature for the same file is always the same. Nice :-)

This would fit in nicely with adding a whole-file checksum to the signature... generate the wholefile md4sum, and use that as the starting seed for the individual blocksums. The wholefile sum becomes the seed. This doesn't allow for permutating the seed if it happens to result in blocksum clashes. However, given that the probability of this happening is pretty astronomical (when using the wholefile sum seed) and it will be caught by a whole-file checksum mismatch, do we care? It is far more likely to get false block matches when calculating the delta (because sums are calculated at every byte boundary, not just at block boundaries).

More thoughts on using the wholefile sum as the seed; at first I thought this would still be vulnerable to case 3). Using any single block as the original file and trying to find any block that matches means you still have birthday-algorithm numbers of blocks to check (ie 2^(n/2)). However, each block comparison requires the recalculation of the target blocksum using the original blocksum as the seed, resulting in a non-birthday-algorithm number of checksum calculations (ie, 2^n).

I've Cc'd this to the rsync list because they might get some ideas from it.

-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
librsync and rsync vulnerability to maliciously crafted data. was Re: MD4 checksum_seed
G'day again,

Just revisiting an old thread after some more thought. Eran and I were discussing the vulnerability of librsync and rsync to deliberate attempts to craft blocks with matching signatures but different content. It turns out it's disturbingly easy. Here's a bit of context;

From: Donovan Baarda [EMAIL PROTECTED]
> From: Eran Tromer [EMAIL PROTECTED]
> > Donovan Baarda wrote:
> > > [...] I thought that without using some sort of known vulnerability
> > > the formula would be; avg number of random attempts = number of
> > > possible sum values / 2
> > Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
> > Roughly speaking, after looking at N blocks you have N*(N-1)/2
> > unordered pairs of observed blocks, and each pair gives a collision
> > with probability 2^-64, so you need N=~2^32. [...]
> Doh! Of course... it was exactly the same paradox that showed how high
> the probability of a collision was in rsync for large files. You are
> not searching for a collision with a particular block, just between any
> two blocks.

I'm not entirely sure about rsync, but apparently such an attack will make it slow down, but it will not corrupt files. librsync on the other hand will (currently) silently corrupt files. Before librsync users panic: how much do you really care that a maliciously crafted file is corrupted? In rsync this can be used as a performance degradation attack. However, librsync by default uses a pseudo-random checksum_seed, which makes such an attack nearly impossible (however... the original discussion started around using a fixed checksum_seed in batch mode).

I've done a bit more thinking on this, and there are four ways crafted blocks can be used;

1) two crafted blocks in the original file
2) two crafted blocks in the target file
3) a crafted pair of target and original files with matching block(s)
4) a block in the target crafted to match a block in the original

For case 1), it is possible to detect collisions at signature generation time.
Using a random checksum_seed makes it nearly impossible. Combining the two, selecting a new random_seed when a collision is detected, would make this pretty much impossible. Note that a random_seed can be used for batch mode and librsync, but the seed would need to be included in the signature.

For case 2), the crafted blocks would not match the original, and hence would just be miss data that is directly copied. Currently librsync and rsync don't do any sort of target self-referencing matches, so this cannot affect librsync or rsync. Note that there has been discussion on adding this to librsync, and it is supported by the vdelta format. When/if such a thing is added, similar conditions to 1) apply to the target matching algorithm. Any such addition should probably use xdelta-style matching anyway, which compares the data, not a strong_sum, making it immune to crafted data.

For case 3), crafting a pair of files is made nearly impossible by using the protections for case 1). It may be possible to craft a file that has a reduced set of possible signatures even for a random_seed, but this would require vulnerabilities in the strong_sum and/or random seed generator. Barring such a vulnerability, a random_seed randomises the signature for the original, reducing this to case 4).

Case 4) is the nasty one. The protection against case 1) ensures that the signature is randomised for a given original, but once you have the signature, you know the random_seed, and all elements of randomness in the rest of the algorithm are gone. However, this is not a true birthday paradox situation; you are not trying to find any two blocks that match, but a block that matches any block in the original. This is nowhere near as easy.
The number of attempts required is:

  n ~= 2^(b'-m)

where:
  n  is the number of attempts required to find a collision
  m  is from "there are 2^m blocks in the signature"
  b' is the number of useful bits in the signature (excluding the rollsum)

Interestingly, the dynamic blocksum heuristic now in rsync is:

  B = sqrt(s)
  b = 10 + 2*log2(s) - log2(B)

where:
  B is the block size
  s is the file size

32 bits of 'b' is the rollsum, which means, assuming the 32 bits of rollsum are useless and hence should not be included in b':

  m  = log2(s)/2
  b' = -22 + 2*log2(s) - log2(s)/2 = -22 + (3/2)*log2(s)

so the number of attempts is:

  n = 2^(-22 + log2(s))

so the larger the file, the harder it is to compromise, because b' grows faster than m. Note that rsync has a lower limit on b' of 16, which corresponds to a filesize of 32M. This equates to n = 2^3 attempts, which is rather low... in the realm of on-the-fly calculation time.

For librsync, which uses a fixed b'=64 and default B=2048, this becomes:

  n = 2^(64 - log2(s/2^11)) = 2^(75 - log2(s))

which is 2^43 for a 4G file... not easy at all. For rsync, a malicious sender could easily craft blocks that match the signature but have different content. Big deal. A malicious sender doesn't have
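The two worked examples above can be reproduced directly (returning log2 of the attempt count, per n = 2^(b'-m)):

```python
import math

def attempts_log2(file_len, block_len, useful_bits):
    # m = log2(number of blocks); n = 2^(b' - m) attempts
    m = math.log2(file_len / block_len)
    return useful_bits - m

# librsync: fixed b' = 64, default block size 2048, 4G file
print(attempts_log2(2**32, 2048, 64))        # -> 43.0

# rsync at its lower limit b' = 16, 32M file, B = sqrt(s)
s = 32 * 2**20
print(attempts_log2(s, math.sqrt(s), 16))    # -> 3.5
```

The rsync lower-limit case comes out at about 2^3.5, matching the "roughly 2^3 attempts" figure in the post.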
Re: sort of like a diff patch file
G'day,

On Mon, 2004-03-29 at 13:37, Steve W. Ingram wrote:
> Hi there, I was wondering if there was any way to use rsync to
> effectively create a 'diff' file?

Is this a FAQ yet? A) rdiff.

-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: sort of like a diff patch file
On Mon, 2004-03-29 at 15:31, Greger Cronquist wrote:
> For just creating diffs, xdelta is even better (in that it creates
> smaller diffs very quickly)

xdelta requires that you have local access to the two files you want to diff... librsync's rdiff allows you to calculate a small signature which you diff against, allowing you to calculate diffs without local access to the original you are diffing against. This can be handy for many applications. We've lost the original application enquired about... but certainly xdelta should be better if you have local access to both files.

-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: MD4 checksum_seed
G'day,

From: Eran Tromer [EMAIL PROTECTED]
> Donovan Baarda wrote:
> > [...] I thought that without using some sort of known vulnerability
> > the formula would be; avg number of random attempts = number of
> > possible sum values / 2
> Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
> Roughly speaking, after looking at N blocks you have N*(N-1)/2
> unordered pairs of observed blocks, and each pair gives a collision
> with probability 2^-64, so you need N=~2^32. [...]

Doh! Of course... it was exactly the same paradox that showed how high the probability of a collision was in rsync for large files. You are not searching for a collision with a particular block, just between any two blocks.

> > I have seen several variants for the rollsum; rsync: uses a basic
> > fletcher.
> Except that strictly speaking, Fletcher uses one's-complement
> arithmetic. [...]

Ahh, someone that actually knows :-) I wasn't sure of the exact definitions and names.

> Quoth RFC 1950: "That 65521 is prime is important to avoid a possible
> large class of two-byte errors that leave the check unchanged." Not
> much of a difference in an adversarial setting, though.

And not much of a difference I think for rsync in general... I wonder just how large the "possible large class" is?

> > librsync's offset probably does nothing, but I have a feeling it
> > might help a bit for blocksizes less than 256 bytes, because it at
> > least helps use more of the checksum domain for the s1 part. Without
> > it you don't end up using the high bits of s1 at all for small
> > blocks.
> It doesn't matter. You just shift the small range of possible values --
> instead of using a small range of small numbers you'll use a small
> range of large numbers. [...]

Yeah... that makes sense...

> > It might be possible to craft a mapping like xdelta's but using a
> > special sequence calculated for best effect, much like the
> > polynomials in crc32 and crc64.
> Possibly, for natural or random data (as opposed to adversarial).
But in such contexts, random choice is often close to optimal with high probability. I'll take your word for it :-). Actually I think the polynomials in crc are crafted for common line errors. I'm not sure you can do much crafting without making certain assumptions about the types of errors. In our case I don't think we can make assumptions about the types of errors. Have you looked at the librsync rollsum.c implementation in CVS? It replaced rs_calc_weak_sum and should be a fair bit faster. Uhm. Usually loop unrolling is best left for the compiler, and it's counterproductive with some modern processors. Is rollsum.c faster than the old implementation with -O3? I can't remember... there were other cleanups associated with changing to using rollsum that may have had an impact. The manual unrolling was taken straight from the adler32 implementation in zlib. I come from an embedded system background where the C compilers are often pretty basic, and manual unrolling makes a big difference. I've been thinking of putting together tests for librsync using the check framework so that we can easily test and time things like this. Donovan Baarda http://minkirri.apana.org.au/~abo/
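The birthday-paradox arithmetic in this thread is easy to sanity-check numerically. A quick sketch (the function names here are mine, purely illustrative, and not anything in rsync or librsync):

```python
# Back-of-envelope birthday-paradox numbers for a k-bit hash.
import math

def expected_blocks_for_collision(bits):
    """Blocks needed for a ~50% chance of *some* pair colliding in a
    `bits`-bit hash: roughly 1.1774 * 2^(bits/2)."""
    return 1.1774 * 2 ** (bits / 2)

def collision_probability(n, bits):
    """P(at least one collision) among n random blocks, using the
    standard approximation 1 - exp(-n*(n-1) / 2^(bits+1))."""
    return 1.0 - math.exp(-n * (n - 1) / 2.0 ** (bits + 1))

# For a 64-bit sum, ~2^32 random blocks already give a likely collision,
# versus ~2^63 attempts to hit one *specific* value (a second preimage).
print(round(math.log2(expected_blocks_for_collision(64)), 1))  # 32.2
print(round(collision_probability(2 ** 32, 64), 2))            # 0.39
```

This is exactly why searching for a collision between any two blocks is so much cheaper than searching for a collision with a particular block.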
Re: MD4 checksum_seed
On Tue, 2004-03-16 at 10:44, Eran Tromer wrote: Hi, On 2004/03/15 03:49, Donovan Baarda wrote: [...] Just to be sure, I wrote some quick-and-dirty collision search code for the rsync and librsync hashes. I used a generic collision-finding algorithm (namely Gabriel Nivasch's multistack variant of the generalized Rho method) to search for a truncated MD4 collision among the blocks whose Fletcher sum is zero. It took 13ms to find a collision for the default parameters of rsync version 2.6.0, with either checksum_seed=0 (disabled) or checksum_seed=32761 (the value fixed for batch mode). With a 4 byte strong hash, which is the most you're likely to see with Schultz's code, the above took 1.4 seconds. For librsync with default parameters (8-byte strong hash), finding a collision took a couple of hours. Here are examples of colliding pairs for each of the above cases: http://dl.tromer.org/misc/rsync-collisions.tgz [...] with an 8-byte hash that means you tested approximately 2^64/2 crafted Fletcher-busting blocks to find a collision, yeah? Did that really take only a couple of hours? Man, computers are heaps faster now! By my calculations, even processing them at one block every microsecond, that would have taken over 272 years. Are you sure? You must have got _very_ lucky, or there are some serious flaws in md4. Perhaps there is a serious problem with how we are truncating the md4? I just checked your librsync sample blocks... you are right. The first 8 bytes of md4sum are the same, the rollsums both evaluate to 0, and yet the data is different. How the hell did you do that? I glanced at Gabriel Nivasch's stuff but I can't see how it could have shortcut it so much... The 2^(k/2) attempts assumes random attempts... how does this compare to crafted-to-bust-Fletcher attempts? [...] Just noticed you use 2^(k/2), not (2^k)/2... 
this must be part of the reason :-) I thought that without using some sort of known vulnerability the formula would be; avg number of random attempts = number of possible sum values / 2 Which for a 64 bit sum would be (2^64)/2. I guess this assumes that the sum function when applied iteratively to itself the way Gabriel Nivasch suggests will walk through the entire sumspace before repeating. Does the reduced search space assume some sort of distribution of repeat loops in the sum algorithm? Wouldn't this be highly dependent on the checksum algorithm used? Does checksum_seed really need to be random for the checksum2 to be random? I know md4 is considered vulnerable compared to md5, but does it have a poor distribution for a fixed seed? If it does, would it make sense to switch to md5 rather than randomise the seed? checksum_seed must be random for the checksum2 to be random -- simply, because there is no other source of randomness in the hash calculation. I was actually thinking pseudo-random, as in evenly distributed, and with no exploitable pattern, ie still requiring an exhaustive search with no known shortcuts. And if everything is deterministic then you can find collisions in advance and put them in the data. This doesn't have anything to do with the cryptographic strength of MD4. But if the attacks of [1] can be adapted to our scenario then things will become even *worse* than the above, and this is a good reason to switch to MD5. [...] Hmmm. yeah. The deterministic nature of a fixed seed is a bit of a problem. You can still use the same formula, just don't count checksum1 in the checksum size part of it. Or depending on how much you think it's worth you could count it as less than 32 bits worth of checksum, not the full 32 bits. In an adversarial setting the Fletcher checksum is worthless. So to obtain the intended 2^-10 failure probability, rsync should increase its strong checksum by 4 bytes. That's for rsync in normal mode, where checksum_seed is randomized. 
For fixed checksum_seed (i.e., rsync in batch mode and librsync) the failure probability in an adversarial setting is 1, as demonstrated above. Scary. A couple of related notes: librsync really ought to include a whole-file hash, like rsync. Maybe also an option of retransmitting the file in case of failure (where possible). Either way, failure must be reliably detected. Besides increasing reliability, in many cases it would allow use of smaller block checksums. You can't just entrust integrity checking to a subsequent invocation of md5sum or such; re-reading the files is expensive and not always possible. Yeah. I've been thinking the same for some time. Because librsync separates the signature/delta/patch operations, retransmission is an application-level decision, but librsync should certainly warn you when it fails. The use of CHAR_OFFSET is a pure waste of CPU cycles. It merely amounts to adding constants to s1 and s2 (blocklen*CHAR_OFFSET and blocklen*(blocklen+1)/2*CHAR_OFFSET, respectively). These depend only on the block length, which you know
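The CHAR_OFFSET observation above is easy to verify with a toy reimplementation of the Fletcher-style sums. This is a hedged sketch, not rsync's actual code, but the identity it checks is the one stated: the offset version differs from the offset-free one by exactly those two length-dependent constants.

```python
# Verify: adding CHAR_OFFSET to every byte only shifts s1 by
# blocklen*CHAR_OFFSET and s2 by blocklen*(blocklen+1)/2*CHAR_OFFSET
# (all arithmetic mod 2^16, as in the 16-bit halves of the weak sum).
CHAR_OFFSET = 31  # illustrative value, matching rsync's convention

def fletcher(block, offset=0):
    s1 = s2 = 0
    for byte in block:
        s1 = (s1 + byte + offset) & 0xFFFF
        s2 = (s2 + s1) & 0xFFFF
    return s1, s2

block = bytes(range(200)) * 2   # any data will do
n = len(block)
s1o, s2o = fletcher(block, CHAR_OFFSET)
s1, s2 = fletcher(block, 0)
assert s1o == (s1 + n * CHAR_OFFSET) & 0xFFFF
assert s2o == (s2 + n * (n + 1) // 2 * CHAR_OFFSET) & 0xFFFF
```

Since both constants are fixed once the block length is fixed, an attacker (or the implementation) can add or remove them for free, which is why the offset buys neither strength nor distribution.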
Re: MD4 checksum_seed
On Wed, 2004-03-10 at 16:43, Eran Tromer wrote: [...] Note that, above, block hash collisions are very easy to find if you know checksum_seed. The rolling Fletcher checksum1 is trivially defeated. To defeat the k-bit truncated MD4 checksum2, just keep generating random blocks having the same checksum1 until you find two with the same checksum2; by the birthday paradox it will take about 2^(k/2) attempts, where usually k=16 or k=24 with J.W. Schultz's code. I haven't really thought much about carefully crafted data to bust things. I didn't think the Fletcher checksum was that bad. Although I can see that it would be fairly trivial to craft data for collisions, wouldn't it be harder to craft data for a Fletcher collision _and_ attempt to produce a checksum2 collision? The 2^(k/2) attempts assumes random attempts... how does this compare to crafted-to-bust-Fletcher attempts? Another note is that Donovan Baarda's formula for the probability of retransmission (and thus J.W. Schultz's code) assumes that the hashes are random. This is a reasonable assumption for the truncated MD4 checksum2 when checksum_seed is random, but it's a pretty rotten assumption for the Fletcher checksum1. Does checksum_seed really need to be random for the checksum2 to be random? I know md4 is considered vulnerable compared to md5, but does it have a poor distribution for a fixed seed? If it does, would it make sense to switch to md5 rather than randomise the seed? For the purpose of evaluating the probability of retransmission, rsync uses s2length bytes of good hash plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really apply. You can still use the same formula, just don't count checksum1 in the checksum size part of it. Or depending on how much you think it's worth you could count it as less than 32 bits worth of checksum, not the full 32 bits. 
-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: [patch] Add `--link-by-hash' option.
On Tue, 2004-02-10 at 07:48, Jason M. Felice wrote: This patch adds the --link-by-hash=DIR option, which hard links received files in a link farm arranged by MD4 file hash. The result is that the system will only store one copy of the unique contents of each file, regardless of the file's name. Does this mean it also automatically detects renames? Anyone have an example of an MD4 collision so I can test that case? :) How do you recover from that case? Patch Summary:

 -1  +1    Makefile.in
 -0  +304  hashlink.c (new)
 -1  +21   options.c
 -0  +6    proto.h
 -5  +21   receiver.c
 -0  +6    rsync.c
 -0  +7    rsync.h

If this does everything I think it does, then it's a surprisingly small amount of changes for what it does. -- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
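For what it's worth, the link-farm idea can be sketched in a few lines. This is a hypothetical illustration, not the patch's code: `link_by_hash` is a name I made up, and it uses SHA-256 in place of MD4 (which also sidesteps the collision question raised above).

```python
# Sketch: store one copy of each unique file content in a farm keyed
# by its hash, and hard-link received files to it.
import hashlib
import os

def link_by_hash(path, farm_dir):
    """Hard-link `path` into `farm_dir`, deduplicating by content hash."""
    with open(path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    target = os.path.join(farm_dir, digest)
    if not os.path.exists(target):
        os.link(path, target)   # first file with this content seeds the farm
    else:
        os.remove(path)         # content already stored: replace the file
        os.link(target, path)   # with a hard link to the farm copy
    return target
```

After processing two files with identical contents, both end up sharing one inode, which is the whole point of the option.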
Re: File rename detection?
On Thu, 2004-01-29 at 06:27, jw schultz wrote: On Wed, Jan 28, 2004 at 09:06:52PM +0300, ??? wrote: Hello! As I found, rsync does not detect file renaming. If I just copy my backup.0.tgz (many MBytes in size, having its md5) to backup.1.tgz (which will be equal in size and md5) it will be the same file in fact... Rsync will delete the old file (backup.0.tgz) on the receiving side and download the new one (backup.1.tgz). I do not think there are any difficulties in detecting this situation and following the natural way: just rename the file on the receiving side. Am I right? Reliably detecting renamed files in a manner that rsync can act on is very difficult. I believe rsync currently sends whole-file md4sums along with the signature. At least when using the -c option, it _could_ use the md4sums to identify identical but moved files. Actually implementing it is another story :-) -- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
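The whole-file-checksum approach to rename detection could look something like the sketch below. It is illustrative only: `file_digest` and `detect_renames` are names I made up, MD5 stands in for rsync's md4sums, and none of rsync's actual protocol is shown.

```python
# Index the receiver's files by content hash, then look up each
# sender-side file: a hit on a different name is a rename candidate.
import hashlib

def file_digest(path):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

def detect_renames(old_paths, new_paths):
    """Return {new_path: old_path} for files whose contents match."""
    by_digest = {file_digest(p): p for p in old_paths}
    return {p: by_digest[d] for p in new_paths
            if (d := file_digest(p)) in by_digest}
```

The lookup itself is cheap; as the reply notes, the hard part is wiring such a detection into rsync's transfer logic so it can act on it.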
Re: how rsync works
On Tue, 2003-11-25 at 06:04, jw schultz wrote: On Mon, Nov 24, 2003 at 09:14:56AM -0600, John Van Essen wrote: [...] I had thought so too but now i'm less sure. I put it up on October 26th -- for discussion. In the next 14 days there were hits from 107 unique IPs, 73 of which were in the first 4 days. Then it settled down to a slow, declining, trickle that could have been naught but robots. But no one said anything online or off. So on the 8th of November i took it down. Now over two weeks after i took it down and four since putting it up you are the first person to even acknowledge it. I had a look at it and thought it was good... The intent was to, if there is interest, get feedback to polish the document for placement on the web site. Or to let someone who is perhaps a better writer take it over. Not to self-publish the thing. Perhaps i had too high expectations but i expected at least one or two would say something, even if it was just yuck or it's a start before knowledge of it subsided into archives. I didn't comment at the time because none of the above applied to me, and I was slack :-( I'll put it back up if i hear more than just a whimper. At the time I read it I thought; good, something I can point people at when they ask me. I think it was a good resource. -- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: Synching text files.
On Thu, 2003-10-23 at 05:39, Kurt Roeckx wrote: On Wed, Oct 22, 2003 at 11:46:22AM -0700, jw schultz wrote: On Wed, Oct 22, 2003 at 08:31:12PM +0200, Kurt Roeckx wrote: Hi, I'm trying to sync text files where lines get inserted and deleted. From what I understand, rsync uses blocks to compare and so can't find much data that matches. Would it be possible to make it work so it finds blocks that get inserted or deleted? It already does. Does it only do it on block basis? Or will it try to find an offset inside the block? It will find matching blocks at arbitrary byte offsets. Think of the original file as a sequence of fixed sized blocks. Inserting or deleting a single byte breaks that block so it no longer matches, but rsync will match all the blocks before and after that non-matching block. Note that one thing that can break rsync for text files is MSDOS CR/LF vs Unix LF line termination. This effectively makes a change every line, and unless you have lines longer than the block size you are using, rsync will not be able to find a single match. -- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
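The block-matching behaviour described above can be illustrated with a toy version of the algorithm. Real rsync uses a rolling weak sum plus a strong sum to make the byte-by-byte slide cheap; this sketch just looks blocks up in a dict, but it shows how a one-byte insertion spoils only the block containing the insertion point.

```python
# Toy rsync-style matcher: hash fixed-size blocks of the old file,
# then slide a window over the new file one byte at a time.
BLOCK = 8

def match_blocks(old, new):
    """Return the number of bytes of `new` covered by old-file blocks."""
    sums = {old[i:i + BLOCK] for i in range(0, len(old) - BLOCK + 1, BLOCK)}
    matched, i = 0, 0
    while i <= len(new) - BLOCK:
        if new[i:i + BLOCK] in sums:
            matched += BLOCK
            i += BLOCK          # consume the matched block
        else:
            i += 1              # slide one byte, like the rolling sum
    return matched

old = b'0123456789abcdefghijklmnopqrstuv'   # 4 blocks of 8 bytes
new = old[:10] + b'X' + old[10:]            # one byte inserted mid-file
# match_blocks(old, new) finds 3 of the 4 blocks: only the block
# containing the insertion point fails to match.
```

This is also why CR/LF-vs-LF conversion is so damaging: it changes every line, so every block is spoiled rather than just one.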
Pysync 2.24 release, was Re: rsync on OpenVMS
On Tue, 2003-10-14 at 11:01, Donovan Baarda wrote: On Mon, 2003-10-13 at 13:00, John E. Malmberg wrote: jw schultz wrote: On Sun, Oct 12, 2003 at 12:38:40AM -0400, John E. Malmberg wrote: [...] I have not heard of unison. I have heard that pysync was successful in a limited test on OpenVMS. As near as I can tell though the librsync it is based on is a bit out of date. [...] Something possibly worth trying on it is psyco... it compiles python to native code on the fly using a simple import psyco. Pure python is a bit slow compared to native C implementations, but psyco could help close the gap a bit. Following up on this... I tried using psyco with python2.2 and it cut the pysync tests on my machine from 21secs down to 14secs... that's a 33% speedup. In the past I'd tried using pyrex to speed things up with no success. psyco not only gives a better boost, but is much easier to use. I haven't touched pysync for a while, but it should still work with the latest librsync as the API hasn't changed. If there are any problems, please let me know. I believe rdiff-backup also has a python wrapper for librsync that might be more advanced than the one in pysync. I have plans for both pysync and librsync, but I haven't worked on them much lately. I find I am mostly motivated by feedback from others when funding is not available :-) This little bit of interest motivated me to have a look at it again, and I've just released version 2.24. From its NEWS:

Updates between release 2.16 and 2.24 -

 * Added TODO and NEWS files.
 * Changed to use psyco if available, giving a 33% speedup.
 * Updated to use librsync 0.9.6.
 * Changed to using a faster md4sum implementation based on the librsync implementation, modified to use the RSA API.
 * Added rollin/rollout support to historical adler32.py.
 * Minor cleanups to rollsum code.
 * Minor tweaks to handling of block fragment matching.
-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: rsync on OpenVMS
On Mon, 2003-10-13 at 13:00, John E. Malmberg wrote: jw schultz wrote: On Sun, Oct 12, 2003 at 12:38:40AM -0400, John E. Malmberg wrote: [...] I do not know but if OpenVMS support is a problem for rsync proper you might wish to look at pysync or unison which might meet your immediate needs. I have not heard of unison. I have heard that pysync was successful in a limited test on OpenVMS. As near as I can tell though the librsync it is based on is a bit out of date. cool... someone playing with pysync :-) pysync includes a pure-python implementation that will run on anything that can run python, as well as a basic wrapper for librsync. It also includes extension modules for md4sum and rollsum that speed it up a fair bit. Something possibly worth trying on it is psyco... it compiles python to native code on the fly using a simple import psyco. Pure python is a bit slow compared to native C implementations, but psyco could help close the gap a bit. librsync is still under active development by myself and others, and recently had a new release. I haven't touched pysync for a while, but it should still work with the latest librsync as the API hasn't changed. If there are any problems, please let me know. I believe rdiff-backup also has a python wrapper for librsync that might be more advanced than the one in pysync. I have plans for both pysync and librsync, but I haven't worked on them much lately. I find I am mostly motivated by feedback from others when funding is not available :-) -- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: [librsync-devel] Re: state of the rsync nation?(revisited 6/2003 from 11/2000)
On Mon, 2003-08-04 at 11:05, Martin Langhoff wrote: Donovan Baarda wrote: For the record, librsync version 0.9.6 is _almost_ ready in CVS. A bug has been detected but not isolated yet for large files (2G+). If it's not squashed soon I'm tempted to release anyway with a KNOWN_BUGS that reports this. Last night I found and squashed a bug that caused the problem. I'm waiting on feedback from the user that reported it to confirm it is now fixed. Donovan, I am just looking around for hints on compiling librsync under cygwin (gcc), and happened to find this post of yours promising 0.9.6, which runs under cygwin... http://groups.google.com/groups?hl=enlr=ie=UTF-8threadm=b7jecs%242sdr%241%40FreeBSD.csie.NCTU.edu.twrnum=2prev=/groups%3Fq%3Dcygwin%2Blibrsync%26hl%3Den%26lr%3D%26ie%3DUTF-8%26sa%3DN%26tab%3Dwg I know that post was ages ago, but due to work commitments I never got a chance to release it. Every time I got a small chance to look at it again, I kept finding little niggly things that really should have been fixed before release. The CVS is only intermittently updated, so I only bother tagging on release, which still hasn't happened yet. The CVS HEAD is the current release candidate. I don't commit unless it passes a make distcheck so it should never be broken. Are the sources at least tagged, so I can get them from CVS? What tags? Your timing is impeccable. Just last night I finalised the MSVC updates and fixed the last known bug. That means there is nothing outstanding for the release... this should be 0.9.6, but I'm giving it a day or so to get feedback before I tag and release. There is a delay on SF between developer and anon CVS, so it is possible the final checkins are not yet visible. In the meantime, there is a tarball available at; http://minkirri.apana.org.au/~abo/projects/librsync/librsync-0.9.6.tar.gz This compiles and builds on win32 using cygwin and/or MSVC. 
-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: Large files and symlinks
On Thu, 2003-07-31 at 06:53, jw schultz wrote: [...] In many cases invoking --partial is worse than not. If you are rsyncing a 4GB file and transfer is interrupted after 500MB has been synced you get a 500MB file which now has less in common with the source than the 4GB file did. A more useful behaviour for --partial would be to concatenate the partial download to the end of the old basis, rather than replace it... this leaves you with a much more useful partial result to resume from. Of course this behaviour could be _very_ confusing to people... :-) -- Donovan Baarda [EMAIL PROTECTED]
Re: Large files and symlinks
On Thu, 2003-07-31 at 10:01, jw schultz wrote: On Thu, Jul 31, 2003 at 09:22:51AM +1000, Donovan Baarda wrote: On Thu, 2003-07-31 at 06:53, jw schultz wrote: [...] In many cases invoking --partial is worse than not. If you are rsyncing a 4GB file and transfer is interrupted after 500MB has been synced you get a 500MB file which now has less in common with the source than the 4GB file did. A more useful behaviour for --partial would be to concatenate the partial download to the end of the old basis, rather than replace it... this leaves you with a much more useful partial result to resume from. Of course this behaviour could be _very_ confusing to people... :-) Interesting idea. I don't know that it would be all that confusing. You'd have to truncate the basis to the length of the source to prevent it growing with each failure. Even appending 3.5GB to a 4GB file once is problematic. The simplest solution is to write the partial download over the beginning of the old file, leaving the end part as it was. This way you are making the sensible assumption that most of the matches from the start of the file match the partial download, and the remainder will match the rest when you resume. A match locality heuristic? I suspect this will be less confusing for end users too... the partial download will have its size unchanged, and when you look at the data you will be able to see that it synchronised up to point xxx. It will look like a partial in-place update. If we were to append to the existing file it might make sense to append only those portions that were updates. That would require keeping track of the offset+length of each change block. Yuck, that is much more work than it is worth. Yeah... very overkill. One idea that i think has real merit would be to combine some kind of change-rate score with an evaluation of comparative sizes of the tempfile and the original file to decide if replacing the original or leaving it would be more efficient. 
If there was no data-reuse then replacement would be in order. If there was a high rate of reuse it wouldn't. If the reuse was middling you would consider the comparative sizes. The formula would probably be pretty simple. If someone comes up with a patch that does that i'd be willing to entertain it. I'm not convinced this would be worth the effort... I'm sure in many cases the beginning of the file is where most of the changes are, so throwing away the end on the basis of poor matches at the start is a bad idea. -- Donovan Baarda [EMAIL PROTECTED]
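The write-the-partial-over-the-beginning idea discussed above reduces to a one-line splice. A sketch (`resume_basis` is a hypothetical helper, not an actual rsync option):

```python
# Build the basis for the next attempt: the partial transfer replaces
# the head of the old file, and the old tail survives for matching.
def resume_basis(old: bytes, partial: bytes) -> bytes:
    return partial + old[len(partial):]
```

As long as the partial transfer is shorter than the old file, the basis keeps the old file's size, matching the "partial in-place update" description: synced data up to some point, old data after it.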
Re: Rolling Checksum Algorithms
On Sun, Jul 20, 2003 at 01:06:11PM +0100, [EMAIL PROTECTED] wrote: Hi, Where can I get good pointers on the rolling checksum algorithm used in rsync? I need an 8-bit or 12-bit rolling checksum too. Any place where rolling checksum algorithms are discussed? You can have a look at the rsync white paper. You can also look at pysync for sample implementations in Python; http://freshmeat.net/projects/pysync/ I think I've posted on this list some comments about various variations on the rolling checksum. I think the rsync whitepaper credits it as being a variation on the Adler checksum, but I think it is closer to something I've seen called a Fletcher checksum. Pysync has a Python implementation of a rolling adler32 checksum, and has comments about the variations in rsync, librsync, and xdelta. -- Donovan Baarda http://minkirri.apana.org.au/~abo/
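A minimal rolling checksum in the Fletcher/Adler family looks roughly like the sketch below: the point is the rotate step, which slides the window one byte in O(1) instead of re-summing the whole block. This is written in the spirit of pysync's adler32.py, not copied from it, and the class name is mine.

```python
# Rolling Fletcher-style checksum with O(1) window slide.
class Rollsum:
    def __init__(self, block=b''):
        self.count = len(block)
        # s1 = sum of bytes; s2 = sum of the running prefixes of s1.
        self.s1 = sum(block) & 0xFFFF
        self.s2 = sum((len(block) - i) * b
                      for i, b in enumerate(block)) & 0xFFFF

    def rotate(self, old, new):
        """Slide the window one byte: drop `old`, append `new`."""
        self.s1 = (self.s1 + new - old) & 0xFFFF
        self.s2 = (self.s2 + self.s1 - self.count * old) & 0xFFFF

    def digest(self):
        return (self.s2 << 16) | self.s1
```

Sliding `Rollsum(data[0:n])` by one byte with `rotate(data[0], data[n])` yields exactly `Rollsum(data[1:n+1])`, which is what makes scanning a file at every byte offset affordable. For an 8-bit or 12-bit variant, shrink the masks and the digest packing accordingly.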
Re: New wildmatch code in CVS
Quoting Wayne Davison [EMAIL PROTECTED]: If you've been watching CVS, you may have noticed that I checked in some new files named wildmatch.c and wildmatch.h. This code implements the shell-style wildcard matching with rsync's extension that ** matches [...] build farm. One thing I discovered is that the various fnmatch() calls all seem to handle the character-class boundary cases in conflicting ways (things like []-]). So, one benefit for rsync of switching from fnmatch to wildmatch will be to make the callers of this function behave consistently on all platforms (which primarily affects the exclude code). [...] This might explain why Python implements its own fnmatch.py using regex's. Anyone have any concerns or comments on switching over to this new code? Only one concern, a few questions, and a maybe-suggestion; The concern: Why the name wildmatch? It seems a bit too arbitrary to me. I have used the name efnmatch (extended fnmatch) for it in my Python implementations. The name wildmatch is too generic, whereas efnmatch clearly indicates it is an extension to the standard fnmatch. A silly concern I know, but it will make my life easier when I start making Python extension modules out of your code to use in mine :-) Some questions: How did you implement it (I know, I should just look in CVS, but while I'm typing...)? Does it use regexes or a modified implementation of fnmatch? How does it compare performance-wise with a regex based implementation? The reason I'm curious is Python, for whatever reason, implements fnmatch in Python using regex's rather than using a C python extension (possibly to avoid the fnmatch variations you identified). I'm wondering if it would be worth re-implementing fnmatch (and efnmatch) as C extension modules. The maybe-suggestion: I found by implementing efnmatch using regex's, it was painless to add the ability to use regex's in include/exclude lists. 
This meant include/exclude lists could be built using either efnmatch wildcards or regex's, as they would all be converted, compiled, and matched as regex's anyway. I don't know how regex matching compares to fnmatch matching performance-wise. I'm also aware that people have expressed concerns about linking in/against largish regex lib's. However, if the option of using regex's for include/excludes is ever going to happen, then it might be an idea to use them for this. Personally, I feel the efnmatch functionality is flexible enough to never require regex's, but I've seen a few enquiries in the past.. ABO
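Python's fnmatch module is a concrete example of the translate-to-regex approach described above: the glob pattern is compiled to a regex once, so matching behaves identically on every platform instead of depending on the local C fnmatch().

```python
# Glob patterns compiled to regexes via the stdlib fnmatch module.
import fnmatch
import re

pattern = re.compile(fnmatch.translate('*.[ch]'))
assert pattern.match('hashlink.c')      # C source matches
assert not pattern.match('options.o')   # object file does not
```

Pre-compiling a whole include/exclude list this way is exactly the scheme suggested: every entry, glob or raw regex, ends up as one compiled regex object before any path is tested.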
Re: [librsync-devel] Re: state of the rsync nation? (revisited6/2003 from 11/2000)
On Wed, 2003-06-11 at 16:13, Martin Pool wrote: On 11 Jun 2003, Donovan Baarda [EMAIL PROTECTED] wrote: On Wed, 2003-06-11 at 13:59, Martin Pool wrote: [...] On 11 Jun 2003, Donovan Baarda [EMAIL PROTECTED] wrote: I forget if I saw this in Tridge's thesis, but I definitely noticed that librsync uses a modified zlib to make feeding data to the compressor and throwing away the compressed output more efficient. I have implemented this in pysync too, though I don't use a modified zlib... I just throw the compressed output away. Yes, I remember that, but that's not rzip. Just had a look at rzip... interesting... haven't yet digested its impact on things like xdelta yet though... hence the rambling explorations in this email. By the way the gzip hack is an example of a place where I think a bit of extra compression doesn't justify cluttering up the code. I think I'd rather just compress the whole stream with plain gzip and be done. I originally had the same thoughts... but then testing on real-world data showed me a 5~10% improvement on compression when using it, so I included it in pysync. priming the compressor with match data does make a difference when you have a lot of match data interspersed with your miss data. However, I agree that it is probably much less efficient than using a compression scheme designed around the match encoding rsync already uses. See http://samba.org/~tridge/phd_thesis.pdf pg 86ff rzip is about using block search algorithms to find widely-separated identical blocks in a file. (I won't go into detail because tridge's explanation is quite clear.) I am pretty sure you could encode rzip into VCDIFF. I am not sure if VCDIFF will permit an encoding as efficient as you might get from a format natively designed for rzip, but perhaps it will be good enough that using a standard format is a win anyhow. Perhaps building a VCDIFF and then using bzip/gzip/lzop across the top would be acceptable. 
I'm sure you could encode rzip into vcdiff, and you would lose very little, and possibly gain a bit (if you add byte-extending matches backwards, matches are never byte aligned, and vcdiff can efficiently encode any short or long offsets _and_ lengths). In fact rzip has more in common with xdelta than rsync, since it works entirely locally and can find blocks of any length. Yeah, I noticed it was very similar in some ways... though xdelta is a bit more brute force in that it builds a massive hash table for the whole file using a small block size. rzip is a very nice way to minimise the size of the hash table without sacrificing too much. Also Tridge doesn't have byte-extending matches backwards, which xdelta does. rzip can't extend matches backwards because the exponent based offset encoding can't support arbitrary byte aligned long offsets. Right now I'm thinking the big innovation rzip provides is not the exponent based offset coding, but a neat way of minimising the hash table. rzip's advantage compared to gzip/bzip2 is that it can use compression windows of unlimited size, as compared to a maximum of 900kB for bzip2. Holding an entire multi-100MB file in memory and compressing it in a single window is feasible on commodity hardware. I just realised you already said what I typed above :-) The self referencing compression idea is neat but would be a... challenge to implement. For it to be effective, the self-referenced matches would need to be non-block aligned like xdelta, which tends to suggest using xdelta to do the self-reference matches on top of rsync for the block aligned remote matches. Fortunately xdelta and rsync have heaps in common, so implementing both in one library would be easy (see pysync for an example). After looking at rzip, I'm convinced xdelta needs to use rzip's expiring hash table approach to minimise the hash table. xdelta is evil on gigabyte-size files. 
-- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: questions about librsync
On Thu, 2003-06-12 at 04:12, Jeff Frost wrote: I'm not sure if this is the appropriate forum for questions regarding librsync, but couldn't find any others. I'm trying to get librsync working properly on Solaris 2.7 and 2.8 Sparc servers. The problem is that while librsync appears to compile cleanly, make check fails the sources.test. Does anyone have any insight as to why this might be? Might I need a specific version of some other library? Any help will be much appreciated. For starters, the best place to discuss librsync is on the librsync-devel or librsync-users list on SourceForge; http://sourceforge.net/mail/?group_id=56125 Next, make sure you are using the CVS version of librsync from the librsync project on SourceForge. If you are still having problems, I'd be interested to know... we are trying to get librsync ready for a 0.9.6 release. -- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: state of the rsync nation? (revisited 6/2003 from 11/2000)
On Tue, 2003-06-10 at 14:25, Martin Pool wrote: On 8 Jun 2003, Donovan Baarda [EMAIL PROTECTED] wrote: The next big thing in delta calculation is probably going to be the vcdiff encoding format, which should allow a common delta format for various applications and supports self-referencing deltas, which makes it capable of compression. According to the xdelta project this has already been implemented, and I'm keen to see Josh's code, as it could be used as the basis for a cleanup/replacement of at least the patch component of librsync. Do you have a link for this? Josh plays his cards pretty close to his chest. The XDelta page seems to be even more inactive than librsync :-/ Yeah, the xdelta page on SourceForge has a few announcements about vcdiff support that are fairly recent, but everything else hasn't been touched for ages. It looks like Josh is only using SF as a static front-page where he posts announcements and all the real work is happening somewhere else... Unfortunately I don't know where that somewhere else is... hence my eagerness to actually see the code. The vcdiff standard is available as RFC3284, and Josh is listed as one of the authors. I also had some correspondence with Josh ages ago where he talked about how self-referencing deltas can directly do compression of the miss data without using things like zlib and by default gives you the benefits of rsync's context compression without the overheads (rsync runs a decompressor _and_ a compressor on the receiving end just to regenerate the compressed hit context data). -- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: [librsync-devel] Re: state of the rsync nation? (revisited6/2003 from 11/2000)
On Wed, 2003-06-11 at 13:59, Martin Pool wrote: On 11 Jun 2003, Donovan Baarda [EMAIL PROTECTED] wrote: The vcdiff standard is available as RFC3284, and Josh is listed as one of the authors. Yes, I've just been reading that. I seem to remember that it was around as an Internet-Draft when I started, but it didn't seem clear that it would become standard so I didn't use it. I'm not sure if this is the same one... I vaguely recall something like this too, but I think it was an attempt to add delta support to http and had the significant flaw of not supporting rsync's delta-from-signature. It may have come out of the early xdelta http proxy project. IMHO rproxy's http extensions for delta support were better because they were more general. There was also another thing I saw which was a compact delta representation spec that I think librsync uses (perhaps it was you who had some discussion about it in the old librsync TODO?), and may have influenced the vcdiff RFC, but AFAIK was never official in any way. I also had some correspondence with Josh ages ago where he talked about how self-referencing deltas can directly do compression of the miss data without using things like zlib and by default gives you the benefits of rsync's context compression without the overheads (rsync runs a decompressor _and_ a compressor on the receiving end just to regenerate the compressed hit context data). Something possibly similar is mentioned in tridge's thesis. I was talking to him a while ago and (iirc) he thought it would be good to try it again, since it does well with the large amounts of memory and CPU time that are available on modern machines. I forget if I saw this in Tridge's thesis, but I definitely noticed that librsync uses a modified zlib to make feeding data to the compressor and throwing away the compressed output more efficient. I have implemented this in pysync too, though I don't use a modified zlib... I just throw the compressed output away. 
The self referencing compression idea is neat but would be a... challenge to implement. For it to be effective, the self-referenced matches would need to be non-block aligned like xdelta, which tends to suggest using xdelta to do the self-reference matches on top of rsync for the block aligned remote matches. Fortunately xdelta and rsync have heaps in common, so implementing both in one library would be easy (see pysync for an example). If I didn't have paid work I would be prototyping it in pysync right now. If anyone wanted to fund something like this I could make myself available :-) I strongly agree with what you said a while ago about code simplicity being more valuable than squeezing out every last bit. Yeah, my big complaint about librsync at the moment is it is messy. Just cleaning up the code alone will be a big improvement. I would guess that at least 30% of the code could be trimmed away, leaving a cleaner and more extensible core, and because messy code tends to be inefficient code, it would be faster too. -- Donovan Baarda [EMAIL PROTECTED] http://minkirri.apana.org.au/~abo/
Re: state of the rsync nation? (revisited 6/2003 from 11/2000)
as a test bed for rsync 3 type development (superlifter?). For the future I can see continued support of the existing rsync code. I would also like to see librsync adopt vcdiff as its delta format, and get a major cleanup, possibly by re-using some xdelta code. There are many common elements to the xdelta and rsync algorithms, and I see no reason why a single library couldn't support both (as pysync does). It would be nice if librsync and/or xdelta could become _the_ delta library. -- Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: MD4 checksum fix
On Sun, 2003-04-06 at 10:06, [EMAIL PROTECTED] wrote: I agree, they should be done together. I don't have my original patch but I can reimplement it with the correct remote_version dependence and send it in the next couple of days (by Thursday evening). My intent is the minimal set of changes, rather than changing the internals. That is a workable timeframe, thanks. Two days late, but here is the patch against 2.5.6. This implements the following changes: - for protocol version >= 27, mdfour_tail() is called when the block size (including checksum_seed) is a multiple of 64. Previously it was not called, giving the wrong MD4 checksum. - for protocol version >= 27, a 64 bit bit counter is used in mdfour.c as required by the RFC. Previously only a 32 bit bit counter was used, causing incorrect MD4 file checksums for file sizes >= 512MB - 4. It looks like it will work OK, but it's kinda ugly in that it starts embedding version stuff into the mdfour implementation. Still... it's better than the nothing I've produced :-) -- Donovan Baarda http://minkirri.apana.org.au/~abo/
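Both halves of the bug above come down to MD4 finalization (RFC 1320): the padding appends a 0x80 byte, zero-fills until the length is congruent to 56 mod 64, then appends a 64-bit little-endian bit count. A message whose length is already a multiple of 64 therefore still needs a whole extra padding block, which is exactly the case the old tail code skipped; and a 32-bit bit counter overflows at 2^32 bits = 512MB, which is the file-size bug. A small Python sketch of the padding-length arithmetic (illustrative only, not the rsync code):

```python
def md4_pad_len(msg_len):
    """Bytes of padding MD4 appends to a msg_len-byte message:
    one 0x80 byte, k zero bytes so that (msg_len + 1 + k) % 64 == 56,
    then an 8-byte little-endian bit count (RFC 1320)."""
    k = (56 - (msg_len + 1)) % 64
    return 1 + k + 8

# A 64-byte input (e.g. a block whose size plus the 4-byte checksum_seed
# is a multiple of 64) needs a whole extra 64-byte padding block --
# skipping the tail in that case is what gave the wrong checksum.
print(md4_pad_len(64))   # -> 64
print(md4_pad_len(55))   # -> 9 (padding fits inside the same block)
```

The padded length is always a multiple of 64, so the final block is never empty.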
Re: MD4 checksum fix
On Sun, 2003-04-06 at 16:29, [EMAIL PROTECTED] wrote: It looks like it will work OK, but it's kinda ugly in that it starts embedding version stuff into the mdfour implementation. Still... it's better than the nothing I've produced :-) Yes, it's certainly not elegant. An alternative would be to have two different sets of MD4 routines and then have more logic in checksum.c to call the correct routines. That makes for more changes to checksum.c but avoids a library depending upon a global variable. I don't think you need two whole implementations of MD4 routines, just a replacement _tail function and a replacement _result function to use it. Note that truncating the bit count to 32 bits can be done in the _tail function. It might even turn out easier to just write a _result replacement that munges the mdfour struct (ie, truncates the byte count and empties the tail buffer) before calling the correct _result implementation. These can be put into checksum.c with the appropriate protocol version logic. That way the mdfour.c implementation becomes a fixed stand-alone md4 implementation (that can be replaced with faster/cleaner standard implementations further down the track). Or we could add a new function in mdfour.c, eg: mdfour_broken(), that gets called once at startup if remote_version < 27. This would avoid the need for md4 to pull in remote_version. This still puts protocol version specific code into mdfour.c, making it harder further down the track to drop in a replacement implementation or link against a standard md4 library implementation. -- Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: MD4 checksum fix
On Tue, 2003-04-01 at 09:35, jw schultz wrote: On Tue, Jan 28, 2003 at 11:22:14PM -0800, Craig Barratt wrote: And I have several things I would like to work on and submit: - Fix the MD4 block and file checksums to comply with the rfc (currently MD4 is wrong for blocks of size 64*n, or files longer than 512MB). [...] But before I work on these I would like to make sure there is interest in including them. I've not heard from you on the adaptive checksum length patch. I shall be committing it shortly subject to objections or further discussion. It looked OK to me from an implementing what we discussed point of view... I noticed you used the test square for each bit algorithm instead of the Newton-Raphson implementation for the square-root. It's certainly more complex and I don't know if it is actually faster, but it does avoid divide ops. I don't know enough about the rest of the code to comment on whether it will break the protocol though. I would like to see the MD4 checksums fixed as well. We are very close to the upper limit on protocol versions for deployed versions of rsync. Therefore, i would like to minimize protocol increments for a while. In any other circumstance i wouldn't suggest doing so but i think it would be a good idea to integrate these two fixes in one protocol bump. If you, or someone else, has the fix for the MD4 sums handy i would be glad to coordinate in implementing both sets of patches at about the same time. If you want to send it to me, that would be fine. I can also hold off for a short while for the MD4 sum patch to get some testing. librsync includes a fixed and optimized md4sum implementation in CVS on sourceforge. It should be compatible with the rsync implementation, as that's where it originally came from and the API has not (yet) changed, despite the implementation being fairly overhauled. 
However, I also have another even more refined (20% less code, and faster) md4sum implementation based on this, but the API has changed to conform with the RSA implementation for inclusion in libmd. I hope to change librsync to use this API after releasing 0.9.6. This implementation is available here; http://minkirri.apana.org.au/~abo/projects/libmd/ If you were going to start changing the md4sum code, I would encourage migrating to the RSA API as it _seems_ to be more widely used (certainly in BSD land) and would allow linking against the libmd library where it is available. Unfortunately, backwards compatibility probably means you will need to include a broken implementation as well as the fixed implementation. The best way I can think to do this is to provide an additional broken MD4Final (or rs_mdfour_result using the old librsync API) that implements the old behaviour. All the brokenness was in the finalizing of the md4sum, so this single alternative MD4FinalBroken routine should be all you need. If people ask really nice and are not in a hurry, I could probably whip such a function up to supplement the libmd API implementation. Someone else would have to do the smarts of protocol version juggling and supporting the changed API. I'm not particularly interested in supporting the old rsync/librsync mdfour API. -- Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: MD4 checksum fix
On Tue, 2003-04-01 at 13:34, jw schultz wrote: On Tue, Apr 01, 2003 at 12:41:39PM +1000, Donovan Baarda wrote: On Tue, 2003-04-01 at 09:35, jw schultz wrote: [...] Comparative complexity is a matter of perspective. I think it is a bit more deterministic. It has been a few years. I was talking in simple LOC what-the-hell-does-this-do terms. As this is not heavily exercised (once per file), it doesn't really matter. I'm not particularly interested in supporting the old rsync/librsync mdfour API. I can understand that. I do think the API change and the md4 bug fix should be done separately. The API change is really internal; it is the bug fix that affects the protocol. I'm not 100% sure how close the current mdfour librsync API is to the rsync API. If they are the same, you should just be able to drop it in as a replacement, and grab the broken implementation of rs_mdfour_tail to make a new rs_mdfour_broken_result. Hmmm... if someone else was going to do all the rest of the integration work, I _could_ do this bit. I'd rather do it for the RSA API, but it wouldn't be too much work to change this bit over later on. All i'm offering here is to coordinate with someone fixing the bug or to take the fix (if it can be agreed upon) from someone who can encapsulate it into a patch. That's just it: I don't really have time to do this myself, and don't have the familiarity with the wider rsync code to do it. I've been focusing on librsync (and will continue to do so). -- Donovan Baarda http://minkirri.apana.org.au/~abo/
Re: [RFC] dynamic checksum size
On Mon, Mar 24, 2003 at 12:54:26AM +1100, Donovan Baarda wrote: On Sun, Mar 23, 2003 at 03:46:34AM -0800, jw schultz wrote: On Sun, Mar 23, 2003 at 05:45:47PM +1100, Donovan Baarda wrote: On Sun, 2003-03-23 at 07:40, jw schultz wrote: [...] The block_size heuristic is pretty arbitrary, but the blocksum_size calculation is not, and calculates the minimum blocksum size to achieve a particular probability of success for given file and block sizes. You can replace the block_size heuristic with anything, and the second part will calculate the required blocksum size. I do have a variant that scales the length divisor for block size from 1024 to 16384 as the length progresses from 1MB to 128MB. This in conjunction with your sum2 length formula produces this: That sounds pretty arbitrary to me... I'd prefer something like having it grow at the square-root of filesize... so 1K for 1M, 8K for 64M, 16K for 256M, 32K for 1G, etc. A thought occurred to me after writing this; a viable blocksize heuristic is just a fixed block size. This makes the signature size almost proportional to the filesize, except for the growth in blocksum size. I don't necessarily advocate it though. I think increasing the blocksize is a good idea as files grow, because file and signature size also contribute to CPU load. -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
Re: [RFC] dynamic checksum size
On Mon, 2003-03-24 at 03:36, jw schultz wrote: On Mon, Mar 24, 2003 at 12:54:26AM +1100, Donovan Baarda wrote: On Sun, Mar 23, 2003 at 03:46:34AM -0800, jw schultz wrote: [...] CHUNK_SIZE is used in mapping the file into memory. I'm not absolutely sure (haven't spelunked that part of the code) what would happen if block size were to exceed it. I suspect it would be OK because i don't recall hearing anyone complain as a result of using the --block-size option with a huge value. Did the old code limit it to 16K even after people manually set it to something larger? The problem with growing the block size indefinitely is that you must balance the cost of the checksums with the cost of sending large blocks for small changes. Yeah, but you also need to keep it in perspective; if you are syncing a 4G file, do you really care much that it transmitted a 64K block instead of a 16K block? And that's assuming that you could actually have sent a 16K block instead. The ideal block size to minimize data transfer is more based on the nature of the changes than the file size. However, given no further info than the size of the file, I would guess it's a fair heuristic to assume a large file will have large changes, which means larger blocks don't lose you much. How many 260GB files do you have? On the other hand, with the fixed block length we see a diminishing advantage. We also take a hit with the hashing, match lookups and memory footprint. I think this is more of a concern than the cost of sending slightly larger blocks. That sounds pretty arbitrary to me... I'd prefer something like having it grow at the square-root of filesize... so 1K for 1M, 8K for 64M, 16K for 256M, 32K for 1G, etc. I most strongly lean toward a bounded square-root of file size. It produces a nice smooth curve on resource consumption diminishing load/MB as the files grow... until you hit the boundary. I dislike boundaries and step functions because they break those nice smooth curves :-) I just hadn't a decent fast integer sqrt function at hand. I don't really care to start bringing in the math lib just for this one calculation. I've cobbled together one that is OK. [...] In case you didn't notice, the count non-zero shifts is an approximate integer log2 function. There are integer approximations for sqrt that you can use, including Newton-Raphson;

int sqrt(int R)
{
	int x, x1;
	x = 1024; /* starting guess close to what we expect */
	do {
		x1 = x;
		x = (x1 + R/x1) >> 1;
	} while (x - x1);
	return x;
}

Clearly we are getting into the realms of sci-fi here. Even with the 16KB block length limit, files in the 1-16GB range should be manageable on almost any system that is likely to have them. If we must have an upper limit on block size, I'd prefer it to be continuous up to the 4G mark. This means an upper limit of 64K. The s2length calculation is using the algorithm you provided:

#define BLOCKSUM_EXP 10

Might be worth putting a comment here;

/* The following implements the function;
 *     blocksum_bits = BLOCKSUM_EXP + 2*log2(len) - log2(block_len)
 * which gives a probability of the rsync algorithm corrupting data
 * and falling back to whole file transfer of 1/2^BLOCKSUM_EXP */

This is something future generations might want to tweak or perhaps grok :-)

b = BLOCKSUM_EXP;
l = len;
while (l >>= 1) {
	b += 2;
}
c = block_len;
while ((c >>= 1) && b) {
	b--;
}
s2length = (b + 1 - 32 + 7) / 8;
s2length = MAX(s2length, 2);

Probably want to add;

s2length = MIN(s2length, MD4SUM_SIZE);

just to be on the safe side :-) -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
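The shift loops in the s2length fragment above are integer approximations of the two log2 terms in that comment; a Python restatement (my own translation, not the committed rsync code) makes this easy to check:

```python
BLOCKSUM_EXP = 10

def s2length(file_len, block_len):
    """Python translation of the C shift loops:
    b = BLOCKSUM_EXP + 2*log2(file_len) - log2(block_len),
    then strip the 32-bit rollsum and round up to whole bytes,
    with a floor of 2 bytes."""
    b = BLOCKSUM_EXP
    l = file_len >> 1
    while l:                 # each doubling of len adds 2 bits
        b += 2
        l >>= 1
    c = block_len >> 1
    while c and b:           # each doubling of block_len removes 1 bit
        b -= 1
        c >>= 1
    return max((b + 1 - 32 + 7) // 8, 2)

print(s2length(8 << 20, 1 << 10))     # -> 2 (8MB file, 1KB blocks)
print(s2length(128 << 20, 16 << 10))  # -> 3 (128MB file, 16KB blocks)
```

For powers of two these match the formula exactly; in between, the shift counts round the logs down, which errs on the conservative side.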
Re: [RFC] dynamic checksum size
On Mon, 2003-03-24 at 16:33, jw schultz wrote: On Mon, Mar 24, 2003 at 10:56:42AM +1100, Donovan Baarda wrote: On Mon, 2003-03-24 at 03:36, jw schultz wrote: [..] hadn't a decent fast integer sqrt function at hand. I don't really care to start bringing in the math lib just for this one calculation. I've cobbled together one that is OK. [...] In case you didn't notice, the count non-zero shifts is an approximate integer log2 function. There are integer approximations for sqrt that you can use, including Newton-Raphson; Actually, Newton-Raphson is not really an approximation (when compared to count non-zero shifts), but you can turn it into one by stopping iteration when the answer is close enough.

int sqrt(int R)
{
	int x, x1;
	x = 1024; /* starting guess close to what we expect */
	do {
		x1 = x;
		x = (x1 + R/x1) >> 1;
	} while (x - x1);
	return x;
}

I'm more concerned with cycle count and cache effects than loop iterations. Division, at least historically, is an expensive operation compared to multiplication. The above converges surprisingly quickly... for arbitrary numbers up to 2^36 it typically converges within 10 iterations. For numbers under 2^20 it seems to converge within 5 iterations. Also, this is only done once per file to determine the block-size to use. If you are really concerned about divisions, there is another algorithm that uses only multiplies, and does one iteration per bit in the solution... for a 64bit integer it is 32 iterations to find the solution. It's nice for implementing in assembler on micros without a hardware division, but it's a bit messier to implement in C. If we must have an upper limit on block size, I'd prefer it to be continuous up to the 4G mark. This means an upper limit of 64K. I'm ambivalent. We don't seem to have a problem with huge block sizes. 
One possibility would be to allow the user to set a ceiling by adding an option (--max-block-size). My inclination at the moment (may change my mind) would be to have no upper bound and see if anyone else has a problem. If problems appear we could either impose a limit or add a --max-block-size. I think --max-block-size would be a waste of time... people can always specify a particular block size if they are unhappy with the heuristic. I suspect that no-limit will prove to work well. On the whole i'd say we are zeroing in on something good. I'll see about regenerating the patches. Yeah... I think I'll have to implement this heuristic in pysync now :-) -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
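The convergence claims two messages up are easy to check with the same iteration in Python (a sketch, not the rsync code; the termination test is tightened to "the iterate stopped decreasing", because the plain `while (x - x1)` form can ping-pong forever between floor(sqrt(R)) and floor(sqrt(R))+1 for inputs such as R = 8):

```python
def newton_isqrt(R, guess=1024):
    """Integer Newton-Raphson square root with the same fixed starting
    guess as the C sketch. Returns (floor(sqrt(R)), iterations)."""
    if R < 2:
        return R, 0
    x = (guess + R // guess) // 2   # after one step x >= floor(sqrt(R))
    n = 1
    while True:
        x1 = (x + R // x) // 2
        n += 1
        if x1 >= x:                 # no longer decreasing: x is the floor root
            return x, n
        x = x1

print(newton_isqrt(2**36)[0])   # -> 262144 (= 2^18)
print(newton_isqrt(10**6))      # -> (1000, 2): near the guess, converges at once
```

The iteration is roughly quadratic once it gets close, but the fixed 1024 guess costs a string of halving steps when R is far from 2^20, so for inputs near 2^36 it takes somewhat more than the 10 iterations claimed above.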
Re: [RFC] dynamic checksum size
On Sun, 2003-03-23 at 07:40, jw schultz wrote: [...] The two attached patches implement per-file dynamic checksum lengths. Nice one. Here is a quick table showing the file sizes at which transition occurs, the sum2 length and the jump in size of the block sum table transmitted:

    filesize    sum2len    transition
     <125MB        2
      125MB        3          8KB
        1GB        4         32KB
        4GB        5        128KB
       16GB        6        523KB
       64GB        7          2MB

My reading of the patch doesn't quite match with the above. The block size heuristic is 1/10000 the filesize, limited between some minimum and maximum. Ignoring the limits, this means the number of blocks is kept at a constant 10000, leaving you stuck at three bytes. This heuristic is a stab in the dark. It may be that it is too aggressive in the rate of increase or insufficiently aggressive at starting the increase. It may be that 8 bits of sum2 in conjunction with the 32 bit adler is enough for smaller files. We may also wish to revisit the heuristic for block sizes that fixes the block count at 10000 for files between 7MB and 160MB, or coordinate them so that the transitions would be easier. Both heuristics can be changed at will without breaking compatibility. I posted some more maths on this a while ago to the old rproxy list to someone inquiring about how this affects librsync. In summary, a conservative simplification of the relationship between block size, blocksum size, and file size is;

    p = (s^2)/(B.(2^b))

where:
    p is the probability of a false match (ie, corrupt result)
    s is the file size
    B is the block size
    b is the number of bits in the block checksum (blocksum size)

Note that blocksum size is the total blocksum including the rollsum. If you plug in an acceptable probability of failure, you get a direct relationship between file size, block size, and blocksum size. Assuming a 1/2^n probability of failure is OK, you get;

    B = (s^2)/(2^(b-n))
    b = n + 2.log2(s) - log2(B)

Note that assuming a fixed blocksum size, the block size must grow at the square of the file size. 
Assuming a fixed block size, the blocksum must grow at the log of the filesize. I think it makes sense to grow the block size linearly with file size (which keeps a constant number of blocks), and then have blocksum size grow as necessary to make up the difference by default. It might also be good to have the ability to specify either blocksum size or block size and have the other estimated using the above equations. Of course you will want to round up the blocksum size to the nearest byte. Using a heuristic that aims for 2^m number of blocks and plugging this into the above equations gives;

    B = s/(2^m)
    b = n + log2(s) + m

(Note: 32bits of this is the rollsum) Assuming 8K number of blocks (m=13), and an acceptable error rate of less than 1 in 1024 file transfers falling back to whole file (n=10), you get;

    file size    block size    blocksum size
       8M            1K        46 (sum2len=2)
      32M            4K        48 (sum2len=2)
     128M           16K        50 (sum2len=3)
     512M           64K        52 (sum2len=3)
       2G          256K        54 (sum2len=3)
       8G            1M        56 (sum2len=3)
      32G            4M        58 (sum2len=4)
     128G           16M        60 (sum2len=4)

As you can see from this, your heuristic is very conservative for large files, but optimistic for smaller files. Might I suggest the following code fragments;

#define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
#define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */

if (block_size) {
	block_len = block_size;
} else {
	block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */
	block_len = MAX(block_len, BLOCK_SIZE);
	block_len = MIN(block_len, CHUNK_SIZE / 2);
}

size_t b = BLOCKSUM_EXP;
size_t l = len;
while (l >>= 1) {
	b += 2;
}
size_t c = block_len;
while ((c >>= 1) && b) {
	b--;
}
sum.s2length = (b + 1 - 32 + 7) / 8; /* add a bit, subtract rollsum, round up */

-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
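The table above follows directly from the two equations; a short Python restatement (my own, with b counting the 32-bit rollsum plus the strong sum) reproduces the blocksum-size and sum2len columns:

```python
from math import log2

def blocksum_bits(file_len, n=10, m=13):
    """b = n + log2(s) + m: total blocksum bits (32-bit rollsum included)
    when the file is split into 2^m blocks and the accepted corruption
    rate is 1/2^n."""
    return n + log2(file_len) + m

def sum2len(b):
    """Strong-sum bytes: (b + 1 - 32 + 7) / 8, floored at 2."""
    return max(int(b + 1 - 32 + 7) // 8, 2)

for size, label in [(8 << 20, "8M"), (128 << 20, "128M"), (128 << 30, "128G")]:
    b = blocksum_bits(size)
    print(label, int(b), sum2len(b))   # reproduces the 46/2, 50/3, 60/4 rows
```

With block count held at 2^m, each 8x growth in file size adds 3 bits to b, so sum2len grows by one byte roughly every 8x-64x of file size, as in the table.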
Re: [Apt-rpm] I: [PATCH] 0.5.4cnc9: rsync method support
On Fri, 2003-02-21 at 18:42, Sviatoslav Sviridov wrote: [...] And other thing I want to discuss: rsync is now designed as a standalone program, but it would be good if a library with the protocol implementation were created, and the rsync program built on top of this library. Hmmm, sounds like http://librsync.sourceforge.net/. The current rsync implementation is tightly coupled and difficult to split out. There have been various attempts to do this by starting from scratch. librsync is probably the most advanced rsync algorithm implementation, and there have been various experiments with alternative protocols. Someone even announced a perl implementation that is rsync compatible (impressive)! However, librsync lags behind rsync in various ways... the most important of which is active development. -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
Re: rsync vs. rcp
On Thu, 2003-02-20 at 05:55, va_public wrote: I got used to rsync's -v --progress option so much that I used it instead of rcp even to simply copy files across the network. I don't like software that doesn't talk to me! :-) I like the percentage bar that --progress gives! To my surprise, I found that, when dealing with 1GB+ files, rsync is 4-5 _times_ slower than rcp. RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently large block size. See the following; http://www.mail-archive.com/rsync@lists.samba.org/msg05219.html This probably needs to be documented somewhere newbies can find it. -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
Re: rsync vs. rcp
On Thu, 2003-02-20 at 08:55, James Knowles wrote: RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently large block size. According to the archives, block size doesn't fix anything. At any rate, I'm highly disappointed that rsync is relying on statistical good fortune. Increasing block size does increase the probability that the rsync algorithm will succeed. We've used rsync extensively in our company for moving information around. With tens of gigabytes of data at stake, it's very disappointing. Now we have to dump rsync, redo all of our backups, and possibly write our own tool. I don't want to futz around with something that I can no longer trust. If you read the thread, rsync the tool is reliable; just the rsync algorithm does not work (as currently implemented). The rsync tool will fall back to a whole file transfer if the rsync algorithm fails, as indicated by a full md4sum of the whole file. This does mean rsync the tool will end up transferring the file twice when this occurs; once using the rsync algorithm, and again sending the whole file. As for rsync relying on statistical good fortune, that should have been clear from the beginning. However, because of the whole file checksum used to verify the final transfer, there is a 1/2^128 chance of it getting it wrong per file. That is less than a 1/10^36 chance, or 1 in so-many-billions-of-billions-of-billions-that-it's-not-worth-worrying-about. So if you spend from now until eternity transferring every file that ever existed again and again... you might get one file wrong, but not very likely. I think there is a higher probability of getting corruption doing a 'cp' directly onto the same HDD (cosmic rays etc). -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
Re: rsync vs. rcp
On Thu, 2003-02-20 at 08:53, va_public wrote: --- In [EMAIL PROTECTED], Donovan Baarda [EMAIL PROTECTED] wrote: On Thu, 2003-02-20 at 05:55, va_public wrote: RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently large block size. See the following; http://www.mail-archive.com/rsync@l.../msg05219.html OK. I read the thread. Pretty interesting design discussion on rsync internals. Has any of it been implemented in 2.5.6? When is it planned to implement it? So, bottom line, you are telling me, don't use rsync for my Oracle database backups? That's a bummer. Oracle database files are pretty sparse and, at a block level, very few things change, so I was hoping that using 'rsync' instead of 'rcp' to do my daily backups would be IMMENSELY faster. If you notice, the critical thing in the formula is the number of _different_ bytes, not really the whole file size. If you know less than 100M of the file is different, then use the block size recommended for a 100M file. For Oracle database files, you probably also want to use a block size that is a multiple of whatever internal unit size Oracle uses. As per the final messages on the thread, just increasing the csum from 2 to 4 bytes seemed to miraculously solve the problem, right? So why isn't this patch being included? Because it breaks backwards compatibility with older rsync versions... it needs a protocol version change, which is a bit trickier to introduce without causing problems. -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
Re: rsync vs. rcp
On Thu, 2003-02-20 at 11:36, Craig Barratt wrote: RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently large block size. See the following; http://www.mail-archive.com/rsync@lists.samba.org/msg05219.html Let's be careful here. Rsync *does* work on 1GB+ files. What you probably meant to say was that rsync's first pass might fail on files that have 1GB+ of changes. But the second pass (which uses the full MD4 block checksum) will work correctly. Yeah, I was referring to the rsync algorithm as implemented in rsync, not the rsync tool as a whole. Even then, it arguably does not work in terms of it ends up transferring much more data (up to 2x) than a pure scp does. It does end up correctly transferring the file though. So a more correct statement is that Rsync might work *slowly* on files with 1GB+ of changes because two passes are required. BTW, rsync already has an adaptive block size unless you set it on the command-line. The block size is roughly min(16384, max(700, file_size/10000)), ie: file_size/10000, but no less than 700 and no more than 16384. I wasn't aware that it had this. Was it there at the time of the original discussion (Oct 2002)? The people involved in the discussion then didn't seem to know this. However, it's not really adequate. A 16K block size only really works for files up to about 500M. Still... that's a lot better than I thought it was at the time. -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key --
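The adaptive block size described above is a one-liner; here is a Python restatement (a sketch: the /10000 divisor is my reading of the garbled "file_size/1" in the archived text, and the 700/16384 clamps are from the same description):

```python
def adapt_block_size(file_len, block_size=None):
    """min(16384, max(700, file_len/10000)), unless set explicitly
    on the command line."""
    if block_size:
        return block_size
    return min(16384, max(700, file_len // 10000))

print(adapt_block_size(500 * 2**20))   # -> 16384: the cap, hit long before 500MB
print(adapt_block_size(1 << 20))       # -> 700: the floor, for small files
```

The cap is reached at about 164MB, which is why the heuristic stops helping well before the 1GB+ files being discussed.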
Re: rsync vs. rcp
On Thu, 2003-02-20 at 13:20, jw schultz wrote: On Thu, Feb 20, 2003 at 01:03:16PM +1100, Donovan Baarda wrote: On Thu, 2003-02-20 at 11:36, Craig Barratt wrote: RSYNC DOES NOT WORK WITH 1GB+ FILES... unless you have a sufficiently large block size. See the following; [...] However, it's not really adequate. A 16K block size only really works for files up to about 500M. Still... that's a lot better than I thought it was at the time. I'd be very open to increasing the upper limit on the dynamic block size but it should be done in balance with increasing (dynamically per-file) the block csum-length. Increasing the block size is kind of a hack to compensate for not being able to increase the csum-length. If you can scale the csum-length, then you don't need to increase the block size to ensure the rsync algorithm works (but you still probably want to do it to keep your signature size and processing overheads down). -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: rsync 1tb+ each day
On Thu, 2003-02-06 at 10:01, Eric Whiting wrote: Kenny Gorman wrote: [...] I tested with a small 256M datafile. rsync -av is showing me about 200kBytes of changes in the datafile between each snapshot. (about 1/1000th of the file has actually changed between the hot backups) Rsync reports the speedup as somewhere between 800 and 1000 just as I would expect. This speedup is a number that accounts only for bytes transferred (or not transferred) -- not real time. When I time the rsync -av runs and compare to a rsync -W the real run times are approximately the same. (similar results if I rm the file on the dest and rsync -av). block checksum and rewrite overhead right? [...] Similar speedup numbers for some 1G database files -- but the real time is 2x when the destination is there and mostly correct. For this type of data it is 'faster' from a time standpoint to remove the destination file before running rsync. Not what I would expect. You are probably seeing 2x because you are running rsync on a 1G file without a sufficiently large block size. When you do this rsync actually transfers the file twice; once using the rsync algorithm, and again sending the whole file because the rsync transfer corrupted the file (as indicated by the final checksum). There is a 99.7% chance the rsync transfer will fail for a 1G file and the default block size. This is because rsync doesn't use sufficiently large signature checksums for large files, which results in corruption of the transferred file. Fortunately rsync detects this using a final checksum, and falls back to a whole file transfer. To benefit (95%+ chance of success) from using rsync on a 1GB file, you need to use a block size of 75kB or greater. For details, see; http://www.mail-archive.com/rsync@lists.samba.org/msg05219.html In any case, you might find that setting larger block sizes improves performance when you have a fast network; large block sizes reduce the CPU overhead at the expense of reduced traffic savings. 
You might find a sweet spot somewhere that is faster than -W for your particular situation. -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
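The 99.7% and 75kB figures above can be sanity-checked numerically. Here is a rough sketch (my own code, not rsync's), assuming the first pass uses a 48-bit per-block signature (4-byte rolling sum plus 2-byte truncated md4) and that in the worst case every byte offset in the file is checked against the block signatures:

```python
def first_pass_failure(file_size, block_size=700, sig_bits=48):
    """Estimate P(first pass corrupts the file) via p = 1 - (1 - c/2^b)^s,
    with c signature blocks, b signature bits, and one check per byte offset."""
    nblocks = file_size // block_size       # c: number of blocks in the signature
    p_clash = nblocks / 2.0 ** sig_bits     # chance a single check falsely matches
    return 1.0 - (1.0 - p_clash) ** file_size

# A 1GB file with the default 700-byte blocks fails almost always,
# while a ~75kB block size brings the failure rate down to a few percent.
print(first_pass_failure(1000 * 2 ** 20))
print(first_pass_failure(1000 * 2 ** 20, 75 * 1024))
```

Under these assumptions the first call comes out above 0.99 and the second around 0.05, matching the figures quoted in the message.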
Re: Proposal that we now create two branches - 2_5 and head
On Wed, 2003-01-29 at 18:22, Craig Barratt wrote: I have several patches that I'm planning to check in soon (I'm waiting to see if we have any post-release tweaking to and/or branching to do). This list is off the top of my head, but I think it is complete: And I have several things I would like to work on and submit: - Fix the MD4 block and file checksums to comply with the rfc (currently MD4 is wrong for blocks of size 64*n, or files longer than 512MB). I'm interested in contributing to this effort too; I have helped fix and optimise the md4sum implementation for librsync. Once the issue of protocol backwards compatibility is resolved, this code could be dropped as is into rsync (up to 20% speed increase over current implementation). However, I also want to change to using the RSA implementation API, and submit this code upstream to libmd (it has an optimised md5sum, but only the RSA md4sum). That way all the projects that used the RSA implementation could benefit from a more optimised md4sum (and can contribute bug fixes etc). - Adaptive first pass checksum lengths: use 3 or more bytes of the MD4 block checksum for big files (instead of 2). This is to avoid almost certain first pass failures on very large files. (The block-size is already adaptive, increasing up to 16K for large files.) [...] But before I work on these I would like to make sure there is interest in including them. IMHO adaptive checksum sizes is critical. People are starting to rsync partition images and the maths shows rsync degenerates really badly as file sizes get that big. -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
RE: Proposal that we now create two branches - 2_5 and head
On Thu, 2003-01-30 at 07:40, Green, Paul wrote: jw schultz [mailto:[EMAIL PROTECTED]] wrote: [general discussion of forthcoming patches removed] All well and good. But the question before this thread is: are the changes big and disruptive enough to make a second branch for the event of a security or other critical bug. Agreed. [...] After reading the arguments, I support the delay-the-branch-till-it-absolutely-must-happen approach... ie don't branch until a bugfix needs to go in to a stable version and HEAD is way too unstable to be released with the fix as the new stable. Quite true. But I'd like to make the point that I think it is worth making the decision to split now. Having two branches will change attitudes. And I think with as large a community of users as rsync clearly has, it is worth changing attitudes. Having a production branch will remind us that we have [...] Actually, a bigger attitude issue for me is having a separate rsync-devel and rsync-user lists. I have almost unsubscribed many times because of the numerous newbie user questions; I'm only interested in the devel stuff. I'm sure there are many users who have unsubscribed because of the numerous long technical posts. -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: rsync protocol description
On Wed, 2002-12-18 at 19:03, Christoph Bartelmus wrote: Donovan Baarda wrote: For that I'd be very much interested in a description of the protocol that rsync talks on port 873. Is such a description available somewhere? I don't want to have to deduce the protocol from the source code. The protocol changes as rsync is updated, and is very evolved. Does that mean that rsync does not maintain downwards compatibility? My former examination did not indicate this. The startup protocol includes the exchange of protocol versions in use. I guess there's a reason for it. Otherwise I could imagine that this could be a steady cause of trouble. rsync does maintain backwards compatibility... that is potentially part of the problem. The new versions check the protocol version and will use old protocols if required. This means rsync has to support all the old various broken forms of the protocol. A small example of this is the md4sum code has a bug that makes md4sums incorrect for large files. This does not break the file transfers in any way because it is consistently incorrect. However, if someone fixed this, it would mean rsync would have to include two md4sum implementations; the correct one, and the broken one for backwards compatibility. If you were to implement your own rsync client, you would either have to reject older versions of the protocol, or implement them all. In any case, you will be constantly playing catchup with rsync. re-implementing and tracking it is not really viable/desirable. What I have found so far are the description of the algorithms (rsync technical report and Andrew Tridgell's PhD thesis). Have I overlooked something obvious? http://freshmeat.net/projects/pysync/ The ftp-server minkirri.apana.org.au gives me a 421 Service not available, remote server has closed connection. Hmmm, that's weird. The logs show nothing wrong, and 8 transfers today and no down time. 
However, when I looked I did notice that I'd screwed up the alternative http access yesterday when I fiddled with my apache configs. If ftp is causing the problems, try; http://minkirri.apana.org.au/pub/python/pysync/ It should be working OK (now :-). -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: rsync protocol description
On Wed, 2002-12-18 at 02:11, Christoph Bartelmus wrote: Hello, I'm currently evaluating the possibility of implementing an rsync client in a project for my company. The platform used is currently not supported and implementing the client from scratch currently seems to be the most feasible approach. You do _not_ want to do this, unless you are prepared to make your client implementation incompatible with the real rsync. If you need real rsync compatibility, then your best bet is to port rsync to the platform and submit your patches upstream. This is your best chance of achieving and maintaining rsync compatibility. If you are prepared to forgo rsync compatibility, you should look at librsync. There are also various prototype implementations like superlifter, pysync, etc. For that I'd be very much interested in a description of the protocol that rsync talks on port 873. Is such a description available somewhere? I don't want to have to deduce the protocol from the source code. The protocol changes as rsync is updated, and is very evolved; re-implementing and tracking it is not really viable/desirable. What I have found so far are the description of the algorithms (rsync technical report and Andrew Tridgell's PhD thesis). Have I overlooked something obvious? http://freshmeat.net/projects/pysync/ is a Python implementation of the algorithm and various variations like xdelta in ~400 lines (mostly documentation)... can be useful for understanding the algorithm or prototyping stuff. ABO -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: rsync to 2000/NT servers?
On Tue, 2002-12-17 at 21:13, John Morgan Salomon wrote: Dr. Poo wrote: Now, can you think of a way to sync the win 2000 OS? (the WHOLE flippin' system) so that if it were to go down one could restore the full installation (bootstraps, bootloader, etc!!?) by means of the rsync'ed backup. please? thank you. ;-) Yeah. Symantec Ghost. Seriously, Ghost is probably the only tool that can do this well at the moment. If you have used NTFS, there is bugger all that can read it, and win2000/NT itself can't even read all of the partition it is running on. I read a review of Ghost that complained that backups/restores actually boot you into DOS to run. This is because win2000/NT locks certain OS files when running. Ghost _has_ to boot into DOS and then use its own NTFS read/write code to access the partition. In theory you could build a similar utility from Linux, but the Linux NTFS read/write code is a bit dodgy. I've even had older versions of Ghost complain because it couldn't always read the NTFS partition (something to do with NTFS internally... sometimes it does/doesn't use some sort of weird compaction or something). The best non-Ghost alternative is to just rsync the whole damn partition. ABO -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: FTP at rsync.samba.org down?
On Mon, Nov 25, 2002 at 11:44:10AM +0100, Michael Schmidt wrote: Hello, just being fascinated by rsync I wanted to look at the distribution files at ftp://rsync.samba.org, but it was not possible to get a ftp connection to that address. Is the ftp service down there? Thanks in advance for any soon helpful hint. If you want an easy way to see how rsync works, have a look at pysync... python implementation in under 400 lines (mostly documentation). Also includes implementations of related algorithms like xdelta. http://www.freshmeat.net/projects/pysync -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: weak checksum question
On Mon, Nov 11, 2002 at 08:06:40PM -0500, Jeff Abrahamson wrote: The weak checksum in checksum.c (see snippet below) differs substantially from the one discussed in Andrew Tridgell's doctoral thesis on rsync and elsewhere that I've been able to find. I didn't find discussion of the change in the mailing list archives. Well, so I'm curious what the benefit of the change is. (I'm using a rolling checksum in another app, thus the reason for this nitty inspection.) Can anyone provide a pointer to help me understand why this changed? There are quite a few variations of the rolling checksum. All that I know of are variants of what I know of as a Fletcher checksum. I believe the rsync documentation describes a straight Fletcher checksum, but also shows how it can be efficiently rolled. I believe rsync uses a straight Fletcher checksum. librsync adds an offset to each byte before it is fed into the Fletcher checksum. xdelta uses a random lookup table before each byte is fed into the Fletcher checksum. Pysync includes an Adler32 implementation in Python, which is the same as a Fletcher except s1 and s2 are mod'ed by the largest 16 bit prime each time. Pysync also includes a C implementation of the librsync version, except implemented using a very fast unrolled macro idea stolen from the adler32 implementation in zlib. The C implementation in pysync is probably the fastest one currently around. I'm unconvinced the extra effort of Adler32's mod-prime, librsync's offset, or xdelta's random lookup actually buy you anything. I have a feeling that they all end up with pretty much the same probability of a checksum clash occurring. This would suggest that a pure Fletcher checksum is the best because it has the lowest overhead, but librsync's offset doesn't hurt much anyway. 
In checksum.c:

    uint32 get_checksum1(char *buf1, int len)
    {
        int i;
        uint32 s1, s2;
        schar *buf = (schar *)buf1;

        s1 = s2 = 0;
        for (i = 0; i < (len-4); i += 4) {
            s2 += 4*(s1 + buf[i]) + 3*buf[i+1] + 2*buf[i+2] + buf[i+3] + 10*CHAR_OFFSET;
            s1 += (buf[i+0] + buf[i+1] + buf[i+2] + buf[i+3] + 4*CHAR_OFFSET);
        }
        for (; i < len; i++) {
            s1 += (buf[i]+CHAR_OFFSET);
            s2 += s1;
        }
        return (s1 & 0xffff) + (s2 << 16);
    }

This elaborate confusion is an attempt to optimise the Fletcher-with-offset variant by processing the input buffer 4 bytes at a time. Unfortunately, this doesn't buy you much because all those multiplies end up costing more than an unravelled series of additions. I posted a patch somewhere that replaced this implementation with the rollsum.c code taken from pysync... If you want to see that implementation have a look at pysync on Freshmeat. What I understood from documentation:

    /* Rolling checksum from Tridgell's PhD thesis. */
    int get_weak_checksum(signed char *buf, int len)
    {
        int i, s;
        int s1, s2;
        int len4;

        s1 = s2 = 0;
        len4 = len / 4;
        for (i = 0; i < len - 4; i += 4) {
            s = (buf[i] << 24) + (buf[i+1] << 16) + (buf[i+2] << 8) + (buf[i+3]);
            s1 += s;
            s2 += (len4 - 4 * i) * s;
        }
        return (s2 << 16) + (s1 & 0xffff);
    }

I think you got the documentation wrong... although maybe I remember it wrong :-) I thought the rsync documentation described this:

    {
        s1 = s2 = 0;
        for (i = 0; i < len; i++) {
            s1 += buf[i];
            s2 += s1;
        }
        return (s2 << 16) | (s1 & 0xffff);
    }

The Fletcher-with-offset variant is exactly the same, it just replaces s1 += buf[i] with s1 += buf[i]+OFFSET. The xdelta version replaces it with s1 += LOOKUP[buf[i]] where LOOKUP is a constant array with 256 random integers. -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
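To make the "rolling" property concrete, here is a minimal Python sketch (my own names, not from rsync or pysync) of the plain Fletcher variant described above, with 16 bits per sum and no CHAR_OFFSET, together with the O(1) update that slides the window one byte instead of recomputing the whole sum:

```python
def fletcher(block):
    """Plain Fletcher sum of a window of bytes, packed as (s2 << 16) | s1."""
    s1 = s2 = 0
    for b in block:
        s1 = (s1 + b) & 0xffff
        s2 = (s2 + s1) & 0xffff
    return (s2 << 16) | s1

def roll(sum32, old, new, blocklen):
    """Update fletcher() for a window that drops `old` and appends `new`."""
    s1, s2 = sum32 & 0xffff, sum32 >> 16
    s1 = (s1 - old + new) & 0xffff
    # old contributed to s2 once per position it was in the window (blocklen
    # times); removing that and adding the new s1 accounts for the new byte.
    s2 = (s2 - blocklen * old + s1) & 0xffff
    return (s2 << 16) | s1

data = bytes(range(256)) * 4
n = 64
csum = fletcher(data[0:n])
for i in range(1, len(data) - n + 1):
    csum = roll(csum, data[i - 1], data[i + n - 1], n)
    assert csum == fletcher(data[i:i + n])
```

Rolling like this is what lets rsync try a block match at every byte offset without the cost of a full checksum per offset.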
Re: offline rsync
On Thu, Nov 07, 2002 at 08:09:42PM +0100, Franco Bagnoli wrote: On Thu, 7 Nov 2002, Donovan Baarda wrote: rdiff, part of librsync. There is also pysync, but it's much slower (but easier to understand/modify). I know I'm sloppy, but the documentation with rdiff is rather poor. Is there a lengthier explanation somewhere? There is a manpage with rdiff that explains how to use it. On Debian systems just install the rdiff deb. Be aware though that the 0.9.5 version of rdiff/librsync has a bug that results in larger than necessary deltas. The CVS for the librsync project on SF has a much better version in it. Moreover, since the synchronization stuff I'm looking for will end into a perl script, has anybody developed an equivalent of pysync in perl? I've found nothing on cpan. I think someone has written a perl interface to librsync... have a search of the rproxy project list archive on SF (the librsync project started as part of the rproxy project). -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: offline rsync
On Wed, Nov 06, 2002 at 10:14:28AM -0600, Franco Bagnoli wrote: hello, I'm new to this list. here is my question: I would like to synchronize two computers (say the home one and the job one) using zip drives or similar (cdroms, etc), since modem lines are quite slow and expensive (in Italy). I thought I could produce the signature of files on the home computer, store it on a zip, go to work, run rsync to copy the missing or altered files onto the disk (possibly in zipped form) and produce the new signature file, and repeat it once at home. I think this could also be seen as a backup system (on cdrom or similar). Is this feasible with rsync? Is there a better approach? rdiff, part of librsync. There is also pysync, but it's much slower (but easier to understand/modify). -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: Optimizations and other questions for rsync
On Wed, Oct 16, 2002 at 04:56:52PM -0700, jw schultz wrote: On Wed, Oct 16, 2002 at 05:00:02PM -0400, Farid Shenassa wrote: 1. is there any computational or disk IO difference between the rsync client and server (the one that does just the checksum on the block, vs the one that does rolling checksum). Given that I do not have as much cpu on the VOS machine, I would like the more expensive side to run on the Windows system. So I need to figure out who should run the daemon and who should push vs. pull. Once the connection is established it doesn't matter whether you push or pull. The difference is who sends and who receives. The receiver bears the brunt of the work. [...] Correction, it is the sender who bears the brunt of the work. It receives the signature, then calculates and sends the delta. Calculating the delta is the CPU-intensive task. This is the reason many archive sites don't like running rsync servers. -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: Problem with checksum failing on large files
On Mon, Oct 14, 2002 at 12:36:36AM -0700, Craig Barratt wrote: craig> My theory is that this is expected behavior given the check sum size. [...] But it just occurred to me that checksum collisions are only part of the story. Things are really a lot worse. Let's assume that the block checksums are unique (so we don't get tripped up from this first problem). Let's also assume that the old file is completely different to the new one, ie: no blocks at any offset really match. So rsync will compare the checksum at every byte offset in the file looking for any match. If there are nBlocks blocks, each check has an nBlocks / 2^48 chance of a false match. Since this test is repeated at every byte offset, the probability that the file has no false matches is:

    p = (1 - nBlocks / (2^48)) ^ fSize

where fSize is the size of the file (more precisely the exponent should be (fSize - 700)). Now for some numbers for 700 byte blocks:

- 100MB file (104857600 bytes). nBlocks = 149797. p = 0.945.
- 500MB file (524288000 bytes). nBlocks = 748983. p = 0.248.
- 1000MB file (1048576000 bytes). nBlocks = 1497966. p = 0.003.

So, on average, if you have a random new 1GB file and a random old 1GB file, and you rsync them, the 1st phase will fail 99.7% of the time. Someone could test this theory: generate two random 500MB files and rsync them. Try it a few times. I claim that on average the first pass will fail around 75% of the time. Things get a lot better when the files are very similar. For each block that matches, rsync skips the whole block (eg, 700 bytes) before it starts looking for matching checksums. So for a file that is identical it only does nBlock checks, not fSize checks (700 times fewer). Wow... scary. 
This can be generalised to;

    p = 1 - (1 - c/(2^b))^s

where:
  p is the probability of a false match
  c is the number of blocks in the signature
  b is the number of bits in the checksum
  s is the number of different bytes

In conclusion, a blocksize of 700 with the current 48-bit signature blocksum has an unacceptable failure rate (>5%) for any file larger than 100M, unless the file being synced is almost identical. Increasing the blocksize will help, with the following minimum sizes being recommended for a 5% failure rate;

    file    block
    100M      1K
    200M      3K
    400M     12K
    800M     48K
    1G       75K
    2G      300K
    4G      1.2M

Note that the required block size is growing faster than the file size is, so the number of blocks in the signature is shrinking as the file grows. We absolutely need to increase the signature checksum size as the filesize increases. If my new hypothesis is correct we definitely need to increase the size of the first-pass checksum for files bigger than maybe 50MB. Does the first pass signature block checksum really only use 2 bytes of the md4sum? That seems pretty damn small to me. For 100M~1G you need at least 56 bits, for 1G~10G you need 64 bits. If you go above 10G you need more than 64 bits, but you should probably increase the block size as well/instead. -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
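The block-size table can be reproduced, roughly, from the formula just given. Linearising p = 1 - (1 - c/2^b)^s for small c/2^b gives p ≈ s*c/2^b; taking the worst case s = filesize, with c = filesize/blocksize and b = 48, and solving for the smallest acceptable block size is a one-line calculation. This is a sketch under those assumptions, not code from rsync:

```python
def min_block_size(file_size, max_fail=0.05, sig_bits=48):
    # p ~= file_size * (file_size / block_size) / 2^b  <=  max_fail
    # => block_size >= file_size^2 / (max_fail * 2^b)
    return file_size ** 2 / (max_fail * 2.0 ** sig_bits)

for mb in (100, 200, 400, 800, 1024, 2048, 4096):
    print("%5dM %8.1fK" % (mb, min_block_size(mb * 2 ** 20) / 1024.0))
```

The results land in the same ballpark as the table (for example, about 80K for a 1G file against the table's 75K), with the small differences down to rounding in the table.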
Re: Problem with checksum failing on large files
On Mon, Oct 14, 2002 at 06:22:36AM -0700, jw schultz wrote: On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote: [...] Does the first pass signature block checksum really only use 2 bytes of the md4sum? That seems pretty damn small to me. For 100M~1G you need at least 56 bits, for 1G~10G you need 64 bits. If you go above 10G you need more than 64 bits, but you should probably increase the block size as well/instead. It is worth remembering that increasing the block size with a fixed checksum size increases the likelihood of two unequal blocks having the same checksums. I haven't done the maths, but I think the difference this makes is negligible, and is far outweighed by the fact that a larger block size means fewer blocks. I think we want both the block and checksum sizes to increase with file size. Just increasing block size gains diminishing returns but just increasing checksum size will cause a non-linear increase in bandwidth requirement. Increasing both in tandem is appropriate. Larger files call for larger blocks and larger blocks deserve larger checksums. I do think we want a ceiling on block size unless we layer the algorithm. The idea of transmitting 300K because a 4K block in a 2GB DB file was modified is unsettling. I think that command-line overrides are critical. Just as you can force a blocksize, you should be able to force a sigsumsize. However, the defaults should be reasonable. BTW, the suggestions of using -c earlier in this thread ... from my reading of the man page, -c will only help if the files happen to be identical... And the reason rsync does a second pass that succeeds is it transmits the whole file if a final checksum of the file doesn't match after using the rsync algorithm. Someone else suggested using a locality limit to minimize the candidate blocks, and hence minimize the chance of errors... 
I think a locality heuristic might be good for speeding up finding a match, but a locality limit is no good because it sacrifices the chance of finding distant matches. Part of rsync's benefits is it can handle relocated data, and a locality limit breaks that. Note for rsync2 or superlifter: We may want to layer the algorithm so that large files get a first pass with large blocks but modified blocks are accomplished with a second pass using smaller blocks. ie. 2GB file is checked with 500KB blocks and a 500KB block that changed is checked with 700B so rsyncing the file would be almost like rsyncing a directory with -c. Layered algorithms with variable block sizes are tricky. At first glance it seems like a good idea, but when you go to implement it, it gets all messy. The big problem is missed data is not block aligned, and to handle re-location of data, you need to pretty much transmit a small blocksize signature of the whole file... by the time you do that you might as well have just used a smaller blocksize. Every time I think of layer/variable block tricks I think of another way that it might work, but every time I pursue them, they hit problems... Another one that just sprang to mind; combining it with a client-side delta technique;

1) client sets blocksize to something large.
2) client requests signature from server for unknown parts of target file (ie the whole file on the first pass, large unknown blocks on subsequent passes)
3) server sends requested signature.
4) client rolls through basis, searching for matching data, and populates a sparse new file with the hits.
5) client reduces blocksize (by half? by quarter?).
6) while blocksize is larger than some minimum, goto 2)
7) client requests remaining missing blocks and completes sparse new file.

This could work... it could also have serious problems when you go to implement it... 
-- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: Problem with checksum failing on large files
On Mon, Oct 14, 2002 at 04:50:27PM -0700, jw schultz wrote: On Tue, Oct 15, 2002 at 02:25:00AM +1000, Donovan Baarda wrote: On Mon, Oct 14, 2002 at 06:22:36AM -0700, jw schultz wrote: On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote: [...] Does the first pass signature block checksum really only use 2 bytes of the md4sum? That seems pretty damn small to me. For 100M~1G you need at least 56 bits, for 1G~10G you need 64 bits. If you go above 10G you need more than 64 bits, but you should probably increase the block size as well/instead. It is worth remembering that increasing the block size with a fixed checksum size increases the likelihood of two unequal blocks having the same checksums. I haven't done the maths, but I think the difference this makes is negligible, and is far outweighed by the fact that a larger block size means fewer blocks. We've just seen one face of the checksum undersize. This problem is after all because we have unequal blocks that have the same (truncated) checksums. That is with 700 bytes being compressed to a 4 byte checksum. Increasing the block size without increasing the checksum size increases the chance of unequal blocks having the same checksum. Yeah, but I think when you do the maths, the variation this causes tapers off to nothing as the block size gets significantly large compared to the checksum. All the maths we have been doing so far uses the simplifying assumption that the block size is infinitely large compared to the checksum, so the probability of a checksum clash is based entirely on the checksum size. But I haven't done the maths, so feel free to go through it and prove otherwise :-) What the heck... here it is;

    p = 1/2^b - 1/2^n

where:
  p is the probability that any two blocks have the same checksum, but are actually different
  b is the number of bits in the checksum
  n is the number of bits in a block

The first term is the probability any two random blocks have the same checksum. 
The second term is the probability that any two random blocks are actually the same. As you can see, as b -> n, p -> 0, but as n gets significantly larger than b, it has less and less effect. We are talking relative probability changes in the order of 1/2^1000. -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
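The claim that the 1/2^n correction is negligible is easy to verify with exact rational arithmetic. A small sketch (my own code) for a 48-bit checksum and a 700-byte (5600-bit) block, where dropping the second term changes the clash probability by one part in 2^5552 - 1:

```python
from fractions import Fraction

b, n = 48, 700 * 8                         # checksum bits, block bits
p_exact = Fraction(1, 2 ** b) - Fraction(1, 2 ** n)
p_approx = Fraction(1, 2 ** b)             # drop the 1/2^n correction term
relative_error = (p_approx - p_exact) / p_exact

# (1/2^n) / ((2^(n-b) - 1)/2^n) simplifies to 1/(2^(n-b) - 1)
assert relative_error == Fraction(1, 2 ** (n - b) - 1)
```

So for any realistic block size the clash probability really is determined by the checksum size alone.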
Re: Problem with checksum failing on large files
On Sat, Oct 12, 2002 at 11:13:50AM -0700, Derek Simkowiak wrote: My theory is that this is expected behavior given the check sum size. Craig, Excellent analysis! I was a bit concerned about his maths at first, but I did it myself from scratch using a different approach and got the same figures... for those who are interested, an OK approximation for the maths turns out to be;

    p = (c^2)/2^(b+1)

where:
  p is probability of collision
  c is number of blocks
  b is number of bits in checksum

Provided c is significantly less than 2^b. As c approaches 2^b, p becomes too large using this approx (when c=2^b, p should be 1). Assuming your hypothesis is correct, I like the adaptive checksum idea. But how much extra processor overhead is there with a larger checksum bit size? Is it worth the extra code and testing to use an adaptive algorithm? There is no extra processor overhead, because the full md4sums are currently calculated anyway, before they are truncated down to minimise the signature size. I'd be more inclined to say This ain't the 90's anymore, realize that overall filesizes have increased (MP3, MS-Office, CD-R .iso, and DV) and that people are moving from dialup to DSL/Cable, and then make either the default (a) initial checksum size, or (b) block size, a bit larger. librsync has options to specify the strong sum size, though I think it currently just uses the full md4sum size. pysync also uses the full md4sum size. I'm not sure what the best approach is, but there should be a relationship between block size, signature size, and file size. -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
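The approximation can be checked against the exact birthday-problem product. A small sketch (my own code; the tiny 16-bit checksum is deliberately unrealistic so the probabilities stay visible):

```python
def p_exact(c, b):
    """Exact P(some two of c random b-bit checksums collide)."""
    p_all_distinct = 1.0
    for i in range(c):
        # i-th checksum must avoid the i values already drawn
        p_all_distinct *= 1.0 - i / 2.0 ** b
    return 1.0 - p_all_distinct

def p_approx(c, b):
    """The c^2/2^(b+1) approximation from the message above."""
    return c ** 2 / 2.0 ** (b + 1)

# For c = 50 blocks and a 16-bit checksum the two closely agree.
print(p_exact(50, 16), p_approx(50, 16))
```

As the message notes, the approximation only holds while c is well below 2^b; it drifts high as c grows because it counts c^2 ordered pairs rather than c(c-1)/2 distinct ones and ignores higher-order terms.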
Re: Problem with checksum failing on large files
On Sat, Oct 12, 2002 at 07:29:36PM -0700, jw schultz wrote: On Sat, Oct 12, 2002 at 11:13:50AM -0700, Derek Simkowiak wrote: My theory is that this is expected behavior given the check sum size. Craig, Excellent analysis! Assuming your hypothesis is correct, I like the adaptive checksum idea. But how much extra processor overhead is there with a larger checksum bit size? Is it worth the extra code and testing to use an adaptive algorithm? I'd be more inclined to say This ain't the 90's anymore, realize that overall filesizes have increased (MP3, MS-Office, CD-R .iso, and DV) and that people are moving from dialup to DSL/Cable, and then make either the default (a) initial checksum size, or (b) block size, a bit larger. I lean toward making the block-size adaptive. Perhaps something on the order of 700 (filesize / 15000) 8KB Maybe both checksum size and block-size should be adaptive so that both track the file size so large files have larger but fewer checksums (compared to current defaults) and small files retain their current advantages. The ideal checksum size is totally dependent on the number of blocks. If you scale your blocksize so that the number of blocks is kept low enough, you can stick with the fixed checksum size. Any change like this will require a protocol bump so should include the MD4 sum corrections as well. Does it really need a protocol bump? The protocol already supports adaptive block sizes, so provided the default block size increases at a suitable rate based on the file size, that's all that is needed. It would be nice to also force a larger checksum size, but that probably would require a protocol change... (I'm more familiar with librsync than rsync so I could be wrong). -- -- ABO: finger [EMAIL PROTECTED] for more info, including pgp key -- -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
Re: Problem with checksum failing on large files
On Fri, Oct 11, 2002 at 03:26:45PM -0700, Terry Reed wrote: -Original Message- From: Derek Simkowiak [mailto:dereks;itsite.com] Sent: Friday, October 11, 2002 1:51 PM To: Terry Reed Cc: '[EMAIL PROTECTED]' Subject: Re: Problem with checksum failing on large files

I'm having a problem with large files being rsync'd twice because of the checksum failing. I think this was reported recently. Please try using the -c option (always checksum) and see if that makes the problem go away. This is a high priority bug for me (although I have not yet experienced it). --Derek

Using -c helps for the smallest file (900 MB), but has no effect on the larger files (e.g., 2.7 GB). Most of my files are between 1.5 GB and 3 GB.

I wonder if this is related to the rsync md4sum not producing correct md4sums for files larger than 512M? The existing rsync md4sum implementation does not produce md4sums the same as the RSA implementation for files larger than 512M... but I thought it was consistent with itself so this didn't affect anything. I posted a fixed md4sum implementation, but it would have required a new protocol version number and various hacks for backwards compatibility with the old implementation before it could be used in rsync. Has someone rolled in a fixed md4sum implementation to rsync without checking to make it backwards compatible? If so, then ensuring you have the same version of rsync at both ends would avoid the problem. If not, then I dunno :-)

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
Re: MD4 bug in rsync for lengths = 64 * n
On Mon, Sep 02, 2002 at 12:44:48AM -0700, Craig Barratt wrote: [...]

I can submit patches if required for the md4 code as tweaked/fixed for librsync. The fixed code is faster as well as correct :-)

Sure, that would be great. Otherwise, I would be happy to recreate and test a patch.

I have attached my latest versions of the librsync mdfour code. This code is not yet available anywhere except here, but I will be posting it soon as a patch on the librsync project page at SF. This is the final refinement of this code I plan to do before migrating the API to match the RSA implementation. It includes bugfixes for sums larger than 512MB, and produces identical results to the RSA implementation. It is much faster than the RSA implementation or the previous samba/rsync derived implementation, and is much cleaner than the last optimisations submitted to librsync. I'm not exactly sure how much faster it is; rough tests using librsync show reductions of more than 20% in signature generation times, but I suspect some of that is because I have also optimised the rollsums. If the rsync guys want to take advantage of the rollsum optimisations, I have encapsulated them into a neat standalone rollsum.[ch] implementation. Let me know if you want them.

BTW, I have also merged my librsync delta refactor changes with the latest SF librsync CVS and will be submitting a smaller cleaner patch. Are we still using the rproxy-devel list for librsync, or is there another list somewhere?

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--

/*= -*- c-basic-offset: 4; indent-tabs-mode: nil; -*-
 *
 * librsync -- the library for network deltas
 * $Id$
 *
 * Copyright (C) 2000, 2001 by Martin Pool [EMAIL PROTECTED]
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public License
 * as published by the Free Software Foundation; either version 2.1 of
 * the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#include "types.h"

struct rs_mdfour {
    int             A, B, C, D;
#if HAVE_UINT64
    uint64_t        totalN;
#else
    uint32_t        totalN_hi, totalN_lo;
#endif
    int             tail_len;
    unsigned char   tail[64];
};

/*= -*- c-basic-offset: 4; indent-tabs-mode: nil; -*-
 *
 * librsync -- the library for network deltas
 * $Id: mdfour.c 1.3 Thu, 27 Jun 2002 15:48:38 +1000 abo $
 *
 * Copyright (C) 2000, 2001 by Martin Pool [EMAIL PROTECTED]
 * Copyright (C) 1997-1999 by Andrew Tridgell
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

/* MD4 message digest algorithm.
 *
 * TODO: Perhaps use the MD4 routine from OpenSSL if it's installed.
 * It's probably not worth the trouble.
 *
 * This was originally written by Andrew Tridgell for use in Samba.
 * It was then modified by;
 *
 * 2002-06-xx: Robert Weber [EMAIL PROTECTED]
 *     optimisations and fixed >512M support.
 *
 * 2002-06-27: Donovan Baarda [EMAIL PROTECTED]
 *     further optimisations and cleanups.
 */

#include <config.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#include "rsync.h"
#include "trace.h"
#include "types.h"
#include "mdfour.h"

#define F(X,Y,Z) (((X)&(Y)) | ((~(X))&(Z)))
#define G(X,Y,Z) (((X)&(Y)) | ((X)&(Z)) | ((Y)&(Z)))
#define H(X,Y,Z) ((X)^(Y)^(Z))
#define lshift(x,s) (((x)<<(s)) | ((x)>>(32-(s))))

#define ROUND1(a,b,c,d,k,s) a = lshift(a + F(b,c,d) + X[k], s)
#define ROUND2(a,b,c,d,k,s) a = lshift(a + G(b,c,d) + X[k] + 0x5A827999, s)
#define ROUND3(a,b,c,d,k,s) a = lshift(a + H(b,c,d) + X[k] + 0x6ED9EBA1, s)

/** padding data used for finalising */
static unsigned char PADDING[64] = {
    0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0
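For readers less fluent in C macro soup, the F/G/H functions and the 32-bit rotate used in the attached mdfour code transcribe directly into Python. This is a sketch for illustration only, not the attached implementation (Python needs an explicit 32-bit mask where C relies on uint32 wraparound):

```python
MASK = 0xFFFFFFFF  # MD4 operates on 32-bit words

def F(x, y, z):
    # selection: each bit of x chooses between the bits of y and z
    return ((x & y) | (~x & z)) & MASK

def G(x, y, z):
    # majority: a bit is set iff at least two of the inputs have it set
    return ((x & y) | (x & z) | (y & z)) & MASK

def H(x, y, z):
    # parity of the three inputs
    return (x ^ y ^ z) & MASK

def lshift(x, s):
    # 32-bit left rotation; the C macro relies on unsigned overflow
    return ((x << s) | (x >> (32 - s))) & MASK
```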
Re: MD4 bug in rsync for lengths = 64 * n
Quoting Dave Dykstra [EMAIL PROTECTED]:

Donovan or Craig, can you please post it as a patch to the current rsync source (including an increase in the protocol version and checks to do the old behavior when talking to an old version client).

I'm not the person to do this, as I don't know the rsync code and protocol version stuff that well. I've been working on librsync instead. (handballs to Craig :-)

I'm not sure what the best way of implementing the old behavior for talking to old versions would be. Perhaps implementing a broken finalise in a separate mdfour-old.c would be best. Then the mdfour.[ch] can be common for librsync, rsync, and perhaps samba, without including old broken cruft just for rsync.

Probably we wouldn't want to incorporate it until a major release anyway (2.6.0), but if it isn't developed soon and put into the patches directory it's likely to be forgotten.

I've put the patches against librsync up on the librsync SF page. After I migrate it and librsync to the RSA implementation API I'll be submitting it to libmd. It would be nice if we had standard md4sum code that every project could use/debug/fix.

-- ABO
Re: MD4 bug in rsync for lengths = 64 * n
On Thu, Aug 29, 2002 at 03:17:34PM -0500, Dave Dykstra wrote: On Sun, Aug 04, 2002 at 01:19:43PM -0700, Craig Barratt wrote: [...]

However, as I'm sure is well-known, the Adler crc32 and MD4 computed by librsync don't match those in rsync 2.5.5.

I do not recall hearing anybody mention that before.

I thought that the librsync and rsync delta etc. were so radically different that you could never get them talking... I assume that this is all going towards making librsync and rsync talk, which would be nice.

After swapping the crc32 endianness, changing RS_CHAR_OFFSET from 31 to 0, and adding rsync's checksum_seed to librsync's MD4, they now agree, except in one case: when the total data length (including the 4 byte checksum_seed) is a multiple of 64, the MD4 checksums in librsync and rsync don't agree. After reviewing the code in librsync, rsync and the original RSA implementation, I believe the bug is in rsync. It doesn't call mdfour_tail() when the last fragment is empty. Unfortunately this happens in the particularly common case of 700 + 4 = 64 * 11. The same bug occurs in both the block MD4s and the entire-file MD4.

This is the first detailed description of the problem I've seen. I've heard it mentioned several times before, and thought that the md4 code in librsync was the same as in rsync. I've looked at and tweaked the md4 code in librsync and could never see the bug, so I thought it was a myth. I also thought that samba used this code; I wonder what variant it is using :-)

The bug is benign in the sense that it is on both sides so rsync works correctly. But it is possible (I am certainly not a crypto expert) that missing the trailing block (that includes the bit length) substantially weakens the computed MD4. The fix is easy: a couple of checks should be >=. I can send diffs if you want. But of course this can't be rolled in unless it is coupled with a bump in the protocol version.

Another bump in the protocol version is no problem. Please submit a patch.
I can submit patches if required for the md4 code as tweaked/fixed for librsync. The fixed code is faster as well as correct :-)

I saw some earlier email about fixing MD4 to handle files >= 512MB (I presume this relates to the 64-bit bit count in the final block). Perhaps this change can be made at the same time?

Could you please post a reference to that email? It isn't familiar to me and I didn't find it through google. There have been other problems we've been seeing with the end of large files and zlib compression, though. I wonder if it can somehow be related.

It may not have been on the rsync list, but on the librsync list... Please note that there are several variants of the md4 patch floating around. I've been meaning to separate the latest md4 patch from my bigger librsync delta refactor patch for some time.

One of my goals for librsync is to make it use the same md4 API as the RSA code, as this seems to be more widely used and would allow drop-in use of the RSA implementation for verification. I'm planning to make the API changes then submit this implementation to libmd. Before the API change I'll release a final bugfix/optimize version using the old API.

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
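The "checks should be >=" fix matters because MD4 always appends a final padded block, even when the message length is an exact multiple of 64; skipping finalisation on an empty tail silently drops that block and the bit-length it carries. A sketch of the padding rule per RFC 1320 (this is not the rsync or librsync implementation, just the arithmetic):

```python
def md4_padding(msg_len):
    """MD4 padding for a msg_len-byte message: one 0x80 byte, zero
    bytes, then the 64-bit little-endian bit count, bringing the
    total length to a multiple of 64."""
    pad_len = (55 - msg_len) % 64 + 1  # always at least the 0x80 byte
    return bytes([0x80] + [0] * (pad_len - 1)) + (msg_len * 8).to_bytes(8, "little")
```

Note that for msg_len = 704 (the 700-byte block plus 4-byte seed case from this thread) the padding is a full extra 64-byte block, which is exactly the block the buggy finalisation skipped.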
Re: new rsync release needed soon?
On Wed, Jul 31, 2002 at 12:29:31PM -0600, Robert Weber wrote:

On the subject of needed patches, I just recently completed a patch for librsync that fixed the mdfour code to have uint_64 or 2 uint_32's for size. Without this, the checksums on files > 512Megs are incorrect. The code should drop into rsync without a hitch. The same goes for mdfour in samba, because that is where librsync got the code from anyway.

Actually... I further refined and cleaned this patch and it's in my librsync delta refactor patch. I believe this patch also offers a significant performance increase... I can extract out just the mdfour changes into a separate patch and submit it, just tell me where you want it.

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
Re: new rsync release needed soon?
On Wed, Jul 31, 2002 at 04:28:55PM -0700, Wayne Davison wrote: On Wed, 31 Jul 2002, Robert Weber wrote:

On the subject of needed patches, I just recently completed a patch for librsync that fixed the mdfour code to have uint_64 or 2 uint_32's for size. Without this, the checksums on files > 512Megs are incorrect.

In order to interoperate with older versions of rsync, wouldn't we need to continue to generate the incorrect checksums on all but the newest (freshly bumped up) protocol number? ..wayne..

I'm not sure... it probably needs more analysis. rsync primarily uses mdfour sums as the big checksums on blocks in the signature. I think it highly unlikely that blocksizes of greater than 512M are used, so this change wouldn't affect that.

However, it looks like rsync also uses mdfour for its whole-file checksums when using the -c option. In this case, any file larger than 512M would end up with a different whole-file sum, causing it to do a full signature/delta/patch when it is not needed. However, the resulting delta would be very small, so this might be tolerable.

The biggest worry is I think rsync also does whole-file sums for any signature/delta/patch to double (triple?) check the final result. If this is the case, then yes, fixing mdfour will break old rsync'ing of files greater than 512M.

If this is really an issue, I can cripple the patch with one line so that it conforms with the old mdfour code. This would give you the performance benefits, and a single line change to uncripple it further down the track. Alternatively I could add a modified finalise function to give crippled sums, allowing the application software to pick what kind of sum it wants.

Someone on the librsync lists mentioned a while back that some rsync or samba person had identified a bug in the mdfour finalise routines and had a fix for it. I'm not sure if this 512M limit was it. I'd be interested to know if/how this affects samba.
--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
Re: timestamp on symlink
On Mon, Jul 29, 2002 at 12:40:56PM +1000, Martin Pool wrote: On 28 Jul 2002, Michael Wang [EMAIL PROTECTED] wrote:

rsync does not sync the timestamp on symlinks (Solaris 8). It is probably due to the limitation of the Unix implementation of symlinks, but I would like to know why rsync/Unix does not do this, and what we can do about it. Is the conclusion that rsync syncs everything except the timestamp on symlinks?

I don't think it is possible to set the time of a symbolic link. A quick experiment on Linux seems to confirm that:

!891 12:36 ~% python2.2
Python 2.2.1 (#1, May 3 2002, 23:19:03)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.symlink('nothere', '/tmp/mylink')
>>> os.utime('/tmp/mylink', None)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OSError: [Errno 2] No such file or directory: '/tmp/mylink'

This is because most of python's os.xxx methods de-reference symlinks. You get this error because 'nothere' doesn't exist. The correct way to get time info on symlinks is to use os.lstat(), which doesn't de-reference links.

Python 2.1.3 (#1, Apr 20 2002, 10:14:34)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import os
>>> os.symlink('nothere','mylink')
>>> os.lstat('mylink')
(41471, 64782L, 10L, 1, 1000, 1000, 7L, 1027913015, 1027913005, 1027913005)

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
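The distinction is easy to demonstrate with a self-contained sketch on any POSIX system: os.lstat() examines the link itself, while os.stat() (and os.utime() in the session above) follows it, failing on a dangling link:

```python
import os
import stat
import tempfile

# Create a dangling symlink in a scratch directory.
scratch = tempfile.mkdtemp()
link = os.path.join(scratch, "mylink")
os.symlink("nothere", link)        # the target does not exist

info = os.lstat(link)              # fine: examines the link itself
is_link = stat.S_ISLNK(info.st_mode)

try:
    os.stat(link)                  # follows the dangling link -> ENOENT
    followed = True
except OSError:
    followed = False
```

Here `is_link` is True and `followed` is False: the lstat call succeeds on the link itself even though its target is missing.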
Re: Block size optimization - let rsync find the optimal blocksize by itself.
On Tue, Jul 02, 2002 at 11:06:30AM +0200, Olivier Lachambre wrote: At 09:19 01/07/2002 +1000, you wrote: [...]

This relies on optimal block size being related for a set of files. I'm not sure that this heuristic actually applies, and I don't know how much benefit this would buy for all the complexity it would add.

I think that many clients do not care about multiplying complexity by 2 or 5, even if the speedup rate is only multiplied by 1.1.

The only problem is, multiplying complexity by 2 or 5 multiplies the bugs by 2 or 5. Are those clients still excited by the 1.1x speedup now? The other problem with increasing complexity is the setting effect that complexity causes. The more complex code becomes, the harder it gets to modify, so the 1.1x speedup that introduced 5x complexity now makes it impossible to implement a much better 1.5x speedup. In my experience, complex solutions usually have worse performance than simple ones anyway. Occam's Razor rules :-)

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
Re: Block size optimization - let rsync find the optimal blocksize by itself.
On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote: [...]

Well, the first comment: during my work I wanted to verify that the theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his PhD thesis was actually the right one, and when doing the tests on randomly generated modified files I discovered that sqrt(78*n/Q) is the actual optimal block size. I tried to understand this by reading all the thesis, then quite a lot of documentation about rsync, but I just can't figure out why the theoretical and experimental optimal block sizes don't match. I _really_ don't think it's coming from my tests; there must be something else. Maybe the rsync developers have just changed some part of the algorithm. And also, even without using data compression during the sync, rsync is always more efficient than it should be theoretically, actually between 1.5 and 2 times more efficient. Nobody will complain about that, but I'd be happy if someone would be nice enough to explain this to me.

I believe that the compression option turns on compression of transferred data, including file lists, instruction streams etc. Even with compression turned off, the miss data in the delta is still compressed. Another thing to be aware of is when rsync compresses miss data, it also compresses all the hit data to prime the compressor with context, and throws away the compressed output for the hit data. Perhaps this context compression is affecting the optimal block size?

Now the auto-optimization algorithm when updating many files at a time. Let's consider a set of files to be updated. We will consider only the files which have been changed since the last update (e.g. we can find the other ones by sending an MD5 sum for each file and trying to match it). We sync the first file, but the client keeps the old local version and can find how many differences there are between the two files and then guess the optimal block size.
We assume that the percentage of differences between the files is roughly the same across a set of files. So we use for the second file the optimal size found for the first file. Then for the third file we use the (arithmetic or geometric?) average of the first two files, and so on... Once we have synced a certain number of files (10? 100?) we always use the same size, which is supposed to be the best one. Sorry this is so long; hope you'll understand everything, Olivier

This relies on optimal block size being related for a set of files. I'm not sure that this heuristic actually applies, and I don't know how much benefit this would buy for all the complexity it would add.

P.S. I am not a programmer of any kind so don't wait for me to write any line of C (I know I'm a bad boy).

It might be an interesting experiment to write a wrapper for rsync that accumulated stats on all transfers it was used for and learned the optimal block size to use. This would not require any changes to rsync, and could be written in something like Python.

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
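The thesis expression and the experimental fit discussed above differ only in the constant, so they are directly comparable. A sketch, where n is the file size in bytes and Q the number of changed bytes; the constants 24 and 78 are the two figures quoted in this thread:

```python
import math

def optimal_block_size(n, q, c=24):
    """Optimal block size sqrt(c * n / Q): c = 24 per Tridgell's
    thesis, c = 78 per Olivier's experimental fit (both from this
    thread)."""
    return math.sqrt(c * n / q)
```

The experimental fit is only sqrt(78/24) ≈ 1.8 times larger than the theoretical one, so the disagreement is about the constant factor, not the square-root scaling itself.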
Re: Release 3 of rzync new-protocol test
On Fri, Jun 21, 2002 at 03:46:39AM -0700, Wayne Davison wrote:

The count of transferred bytes in the latest protocol is now below what rsync sends for many commands -- both a start-from-scratch update and a fully-up-to-date update are usually smaller, for instance. This is mainly because my file-list data is smaller, but it's also because I reduced the protocol overhead quite a bit. Transferred bytes for partially-changed files are still bigger than rsync because librsync creates unusually large delta sizes (though there's a patch that makes it work much better, it's still not as good as rsync).

I believe that the remaining difference is rsync does context compression using zlib. I believe librsync does no compression at all yet. Even if you zlib compress librsync's deltas, they will still be bigger than rsync's because of the context it uses... it compresses the whole file, hits and misses, but only sends the compressed output for the misses. This means the compressor is primed with data from the hits.

I think that the best solution for this is to do what xdelta is planning to do... toss zlib and include target references as well as source references in the delta instruction stream; do the compression yourself. One way to do this is to implement xdelta-style non-block-aligned matches against the target, building a rollsum hash-tree as you go through it, and run it alongside the rsync block match algorithm. However, this might not work well in practice...

In my speed testing, one test was sending around 8.5 meg of data on a local system, and while rsync took only 0.5 seconds, my rzync app took around 2 seconds. A quick gprof run reveals that 98% of the runtime is being spent in 2 librsync routines, so it looks like librsync needs to be optimized a bit.

One potential next step might be optimizing rsync to make the transferred file-list size a little smaller (e.g.
making the transfer of the size attribute only as long as needed to store the number would save ~4-5 bytes per file entry on typical files).

It looks like work needs to be done on making librsync more efficient. I'm going to get onto this after this weekend. I know what needs to be done... I just need the time to do it.

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
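The rollsum mentioned throughout these threads is rsync's weak rolling checksum. The idea can be sketched as two 16-bit running sums that slide along the file one byte at a time in O(1); this is a simplified illustration (rsync's real checksum also adds a character offset to each byte, omitted here):

```python
# s1 is the sum of the bytes in the window; s2 is the sum of the
# running sums, i.e. each byte weighted by its distance from the
# window's end. Both can be updated in O(1) when the window slides.

def weak_sum(block):
    s1 = sum(block) & 0xFFFF
    s2 = sum((len(block) - i) * b for i, b in enumerate(block)) & 0xFFFF
    return (s2 << 16) | s1

def roll(s, old, new, blocklen):
    """Slide the window one byte: drop `old`, take in `new`."""
    s1 = s & 0xFFFF
    s2 = (s >> 16) & 0xFFFF
    s1 = (s1 - old + new) & 0xFFFF
    s2 = (s2 - blocklen * old + s1) & 0xFFFF
    return (s2 << 16) | s1

# Demonstration: rolling reproduces the directly computed checksum.
data = bytes(range(100))
s = weak_sum(data[0:16])
s = roll(s, data[0], data[16], 16)   # window is now data[1:17]
```

The O(1) roll is what makes it affordable to test a match at every byte offset, with the strong (MD4) checksum only computed on weak-sum hits.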
Re: Strong encryption
On Wed, Jun 05, 2002 at 03:33:23AM -0700, 'jw schultz' wrote: On Wed, Jun 05, 2002 at 12:21:18PM +0200, C.Zimmermann wrote:

If you want them stored on the destination encrypted you

Yes, that's it. The owner of the source files will be sure that no one else can read his files on the destination host. I thought rsync only looks at the modification date of a file and decides whether to back up this file or not. In this case, the backup could be stored encrypted.

Rsync can handle encrypted files just fine. It just treats them as ordinary binary files. If the owner of the files encrypts them on the source they will be encrypted on the destination. As you have said, rsync normally just looks at the modification date for deciding whether to update the destination (unless you use the -c option). But, unless the -w option is used, rsync will use some rather clever (nod to rsync developers) methods to transfer only the changed parts of changed files. It is this feature that gives rsync its speed. My comments below are to advise you that that clever feature is nullified by encrypted files. In fact for encrypted files rsync would be sub-optimal. If most/all of the changed files are encrypted I would use the -w option.

Perhaps he wants a gpg-rsyncable patch to gpg? Seriously, such a thing would be possible. After the long thread in which I made a dick of myself discussing how gzip-rsyncable couldn't possibly work, only to have someone explain how it does, I now know how it could be done: encrypt the data in chunks, where the chunk boundaries are determined by the same heuristic gzip-rsyncable uses. I think I could even whip up a working Python wrapper around gpg in a day or so that could do the job...
but I'm too busy looking for paid work right now to do it, so I'll leave it as an exercise for others :-)

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
Re: Compressed backup
On Tue, Jun 04, 2002 at 05:43:17PM +1000, Kevin Easton wrote: [...] If you'll indulge me, I'll just restate the problem (as I see it, anyway) before chiming in with my idea... [snip big description of why gzip-rsyncable actually does work]

Thanks for the description of how gzip-rsyncable actually works. I should learn to do some more research before shooting my mouth off. I must have sounded pretty clueless... the heuristic reset idea is brilliant.

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
Re: Compressed backup
On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote: On Fri, May 31, 2002 at 11:45:43AM +1000, Donovan Baarda wrote: On Thu, May 30, 2002 at 03:35:05PM -0700, jw schultz wrote: [...]

I would guess that the number of changes meeting this criteria would be almost non-existent. I suspect that the gzip-rsyncable patch does nearly nothing except produce worse compression. It _might_ slightly increase the rsyncability up to the point where the first change in the uncompressed text occurs, but the chance of it re-syncing after that point would be extremely low.

Actually many file modifications do just fine. The key is to recognize that any plaintext modification will alter the compresstext from that point to the end. Most content modifications alter the blocks nearest the end of the file. Think about how you edit text and word-processor documents.

So it is not possible for rsync to get any matches on a gzip-rsyncable compressed file after the first modification. Does the gzip-rsyncable patch actually improve the rsyncability of compressed files at all? AFAICT, files compressed normally should be pretty much rsyncable up to the same point. Resetting the compression every 4K probably does allow you to rsync closer up to that point, but only because the resets make the compression less efficient... i.e. any savings from matching closer to the modification are lost because of the overall larger file. [...]

This trend will also affect several other aspects of systems and network administration. We are rapidly approaching a day when most application files stored in home directories and shared work areas will be compressed. This means that those areas will not benefit from network or filesystem compression. And our so-called 200GB tape drives will barely exceed 1:1 compression and only hold 100GB of these types of files.
I expect non-application files to remain uncompressed for the foreseeable future, but we should recognize that the character of the data stored is changing in ways that disrupt the assumptions many of our tools are built upon.

I think that the increased use of compressed files is going to require that rsync-like tools become compression aware, and be smart enough to decompress/recompress files when syncing them. I see no way around it, other than throwing heaps of bandwidth at the problem :-). Needless to say this will make the load on servers even worse. However, server-side signature caching and client-side delta calculation would probably end up making the load on servers even lower than it currently is. [...]

I don't think it is possible to come up with a scheme where the reset windows could re-sync after a change and then stay synced until the next change, unless you dynamically alter the compression at sync time... you may as well rsync the decompressed files.

The only way to do it is to make a content-aware compressor that compresses large chunks and then pads the compresstext to an aligned offset. That would be too much waste to be a good compression system.

Even this wouldn't do it... the large chunks would have to be split on identical boundaries over unchanged uncompressed text in the basis and the target. The only way this could be achieved would be if the target was compressed using resets on boundaries determined by analysing the changes and boundaries used when the basis was compressed. If the end that has the target file has that degree of intimate knowledge about the other end's basis file, then you can toss the whole rsync algorithm and revert to some sort of compressed xdelta.

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
Re: Compressed backup
On Sat, Jun 01, 2002 at 04:57:15AM -0700, jw schultz wrote: On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote: On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote: [...]

When I said content-aware compressor, what I meant was that the compressor would actually analyse the plaintext to find semantically identifiable blocks. For example, a large HOWTO could be broken up by the level-2 headings. This would be largely (not always) consistent across plaintext changes without requiring any awareness of file history. Have rsync be compression aware. When rsync hits a gzipped file it could treat that file as multiple streams in series, where it would restart the checksumming each time the compression table is reset.

That's actually pretty clever... It provides a neat way around the "XML meta-data at the front changing" problem... just have the XML compression reset at suitable boundaries in the XML file...

You wouldn't need rsync to be compression aware. Provided the adjacent unchanged segments between compression resets were larger than the rsync blocksize, you would only miss block fragments at the beginning and end of each matching sequence (as rsync already does). However, by making rsync gzip-reset aware, you could do interesting things with variable block sizes, where the file itself specifies the rsync block boundaries. Hmmm, more thought needed on how this would integrate with the rolling checksum though... I suspect you could toss the rolling checksum and just look for matching blocks as defined by the resets, because it's not going to match at an arbitrary byte boundary anyway.

I can't see this actually happening, but it could work where the compression is done by the application that creates the file. If, and only if, that were to be done so that there were enough to be worthwhile, then rsync could be made compression aware in this way, but that would require a protocol change.
Putting content-aware compression resets at appropriate points will make files (a bit) rsyncable with rsync as it already stands. Making rsync compression-reset aware would only improve things a little bit. However, the resets _must_ be content-aware; they must occur on likely-to-match boundaries, not at every 4K of compressedtext as the gzip-rsyncable patch currently does.

Your idea of having rsync actually do the checksums on the plaintext of compressed files might have some merit in future. It would mean essentially that we would zcat the source twice and the destination would be ungzipped, merged and then regzipped. Ghastly as far as CPU goes, but it would help save us network bandwidth, which is growing at a lower rate. The questions are, what is the mean offset of first change as a proportion of file size, and are enough files gzipped to merit the effort?

This is what I believe xdelta already does for compressed files. The deltas are computed for the uncompressed data on files that are identified as compressed. I don't see why you would zcat the source twice... The destination would ungzip the basis file, calculate and send a sig. The source would zcat the target, feeding it through the rolling checksum, looking for matches and sending the delta to the destination. The destination would receive and apply the delta, then gzip the result.

--
-- ABO: finger [EMAIL PROTECTED] for more info, including pgp key
--
Re: Compressed backup
On Sat, Jun 01, 2002 at 05:18:42PM -0700, jw schultz wrote: On Sat, Jun 01, 2002 at 11:46:37PM +1000, Donovan Baarda wrote: On Sat, Jun 01, 2002 at 04:57:15AM -0700, jw schultz wrote: On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote: On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote: [...]

putting content-aware compression resets at appropriate points will make files (a bit) rsyncable with rsync as it already stands. Making rsync compression-reset aware would only improve things a little. However, the resets _must_ be content-aware: they must occur on likely-to-match boundaries, not at every 4K of compressed text as the gzip-rsyncable patch currently does.

chunk: section of file gzipped without reset. block: section of file that dest has checksummed.

good terminology :-) let me add: basis: the oldfile on the destination that is being updated to the target. target: the newfile on the source.

The buy-back of the rolling checksums is to allow subsequent blocks to match at different offsets when a length change occurs mid-file. Without gzip-reset-aware variable blocksizes, rsync wouldn't gain anything from a gzip reset because the reset would occur in the middle of a block.

yes it would...

The rolling checksums would be droppable, however, because matching blocks within a gzip chunk would not be relocatable within that chunk.

yes they would, provided the block size is smaller than the compressed size of a sequence of unchanged adjacent chunks. The first and last block fragments of the matching sequence of chunks would be lost, but all the blocks in the middle would match. The rolling checksum would re-sync at the first complete matching block. Provided the context-sensitive location of compression resets ensures they occur at the same points in the target and basis around unchanged chunks, rsync will re-sync and find matches after any change, provided the matching sequence of chunks is larger than the blocksize. 
For any gain, rsync blocks need to have block-aligned offsets within the gzip chunks. There are two things block-aligned offsets in gzip chunks buy you: slightly better sigs for the basis, and slightly fewer missed block fragments in the delta. The reason you get better sigs is that any block with a reset in the middle has a useless checksum in the sig unless the chunks on both sides of the reset are unchanged. The reason you get a slightly better delta is that you avoid missing the block fragment at the beginning and end of a matching sequence of chunks. Both of these are related, and negligible provided the block size is small compared to the size of matching sequences of chunks. These problems already apply to rsync on uncompressed files, and are the reason xdelta can produce tighter deltas. Attempting to align compressor resets with block boundaries is akin to tweaking block sizes on uncompressed data: you can improve things a little, but the effort vs return is only really worth it for very special cases. If you align resets to blocks at the expense of locating them contextually, you will just make things worse.

The first reason the destination zcats twice is that we might lack the disk space to store the uncompressed files. The only reason to go to this much work for gzipped files is that there are many large ones; therefore, we dare not leave them around uncompressed, even when rsync works on one directory at a time. The other reason is that the scanning and generation processes should not alter the trees.

I don't think you can efficiently get away with just using zcat when merging deltas. The reason is that the basis has to be seekable... because rsync can find and use matches in files that have been re-ordered, you sometimes have to seek backwards. The only way I can see to 'seek' without a full decompression is to restart decompression every time you seek backwards. 
You can sort of improve this by preserving the decompressor state at various points so you can start seeking from there... I see a seekable-zcat class in my mind already :-)

conclusion: I don't think gzip-rsyncable helps much at all, certainly not enough to warrant including it in the official gzip (figures that demonstrate otherwise are welcome). Context-sensitive location of compression resets by applications that try to place resets at the same points around unchanged chunks will result in rsyncable compressed files (compressed-XML'ers take note: a compression reset after the meta-data section and at significant points in the XML could make a huge difference). Application-specific rsync-alike updates of compressed data with contextual location of compression resets will get better results if they toss the rsync rolling checksum and synchronise using sigs for 'chunks' rather than rsync's 'blocks'. The effort vs return of accurately locating compression resets contextually around unchanged 'chunks' depends heavily on the type of data. The best
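The "seekable-zcat class" mused about above might look something like this naive sketch: present a gzip-compressed byte string as a seekable plaintext stream, restarting decompression from the beginning on every backward seek. This is illustrative only; a real implementation would checkpoint decompressor state at intervals rather than always restarting from offset 0:

```python
import zlib

class SeekableZcat:
    """Toy seekable view of gzip-compressed data: plaintext reads and
    seeks, restarting decompression from the start on every backward
    seek (the naive scheme described in the discussion above)."""

    def __init__(self, compressed):
        self.compressed = compressed
        self._restart()

    def _restart(self):
        self.decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip container
        self.off = 0    # compressed bytes consumed so far
        self.buf = b""  # decompressed-but-unread plaintext
        self.pos = 0    # plaintext offset of the next byte read() returns

    def read(self, n):
        # Decompress more input until we can satisfy the request or hit EOF.
        while len(self.buf) < n and self.off < len(self.compressed):
            chunk = self.compressed[self.off:self.off + 4096]
            self.off += len(chunk)
            self.buf += self.decomp.decompress(chunk)
        out, self.buf = self.buf[:n], self.buf[n:]
        self.pos += len(out)
        return out

    def seek(self, offset):
        if offset < self.pos:
            self._restart()  # backward seek: decompress again from zero
        while self.pos < offset:
            if not self.read(min(offset - self.pos, 65536)):
                break  # seeking past EOF
```

With checkpointing added, backward seeks would only have to re-decompress from the nearest saved state instead of from the start, which is the trade-off hinted at above.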
Re: Improving the rsync protocol (RE: Rsync dies)
On Mon, May 20, 2002 at 09:35:04PM +1000, Martin Pool wrote: On 17 May 2002, Wayne Davison [EMAIL PROTECTED] wrote: On Fri, 17 May 2002, Allen, John L. wrote: [...] I've been thinking about this too. I think the top-level question is: start from scratch with a new protocol, or try to work within the current one?

tough question... to avoid breaking backwards compatibility and yet implement something significantly better, you would probably have to put two rsyncs in one executable: the new protocol, and the old one as a compatible fallback for talking to old versions. After enough time had passed and all old rsync implementations had been purged, the old code could be dropped, leaving a nice clean small solution. I tend to think that once a delta-compressing http extension gets mainstream acceptance, rsync will fade away _unless_ it offers significantly better performance by avoiding http overheads (which is why ftp lives on, despite being a bastard protocol from hell).

algorithm or codebase, or need to evolve the current one. I think the nature of the current protocol is that it will be hard to make really fundamental improvements without rewriting it. [...]

My feelings too.

I wrote librsync. There is some documentation, and I can add more if there's anything undocumented.

I really want to do some work on librsync. I recently did some work writing a Python swig wrapper for it and identified several areas where it could be improved. I already have a more modular implementation of the rolling checksum that is 2~3x faster, which I want to integrate with it.

I haven't looked at pysync as much as it deserves, but it could be a good foundation.

Pysync now includes a Python extension module to librsync. I have also implemented a Python wrapper around the rolling checksum mentioned above that makes the Python version run nearly 2x as fast as the pure-Python adler32 version. This will be released in the next release. 
I don't think Python is viable for a final rsync solution, even using librsync as an extension module. The Python interpreter is an unnecessary overhead for a lean-mean-data-transfer-machine. However, Python is _perfect_ as a means of experimenting with protocols and algorithms. Pysync was written with this in mind, and in ~400 lines of heavily commented Python implements both the rsync and xdelta algorithms. The next release will add inverse-rsync (for client-side delta calculation). Note that pysync _does_ implement the context compression used by rsync that is not included in librsync yet.
Re: timeout error in rsync-2.5.5
On Fri, May 03, 2002 at 04:41:17PM -0500, Dave Dykstra wrote: You shouldn't need to have such a long timeout. The timeout is not over the whole length of the run, only the time since the last data was transferred. It's a mystery to me why it quits after 66 minutes rather than 5 hours, but the real question is why it stops transferring data for so long. Perhaps something went wrong with the network. I can't connect to that server to try it; perhaps it is behind a firewall.

I have seen similar stalls with rsync running under apt-proxy with a timeout. However, in this case the rsync is just transferring a single file, so the directory transfer time is nothing, and the stall occurs after about 5 mins. With a timeout value of 15 mins, the rsync client just sits there until the 15 min timeout happens, when it dies. I suspect something going wrong with the timeout negotiation between the client and server. If I do the transfer manually without a timeout, it works fine. I haven't looked into this too hard yet, as I was not sure if it was something apt-proxy related, or perhaps a kernel 2.4.17 issue, or maybe something strange in the firewall in the middle, etc.; however, my recent test doing it manually suggests it's probably an rsync problem.

- Dave Dykstra

On Tue, Apr 16, 2002 at 12:36:03PM -0400, Alberto Accomazzi wrote: Dear all, I've been trying to track down a problem with timeouts when pulling data from an rsync daemon, and I have now run out of useful ideas. The problem manifests itself when I try to transfer a large directory tree on a slow client machine. What happens then is that the client rsync process successfully receives the list of files from the server, then begins checking the local directory tree, taking its sweet time. Since I know that the process is quite slow, I invoke rsync with a timeout of 5 hours to avoid dropping the connection. However, after a little over 1 hour (usually 66 minutes or so), the server process simply gives up. 
I have verified the problem under rsync versions 2.3.2, and 2.4.6 and up (including 2.5.5), testing a few different combinations of client/server versions (although the client is always a Linux box and the server always a Solaris box). It looks to me as if something kicks the server out of the select() call at line 202 of io.c (read_timeout) despite the timeout being correctly set to 18000 seconds. Can anybody think of what the problem may be? See all the details below. Thanks, -- Alberto

CLIENT:

[ads@ads-pc ~]$ rsync --version
rsync version 2.5.5 protocol version 26
Copyright (C) 1996-2002 by Andrew Tridgell and others
http://rsync.samba.org/
Capabilities: 64-bit files, socketpairs, hard links, symlinks, batchfiles, IPv6, 64-bit system inums, 64-bit internal inums
rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions. See the GNU General Public Licence for details.

[ads@ads-pc ~]$ rsync -ptv --compress --suffix .old --timeout 18000 -r --delete rsync://adsfore.harvard.edu:1873/text-4097/. /mnt/fwhd0/abstracts/phy/text/
receiving file list ... done
rsync: read error: Connection reset by peer
rsync error: error in rsync protocol data stream (code 12) at io.c(162)
rsync: connection unexpectedly closed (17798963 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(150)

SERVER:

adsfore-15: /proj/ads/soft/utils/src/rsync-2.5.5/rsync --version
rsync version 2.5.5 protocol version 26
Copyright (C) 1996-2002 by Andrew Tridgell and others
http://rsync.samba.org/
Capabilities: 64-bit files, socketpairs, hard links, symlinks, batchfiles, no IPv6, 64-bit system inums, 64-bit internal inums
rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions. See the GNU General Public Licence for details. 
from the log file:

2002/04/16 08:52:48 [18996] rsyncd version 2.5.5 starting, listening on port 1873
2002/04/16 09:39:01 [988] rsync on text-4097/. from ads-pc (131.142.43.117)
2002/04/16 10:51:36 [988] rsync: read error: Connection timed out
2002/04/16 10:51:36 [988] rsync error: error in rsync protocol data stream (code 12) at io.c(162)

from a truss:

adsfore-14: truss -d -p 988
Base time stamp: 1018964639.2848 [ Tue Apr 16 09:43:59 EDT 2002 ]
          poll(0xFFBE4E90, 1, 1800)  (sleeping...)
4057.4093 poll(0xFFBE4E90, 1, 1800)  = 1
4057.4098 read(3, 0xFFBE5500, 4)     Err#145 ETIMEDOUT
4057.4103 time()                     = 1018968696
4057.4106 getpid()                   = 988 [18996]
4057.4229 write(4, 2 0 0 2 / 0 4 / 1 6 1.., 66) = 66
4057.4345
Re: rsync md4sum code.
On Sun, Apr 28, 2002 at 01:06:10PM +1000, Donovan Baarda wrote: On Sat, Apr 27, 2002 at 03:32:47PM -0700, Martin Pool wrote: On 27 Apr 2002, Donovan Baarda [EMAIL PROTECTED] wrote: G'day, I've been working on a Python interface to librsync and have noticed that it uses md4sum code borrowed from Andrew Tridgell and Martin Pool that comes via rsync and was originally written for samba.

Tridge recently discovered a bug in that code that probably does not weaken the digest, but that may make it incompatible with standard MD4. Basically, tail-extension is not properly carried out for blocks that are a multiple of 64 bytes in size. This would be nearly all blocks, as everyone would be using 2^n sized blocks where n > 5.

If you meant to say "...that are _not_ a multiple of 64 bytes...", then I would dare to suggest fixing this would not hurt anybody, but definitely record the effects.

I haven't had a chance yet to check how this affects rsync. If it does, I suppose we should evolve the protocol to fix it. There's not meant to be anything special about it. One of my TODO items was to replace it with a faster implementation.

I'm not sure how the RSA implementation compares speed-wise, but given it is more correct, would there be major objections to replacing the samba md4 with the RSA one in librsync? I guess I should benchmark and publish results...

Ok, preliminary results are in. Because I'm primarily interested in Python interface stuff at the moment, I chose to make my benchmark by hacking together swig wrappers for the libmd md4c.c and the librsync mdfour.c, and whipping up a Python test rig. The wrappers should have identical overheads. I used pre-generated random data for the input. The results for a Celeron-366 doing 10K md4sums on 4K blocks of random data are: libmd 3.1 secs, librsync 2.5 secs. Both gave identical checksum results, so the bug Tridge found didn't rear its head. Conclusion: the librsync md4 is pretty fast. 
However, it was also slightly harder to swig-wrap in isolation from the rest of librsync, as it had a few tie-ins with the rs_trace stuff.
Re: [rproxy-devel] Re: rsync md4sum code.
On Sun, Apr 28, 2002 at 11:48:14PM +1000, Donovan Baarda wrote: [...] Ok, preliminary results are in. Because I'm primarily interested in Python interface stuff at the moment, I chose to make my benchmark by hacking together swig wrappers for the libmd md4c.c and the librsync mdfour.c, and whipping up a Python test rig. The wrappers should have identical overheads. I used pre-generated random data for the input. The results for a Celeron-366 doing 10K md4sums on 4K blocks of random data are: libmd 3.1 secs, librsync 2.5 secs. Both gave identical checksum results, so the bug Tridge found didn't rear its head.

Additional results: a libmd native module based on the existing Python md5module, instead of the swig wrapper: 1.6 secs.

Conclusion: the librsync md4 is pretty fast. However, it was also slightly harder to swig-wrap in isolation from the rest of librsync, as it had a few tie-ins with the rs_trace stuff. Also, the C implementation matters far less than how you interface it with Python. For the swig wrappers I was using shadow classes, which add an extra level of indirection. It looks like for the swig tests 1.5 secs was overhead, which makes the librsync md4 look even more impressive...
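For comparison, a minimal timing rig along the lines described above can be built today with hashlib instead of swig wrappers. Note that MD4 is only available when the underlying OpenSSL build still provides it (modern builds often relegate it to the legacy provider), so this sketch falls back to MD5; absolute timings will of course differ wildly from the Celeron-366 numbers quoted above:

```python
import hashlib
import os
import time

def bench(name, block, reps=10_000):
    """Time `reps` digests of `block` with the named hash; returns seconds."""
    start = time.perf_counter()
    for _ in range(reps):
        hashlib.new(name, block).digest()
    return time.perf_counter() - start

block = os.urandom(4096)  # one 4K block of random data, as in the test above
try:
    name = "md4"
    elapsed = bench(name, block)  # MD4 requires OpenSSL support
except ValueError:
    name = "md5"                  # fall back so the rig still runs
    elapsed = bench(name, block)
print("%s: %.2f secs for 10K digests of 4K blocks" % (name, elapsed))
```

Using the same rig for each candidate hash keeps the wrapper overhead identical between runs, which is the property the original swig-based benchmark was aiming for.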
rsync md4sum code.
G'day, I've been working on a Python interface to librsync and have noticed that it uses md4sum code borrowed from Andrew Tridgell and Martin Pool that comes via rsync and was originally written for samba. Is there anything special about this code compared to the RSA md4sum code that can be found with libmd (http://www.penguin.cz/~mhi/libmd/)? Python uses the RSA md5sum code for its md5 module, and making a Python md4 module was as simple as sed 's/\(md\)5/\14/i' over the Python interface code. I'm contemplating some cleanup code hacking on librsync and was wondering if the samba-based md4sum code was better than the RSA code, or if it is worth switching.
Re: rsync md4sum code.
On Sat, Apr 27, 2002 at 03:32:47PM -0700, Martin Pool wrote: On 27 Apr 2002, Donovan Baarda [EMAIL PROTECTED] wrote: G'day, I've been working on a Python interface to librsync and have noticed that it uses md4sum code borrowed from Andrew Tridgell and Martin Pool that comes via rsync and was originally written for samba.

Tridge recently discovered a bug in that code that probably does not weaken the digest, but that may make it incompatible with standard MD4. Basically, tail-extension is not properly carried out for blocks that are a multiple of 64 bytes in size. This would be nearly all blocks, as everyone would be using 2^n sized blocks where n > 5.

If you meant to say "...that are _not_ a multiple of 64 bytes...", then I would dare to suggest fixing this would not hurt anybody, but definitely record the effects.

I haven't had a chance yet to check how this affects rsync. If it does, I suppose we should evolve the protocol to fix it. There's not meant to be anything special about it. One of my TODO items was to replace it with a faster implementation.

I'm not sure how the RSA implementation compares speed-wise, but given it is more correct, would there be major objections to replacing the samba md4 with the RSA one in librsync? I guess I should benchmark and publish results...

There would be backwards compatibility issues for librsync and rdiff, but I'm hoping that these can be dealt with simply, hopefully by just bumping the major version number and documenting the issue. librsync is not as widely used as rsync itself, so I don't think it would matter that much. I think the RSA/libmd code is more standard, and it would be wise for librsync at least to adopt the more widely used code. It is certainly nice that the md2, md4, and md5 APIs are so interchangeable. The only reason not to would be licensing issues, but I would guess that if Python can include it and be GPL compatible, then it should be OK. 
Re: [rproxy-devel] rdiff deltas not very good compared to pysync, why?
On Wed, Apr 24, 2002 at 09:46:09PM -0400, Shirish H. Phatak wrote: Hi Martin, I can definitely help in managing the librsync package, especially since I will be maintaining our local tree anyway. As of now, I don't have the resources to get the Windows stuff tested; however, I can handle the Unix side of things. Maybe Donovan would be interested in managing the Windows effort?

I am also interested in helping take over librsync maintenance. Ordinarily I'm a pure Linux development person, but in this case I was sponsored by Accellion to do a Python wrapper that had to work under Windows. I'm also pretty familiar with how the algorithm works from my work on pysync.

I am fine with either SourceForge or samba.org. I have used the former quite extensively during InterMezzo development, but most of my usage was CVS access, so I don't think either site would make much of a difference.

I really like a bug-tracker and the extra stuff SF gives you. I also find SF is a fairly centralised place for it all. However, if samba.org can provide the same facilities...
Re: Patch: update popt/ to 1.5.1
On Sat, Apr 20, 2002 at 12:23:16PM -0700, Jos Backus wrote: This patch updates the files under popt/ to the latest vendor drop. The only change is the inclusion of a FreeBSD-specific patch to popt.c. This is needed in case somebody decides to build rsync on that platform without using the port. I'm not happy about the wording in popt/README.rsync so I may change it. The patch is available at http://www.catnook.com/rsync-popt-1.5.1.patch

I have just submitted a patch against librsync that included changes to the popt included in it for cygwin and MSVC compatibility. As near as I can tell, the popt I was patching was v1.3. Do you (or anyone else) know if the 1.5.1 version fixes cygwin and MSVC compilation? (By MSVC I mean doing configure CC=cl --build=windows32 in cygwin.) Should I be submitting my changes somewhere else upstream?
Re: Can rsync update files in place?
On Tue, Apr 16, 2002 at 09:43:12PM +0200, Paul Slootman wrote: On Tue 16 Apr 2002, Martin Pool wrote: On 16 Apr 2002, Paul Slootman [EMAIL PROTECTED] wrote: Is there a way to get rsync to not create a new file while transferring and then rename it, but to instead update the existing file in place, i.e. simply write those blocks that have been updated and leave the rest alone?

That would be pretty hard, though not impossible. I suspect it would be roughly as hard to put it into rsync as to write a new program from scratch, depending on what features you need.

That's what I've been thinking :-)

If someone wanted to experiment with this kind of thing, pysync (shameless plug) would be an ideal prototyping tool. It's a Python implementation of rsync that I wrote for the express purpose of being an easy way to demonstrate and experiment with the algorithm. It includes both rsync and xdelta algorithms, and is about 500 lines in total, but half of that is comments. http://freshmeat.net/projects/pysync/

I am currently working on including a swig'ed wrapper for librsync, which should mean you could use it for real-world (Python) applications. The plan is to have this ready by the end of this week.
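A tiny sketch of why in-place update is tricky: if the delta is applied directly to the buffer holding the old file, a copy command whose source region has already been overwritten reads the wrong bytes, so commands would need careful ordering (or temporary buffering). The command format here is purely hypothetical, not rsync's wire format:

```python
def patch_in_place(buf, commands):
    """Apply ('copy', src_offset, length) / ('literal', bytes) commands
    directly to the bytearray holding the old file.  Copies read from
    the *current* buffer state, so any copy whose source region has
    already been overwritten by an earlier command is silently
    corrupted -- the ordering hazard that makes true in-place rsync
    hard."""
    pos = 0
    for cmd in commands:
        if cmd[0] == "literal":
            data = cmd[1]
        else:
            _, src, n = cmd
            data = bytes(buf[src:src + n])  # snapshot before we overwrite
        buf[pos:pos + len(data)] = data
        pos += len(data)
    del buf[pos:]  # truncate to the new file's length
    return buf
```

For example, replacing a prefix and then copying a later region still works because that region is untouched when the copy runs; copying from a region the patch has already written over would not.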
Re: rsync and debian -- summary of issues
On Fri, Apr 12, 2002 at 01:28:01PM +1000, Martin Pool wrote: On 12 Apr 2002, Brian May [EMAIL PROTECTED] wrote: I think some more detail is required regarding rproxy. [...] AFAIK, it solves all the problems regarding server load discussed in rsync, doesn't it???

Why did you think that? rproxy addresses a rather different problem to rsync: for example, it transfers only one file at a time, not whole directories. No, rproxy does not have a magic way to do the delta computation in zero time. Compared to rsync, rproxy has the advantage of cleaner code (imho), but the disadvantage that no optimization work has been done.

The big problem with rproxy is that it's implemented in Perl (perl: crypto for algorithms). librsync on the other hand is a nice piece of work and _should_ be used for a re-implementation of rproxy and/or rsync. I have recently got sponsorship to add a Python front-end to librsync as an extension to my pysync code, which I should have done by the end of next week. After this I will probably do some work on adding rproxy http extensions into something like medusa or twisted.

If rproxy includes the signature in a HEAD response and supports ranged requests, delta calculation can be moved to the client with no changes to rproxy, as follows:

1) the client sends a HEAD request to get the upstream signature.
2) the client does the reverse-delta calculation.
3) the client applies the reverse delta, using ranged requests to fetch the required parts from upstream.

You touched on this in your page, but not in relation to rproxy. I believe the client-side delta stuff you mentioned was not using rproxy http extensions at all, just adding some sort of cgi that returns signatures for objects on the server.
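Step 3 of the flow above can be sketched without any HTTP machinery: a hypothetical reverse delta is a list of ('local', bytes) parts the client matched against its own basis file, and ('fetch', offset, length) parts it must pull from upstream, with `fetch_range` standing in for an HTTP ranged GET. None of this is rproxy's actual wire format; it only illustrates the shape of the idea:

```python
def apply_reverse_delta(delta, fetch_range):
    """Rebuild the upstream target from a client-computed reverse delta.
    ('local', data) entries were found in the client's basis file;
    ('fetch', offset, length) entries are byte ranges the client must
    request from upstream (fetch_range stands in for a ranged GET)."""
    out = []
    for entry in delta:
        if entry[0] == "local":
            out.append(entry[1])
        else:
            _, off, length = entry
            out.append(fetch_range(off, length))
    return b"".join(out)

# Illustrative use: the 'upstream' target, with a ranged-GET stand-in.
target = b"new header|shared middle section|new footer"
fetch = lambda off, n: target[off:off + n]  # i.e. Range: bytes=off-(off+n-1)
delta = [("fetch", 0, 11),
         ("local", b"shared middle section"),  # already present in the basis
         ("fetch", 32, 11)]
rebuilt = apply_reverse_delta(delta, fetch)
```

The appeal of this arrangement is that only the changed byte ranges cross the wire, while the server stays a dumb HTTP server plus a signature endpoint.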
Re: Rsync and Java
On Sun, Mar 11, 2001 at 08:52:25PM -0500, Andre John Mas wrote: Martin Pool wrote: On 11 Mar 2001, Andre John Mas [EMAIL PROTECTED] wrote: I am looking at possibly doing an rsync port to Java, depending on time constraints, and I would like to know whether anyone has already set up such an effort. If there is one already under way, then it wouldn't make sense for me to duplicate the effort.

I don't think there is one yet. You're welcome to do it. I think Java would work quite well for this sort of thing. There is some example code in other languages (e.g. pysync) that may guide you.

I will look at the Python code, as suggested, as this should help me understand the necessary components.

A little warning about pysync... it was developed without attempting to be compatible in any way with existing rsync/xdelta/libhsync etc. It performs the same algorithms as they do, but the data formats take advantage of Python data structures. It also favors algorithmic cleanness over performance tweaks; for example, it uses a pure adler32 for the rolling checksum instead of the hybrids used by rsync and xdelta. It also has zero network-transport stuff; it just calculates signatures, calculates deltas, and applies them. Any network transport stuff is left to the reader as a simple exercise :-)

This has its advantages: the code is very simple. It is now just over 500 lines, of which nearly 50% are comments, and this includes both rsync and xdelta style deltas. Actually, the xdelta delta stuff is new... I'll put up a new release that has it in it right now. I've also done some tweaks that have made it 33% faster than version 1.2...
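The "pure adler32" rolling checksum mentioned above works roughly like this sketch: compute the two Adler-32 sums over a window once, then slide the window one byte at a time with an O(1) update instead of re-summing. This is an illustration of the technique (verified against zlib.adler32 below), not pysync's actual code:

```python
BASE = 65521  # largest prime below 2**16, as in Adler-32

class RollingAdler:
    """Rolling Adler-32 over a fixed-size window."""

    def __init__(self, block):
        self.n = len(block)
        # a = 1 + sum of bytes; b = sum of the running values of a.
        self.a = (1 + sum(block)) % BASE
        self.b = (self.n + sum((self.n - i) * c for i, c in enumerate(block))) % BASE

    def roll(self, out_byte, in_byte):
        # O(1) update when the window slides one byte to the right.
        self.a = (self.a - out_byte + in_byte) % BASE
        self.b = (self.b - self.n * out_byte + self.a - 1) % BASE

    def digest(self):
        return (self.b << 16) | self.a

data = b"the quick brown fox jumps over the lazy dog"
r = RollingAdler(data[:16])
r.roll(data[0], data[16])  # slide the 16-byte window one byte: O(1), not O(16)
```

The O(1) roll is what lets rsync-style matchers try a checksum at *every* byte offset of the target without the cost of rehashing each window from scratch.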
[Module] pysync 1.2
pysync 1.2 -- A Python implementation of the rsync algorithm. This is a demonstration implementation of the rsync algorithm in Python. It is not fast and is not optimised. The primary aim is to provide a simple example implementation of the algorithm for reference, so code clarity is more important than performance. Ideas have been liberally taken from libhsync, xdelta and rsync.

Release 1.2 introduced the new zlib-like API, allowing for incremental calculation of deltas and applying of patches. The comments at the top of pysync.py explain it all:

    # Low level API signature calculation
    sig = calcsig(oldfile)
    # Low level API rsync style incremental delta calc from sig and newdata
    delta = rdeltaobj(sig)
    # or for xdelta style incremental delta calc from oldfile and newdata
    # delta = xdeltaobj(oldfile)
    incdelta = delta.calcdelta(newdata)
    :
    incdelta = delta.flush()
    # Low level API applying incremental delta to oldfile to get newdata
    patch = patchobj(oldfile)
    newdata = patch.calcpatch(incdelta)
    :

The rdeltaobj.flush() method supports R_SYNC_FLUSH and R_FINISH flush modes that behave the same as their zlib equivalents. Next on the TODO list is incremental signature calculation, and further cleanups. Eventually I plan to create a md4sum module and move the rolling checksum stuff into C code.

The performance has been marginally hurt by this new API. Interestingly, the Python profiler shows that most of the time is wasted performing string copies when taking slices from input buffers, not actually doing the rsync. This suggests that significant performance increases might be achievable by re-arranging things a bit, rather than moving Python code into C.

I have also added a pysync-test.py script for thorough formal testing of pysync. It generates/reuses random test files that make pysync really work hard, verifying that it behaves as it should. 
Incidentally, release 1.2 also fixed a rather embarrassing bug introduced in release 0.9's adler32.py that corrupted the rolling checksums, resulting in heaps of missed matches. This bug caused seriously bad performance and very large deltas.

URL: http://freshmeat.net/projects/pysync/
Download: ftp://minkirri.apana.org.au/pub/python/pysync/pysync-1.2.tar.bz2
License: LGPL
Categories: Encryption/Encoding

Donovan Baarda ([EMAIL PROTECTED]) http://sourceforge.net/users/abo/
Re: Q (feat. req.) rsync: using UDP, multicast
Quoting Martin Pool [EMAIL PROTECTED]: On 14 Feb 2001, "Ph. Marek" [EMAIL PROTECTED] wrote: Hello once more, [...] I know that's not easily achievable. But I think it could suffice to say "push the changes made from backup/filexyz to current/filexyz on IP 224.x.x.x:y" and everywhere else "use IP 224.x.x.x:y to receive changes". [...]

Hmm, thinking about this, I think that could be done using diff and spooling the difference on multicast.

Sounds like an application for xdelta. If you have copies of both backup and current on the "source" machine, xdelta can calculate a more optimal delta than rsync. Then you just broadcast the delta.
Re: 60 byte file hangs solaris 8 rcp
On Mon, Jan 15, 2001 at 10:00:31AM -0600, Dave Dykstra wrote: We have run across a Solaris 8 system that will hang the TCP connection whenever we try to rcp in the attached 60-byte file from any other system. The system administrator is not very responsive, so I don't know if he has contacted Sun about it or not yet, and I don't have easy access to another Solaris 8 system. There must be something special about the data, because it started out in another portion of a larger file (first noticed by rsync) and we were able to narrow it down to this small piece. I also don't know if there is any special networking hardware on this machine; all I know is that it is an Enterprise 450 with dual processors.

Wow! That must be the definitive example of "narrowing it down".
pysync: Python implementation of rsync algorithm.
G'day again, The first version of pysync turned out to be a bit buggy... release early and release often is my excuse :-) Actually, I tripped up over an obscure bug in Python's zlib to do with decompression of sync flushes. The workaround is to decompress in smaller chunks. This makes the code more complicated, but also makes it a bit less memory hungry when applying deltas to large files. There was another bug I've fixed that would cause patches to fail if the first block wasn't a match. I've also cleaned up the code a little, and tested it a lot more this time. I've even tested it on some pretty big files, and I'm pleased that it doesn't seem to be as slow as I thought it would be. The new version is at:

ftp://minkirri.apana.org.au/pub/python/pysync-0.6.tar.bz2
Python implementation of rsync algorithm, was Re: rdiff - does it exist?
On Wed, Dec 06, 2000 at 11:43:18AM +1100, Donovan Baarda wrote: G'day, Quoting Martin Pool [EMAIL PROTECTED]: [...] I'm changing libhsync to have an API much like zlib/bzlib, in the hope that this will make network app integration simpler. [...] I am 90% of the way through an "example implementation" of exactly this in pure Python. I should be finished within a week (it's only a couple of hours' work).

Well, I finally got around to finishing this off enough to seek comment. I have implemented the rsync algorithm in pure Python. It's not fast, but it works, and it is very simple. The whole thing is about 300 lines, and half of those are explanations and musings in comments. I've basically implemented the command-line API Martin was talking about... so you can generate signature files, generate delta files, and apply delta files. There is no network transport stuff, though experienced pythonistas could easily whip something up if they wanted to. The signature and delta files are just pickled Python objects, so they are not compatible with anything else.

My bandwidth sucks a bit, so please be gentle. If I start getting hammered I'll put it up on SourceForge code-snippets or something. For now, it can be downloaded from:

ftp://minkirri.apana.org.au/pub/python/pysync-0.3.tar.bz2

I must confess that the source to xdelta, libhsync, and rsync all scared me a bit, which is why I thought I'd write a simple version. Experienced rsync coders, please don't laugh, but I'd appreciate constructive criticism. In particular, I'd like to know if I've missed any tricks in the algorithm. I'd like to think that this could form a framework for experimenting with various tweaks to the algorithm, but I'm probably dreaming :-) Feel free to forward this to anyone or any other lists you might think would be interested...