This is a call for comments, regarding what you do expect when copying multible source tree roots leading to the same directory root, using rsync.
This problem may be discussed now, because in versions before rsync-2.5.6, the algorithm for removing the so called "duplicated files" was broken. That's why we expect nobody used it anyway in earlier versions - but who knows.. Example: -------- Whe have treeA/ treeB/ treeC/ and treeD/, which may or may not have common files or directories in the depth. the contense of a file in treeA/etc/etctest for e.g. may differ from treeD/etc/etctest. In my example, treeA/etc/etctest contains the line "treeA", and alike. Let's look what's the way of standard unix programs: $ mkdir /tmp/test $ cp -r tree*/* /tmp/test/ $ cat /tmp/test/etc/etctest treeD $ rm -rf /tmp/test/* $ scp -r tree*/* thomas@localhost:/tmp/test [..] $ cat /tmp/test/etc/etctest treeD $ Using these programs, the last reference of a file wins. Nb: this is, what a user should suspect. Note that each duplicated file has been copied / transmitted. rsync has routine which removes duplicates from given source-trees before actually transmitting the contense to the destination. Due to a bug in earlier versions, it was hardly predictable which file from which source-tree would win. Especially, if for e.g. some trees in the middle or the last referenced trees did not have such a file. Wayne Davison has done a fix for version 2.5.6. A while ago, i advised a patch to the list (http://www.mail-archive.com/rsync@lists.samba.org/msg03117.html). Our both patches do fix the problem, but with a contrary result: With my patch the last reference of a duplicated file wins (in my example, treeD's etc/etctest); with Wayne's fix the first reference succeeds (treeA's etc/etctest). When looking at the code, i assume my proposal was initially intended; then, some years ago, the code was changed and the duplicate file deletion mechanism went broken. The project where i use rsync relys on a correctly working and predictable file-deletion mechanism. Therfore, this method should not change or flip/flop in future versions. That's why i like to discuss the both approaches, leading to one final proposal. If you're interested in detail, read my discussion with Wayne, who allowed me to quote. Regards, - Thomas Osterried ===== My proposal for a change for rsync-2.5.6: ----------------------------------------- diff -r rsync-2.5.6/flist.c ../rsync-2.5.6-patched/flist.c 1274c1274 < name, i); --- > prev_name, i-1); 1281c1281 < flist->files[i][0] = null_file; --- > flist->files[i-1][0] = null_file; 1283c1283 < free_file(flist->files[i]); --- > free_file(flist->files[i-1]); ===== Our discussion -------------- My mail: Subject: rsync: duplicate name deletion - problems with rsync-2.5.6 Dear Wayne, i'd like to diskuss some problems with your fix of rsync-2.5.6: | * Fixed a crash that would occur when sending a list of files that | contains a duplicate name (if it sorts to the end of the file | list) and using --delete. (Wayne Davison) my initial proposal for a fix i posted, as suggested at the project's web site, to the rsync mailing list was: http://www.mail-archive.com/rsync@lists.samba.org/msg03117.html unfortunately, the only response to this issue was "Huh? I don't care about this at all." our both fixes for clean_flist() work. but with your fix, the clearing process will delete the second and following references of duplicated files. while in my approach the last reference of a duplicated file wins. i think, that deleting all but the last reference - is the right way because it's exactly what you expect and are experienced with, for e.g. when copying with "cp tree*/etc/etctest dest/etc/") - was the idea of the author of clean_flist(), because - he annonced rprintf(FINFO, "removing duplicate name %s from file list %d\n", f_name(flist->files[i - 1]), i - 1); - because he startet his for() loop with i = 1, comparing the backward list-entries at i-1. the only mistake was, to free the current list entry instead of the previous. see 1). because the rsync protocol version number (26) is the same, i'm unable to detect which behaviour of rsync considering duplicates i should expect at the target site. personaly, i would much appreciate that the last reference of a duplicated file wins. we use rsync for synchronizing a pool of 174 computers at university. rsync is the heart of the conzept where we have one linux installation on a master server (the "class tree"), several hardware- and configuration- memberships ("group trees") and machine-specific configurations. here we make intensively use of rsync's file-deletion mechanism, and rely on the perfect functionality. as an example, an rsync process with its arguments: 21934 ? R 0:00 /simpel2/public/install-linux-i386/rsync -a -W --timeout=1800 --numeric-ids --port 288 -v --stats --delete --exclude-from=/simpel2/adm/exclude_from_class_tree.txt /simpel2/trees/classes/linux-i386/ws/ /simpel2/trees/kernel/linux-2.4.19/modules-i386/ /simpel2/trees/groups/auto/linux-lilo-hda/ /simpel2/trees/groups/all-defaults/ /simpel2/trees/groups/linux-defaults/ /simpel2/trees/groups/all-lilo/ /simpel2/trees/groups/all-inf/ /simpel2/trees/groups/linux-X11sis/ /simpel2/trees/machines/ws/inf.fu-berlin.de/pc145/ /simpel2/public/update-all/ [EMAIL PROTECTED]::simpel-update/ my suggestion is, to change your fix like in 2). Appendix 1: ----------- this bug lead to a flipflop sh-2.05a# /export/local-1/src/rsync-origs/rsync-2.5.5/rsync -a -W --numeric-ids --delete treeA/ treeB/ treeC/ treeD/ dest/ sh-2.05a# cat dest/etc/etctest treeA sh-2.05a# /export/local-1/src/rsync-origs/rsync-2.5.5/rsync -a -W --numeric-ids --delete treeA/ treeB/ treeC/ treeD/ dest/ sh-2.05a# cat dest/etc/etctest treeC sh-2.05a# /export/local-1/src/rsync-origs/rsync-2.5.5/rsync -a -W --numeric-ids --delete treeA/ treeB/ treeC/ treeD/ dest/ sh-2.05a# cat dest/etc/etctest treeA sh-2.05a# ..and is also wrong, because i expected treeD. if i delete treeC, then it flipflops between treeA and treeD. that's clear du to the fact that the current list reference was free'd instead of the previous list entry, and thus the next while-loop could not detect that the current reference is also a duplicate of the previous or pre-previous. [..] ----- Waye: Subject: Re: rsync: duplicate name deletion - problems with rsync-2.5.6 > my initial proposal for a fix i posted, as suggested at the project's > web site, to the rsync mailing list was: > http://www.mail-archive.com/rsync@lists.samba.org/msg03117.html > unfortunately, the only response to this issue was "Huh? I don't care > about this at all." I wish I had remembered that post before I worked up a patch, but I didn't. I recall at the time being surprised that rsync even allowed --delete to work in a directory-merging situation, but since I wasn't working on the rsync project at the time, I obviously forgot about it. One thing I did do when I worked up the patch was to look through CVS. I discovered that the change occurred in revision 1.26 -- originally released as rsync 1.7.0. Because of this, I thought that the safest fix would be to continue on with the current practice of deleting the second entry from the list rather than the first. In tridge's CVS comments he doesn't mention making this change consciously, but I couldn't discount that it was intentional, so I left it as it had been for all the recent versions. I also didn't bring the issue up on the list, which was a mistake. > personaly, i would much appreciate that the last reference of a > duplicated file wins. Unfortunately for us, we are doing duplicate-removal on a sorted list of filenames, so we can't guarantee that the order of duplicate-removal follows the order specified on the command-line. For instance, if, in your following example, you had specified the directories in the order "treeB/ treeD/ treeA/ treeC/", the current duplicate-removal would leave treeA's version, and your suggested fix would leave treeD's version. The only fix for this would be to override the ASCII path sort with a numerical-ID when the names come from the command-line (or something like that). I don't know if this would be worth the trouble. > my suggestion is, to change your fix like in 2). I think we should discuss this on the list and see what people think. I would only hesitate to change it because of how long the change has been released this way. On the positive side, this change does not affect most people's use of rsync, so it would probably be safe. > [example merge combined with] this bug lead to a flipflop The flipflop was because rsync wasn't deleting all the duplicates, just every other extra name out of the sorted list. This is a bug that got fixed in the latest release, so the flipflop is now gone, and the result will always be from treeA (at the moment). Thanks for bringing this issue up. If you would care to bring the issue to the wider audience of the rsync mailing list, you may feel free to quote any of this reply. Or just let me know what you think of my comments. ..wayne.. ----- My answer: On Wed, Feb 05, 2003 at 11:18:24AM -0800, Wayne Davison wrote: > On Wed, Feb 05, 2003 at 02:40:25PM +0100, Thomas Osterried wrote: > I wish I had remembered that post before I worked up a patch, but I > didn't. I recall at the time being surprised that rsync even allowed > --delete to work in a directory-merging situation, but since I wasn't > working on the rsync project at the time, I obviously forgot about it. ;) when i looked how to build our sync-list, i was glad that rsync, which we already used in an early "flat hierarchy" version of our project, could do this "out of the box". what i mentioned with flipflop, yes this was in pre 2.5.6 versions (before your official and my private fix). because it was broken and not determistic, i think nobody used it in a critical environment - and if (and fixed it), we have a 50/50 chance if he changed it in your way or in mine ;) > > personaly, i would much appreciate that the last reference of a > > duplicated file wins. > > Unfortunately for us, we are doing duplicate-removal on a sorted list of > filenames, so we can't guarantee that the order of duplicate-removal > follows the order specified on the command-line. For instance, if, in we use this system for abt 1.5 years now. experience by practise is, that the sort-algorithm leaves the original priority list intact. looking at the code, i think you mean the result of qsort() could change the position of the duplicate files in respect to the source tree and my "priority list" _may_ change? > your following example, you had specified the directories in the order > "treeB/ treeD/ treeA/ treeC/", the current duplicate-removal would leave > treeA's version, and your suggested fix would leave treeD's version. > The only fix for this would be to override the ASCII path sort with a > numerical-ID when the names come from the command-line (or something > like that). I don't know if this would be worth the trouble. no, the alphanumeric name of the source directory has and should have no influence, but the position in the list. in the list "treeB/ treeD/ treeA/ treeC/" the result will be - treeB with your patch (first reference wins) - treeC with my proposal (last reference wins) let's see (should work): your rsync: $ /export/local-1/src/rsync-origs/rsync-2.5.6/rsync -a -W --numeric-ids --delete treeB/ treeD/ treeA/ treeC/ dest/ $ cat dest/etc/etctest treeB $ my patche'd rsync 2.5.5: $ /export/local-1/src/rsync-2.5.5/rsync -a -W --numeric-ids --delete treeB/ treeD/ treeA/ treeC/ dest/ $ cat dest/etc/etctest treeC $ my proposed change from Appendix 2 of my last mail in your rsync 2.5.6 version: $ /export/local-1/src/rsync-2.5.6-patched/rsync -a -W --numeric-ids --delete treeB/ treeD/ treeA/ treeC/ dest/ $ cat dest/etc/etctest treeC $ fine. > > my suggestion is, to change your fix like in 2). > > I think we should discuss this on the list and see what people think. I > would only hesitate to change it because of how long the change has been > released this way. On the positive side, this change does not affect > most people's use of rsync, so it would probably be safe. and because due to the bug the result was -for the last years- inpredictable. but i agree, i think it's a good idea to post and wait for responses on the list. i'll do this job. [..] -- To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html