On 21 Jul 2002, jw schultz <[EMAIL PROTECTED]> wrote:
> From what i can see rsync is very clever. The biggest
> problems i see with its inability to scale for large trees,
> a little bit of accumulated cruft and featuritis, and
> excessively tight integration.

Yes, I think that's basically the problem.

One question that may (or may not) be worth considering is to what
degree you want to be able to implement new features by changing only
the client. So with NFS (I'm not proposing we use it; it's only an
example), you can implement any kind of VM or database or whatever on
the client, and the server doesn't have to care. The current protocol
is just about the opposite: the two halves have to be quite intimately
involved, so adding rename detection would require not just small
additions but major surgery on the server.

> What i am seeing is a Multi-stage pipeline. Instead of one
> side driving the other with comand and response codes each
> side (client/server) would set up a pipeline containing
> those components that are needed with the appropriate
> plumbing. Each stage would largly look like a simple
> utility reading from input; doing one thing; writing to
> output, error and log. The output of each stage is sent to
> the next uni-directionally with no handshake required.

So it's like a Unix pipeline? (I realize you're proposing pipelines
as a design idea, rather than as an implementation.)

So, we could in fact prototype it using plain Unix pipelines? That
could be interesting.

Choose some files:

  find ~ | lifter-makedirectory > /tmp/local.dir

Do an rdiff transfer of the remote directory to here:

  rdiff sig /tmp/local.dir /tmp/local.dir.sig
  scp /tmp/local.dir.sig othermachine:/tmp
  ssh othermachine 'find ~ | lifter-makedirectory | rdiff delta /tmp/local.dir.sig - ' \
    >/tmp/remote.dir.delta
  rdiff patch /tmp/local.dir /tmp/remote.dir.delta /tmp/remote.dir

For each of those files, do whatever:

  for file in `lifter-dirdiff /tmp/local.dir /tmp/remote.dir`
  do
    ...
  done

Of course the commands I've sketched there don't fix one of the key
problems, which is that of traversing the whole directory up front,
but you could equally well write them as a pipeline that is gradually
consumed as it finds different files.

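To make "gradually consumed" a bit more concrete, here is a minimal Python sketch. It is only an illustration: the function names and the (path components, size, mtime) record are made up, and it assumes both sides can produce their listings in the same sorted order. The two streams are merged lazily, so differences come out before either tree has been fully traversed.

```python
import os
from typing import Iterable, Iterator, Tuple

# Hypothetical record: (path components, size in bytes, mtime in ns).
Entry = Tuple[Tuple[str, ...], int, int]


def scan(root: str, prefix: Tuple[str, ...] = ()) -> Iterator[Entry]:
    """Lazily yield regular files under root, ordered by path components."""
    with os.scandir(root) as it:
        children = sorted(it, key=lambda e: e.name)
    for child in children:
        parts = prefix + (child.name,)
        if child.is_dir(follow_symlinks=False):
            # Recurse in place so the overall order matches tuple comparison.
            yield from scan(child.path, parts)
        elif child.is_file(follow_symlinks=False):
            st = child.stat(follow_symlinks=False)
            yield parts, st.st_size, st.st_mtime_ns


def different_files(
    local: Iterable[Entry], remote: Iterable[Entry]
) -> Iterator[Tuple[str, str]]:
    """Merge two sorted entry streams, yielding (status, path) as soon as known."""
    a, b = iter(local), iter(remote)
    x, y = next(a, None), next(b, None)
    while x is not None or y is not None:
        if y is None or (x is not None and x[0] < y[0]):
            yield "local-only", "/".join(x[0])
            x = next(a, None)
        elif x is None or y[0] < x[0]:
            yield "remote-only", "/".join(y[0])
            y = next(b, None)
        else:
            # Same path on both sides: compare size and mtime only.
            if x[1:] != y[1:]:
                yield "changed", "/".join(x[0])
            x, y = next(a, None), next(b, None)
```

In a real design the remote stream would arrive over the wire rather than from a local call, but the consumer could look much the same.
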
Imagine

  lifter-find-different-files /home/mbp/ othermachine:/home/mbp/ | \
    xargs -n1 lifter-move-file ....

(I'm just making up the commands as I go along; don't take them too
seriously.)

That could be very nice indeed.

I am just a little concerned that a complicated use of pipelines in
both directions will make us prone to deadlock. It's possible to
cause local deadlocks if e.g. you have a child process with both stdin
and stdout connected to its parent by pipes. It gets potentially more
hairy when all the pipes are run through a single TCP connection.

I don't think that concern rules this design out by any means, but we
need to think about it. One of the design criteria I'd like to add is
that it should preferably be obvious by inspection that deadlocks are
not possible.

> timestamps should be represented as seconds from
> Epoch (SuS) as unsigned 32 int. It will be >90 years
> before we exceed this by which time the protocol
> will be extended to use uint64 for milliseconds.

I think we should go to milliseconds straight away: if I remember
correctly, NTFS already stores files with sub-second precision, and
some Linux filesystems are going the same way. A second is a long
time in modern computing! (For example, it's possible for a command
started by Make to complete in less than a second, and therefore
apparently not change a timestamp.)

I think there will be increasing pressure for sub-second precision in
much less than 90 years, and it would be sensible for us to support it
from the beginning. The Java file APIs, for example, already work in
milliseconds.

Transmitting the precision of the file sounds good.

> I think by default user and groups only be handled
> numerically.

I think by default we should use names, because that will be least
surprising to most people. I agree we need to support both.

Names are not universally unique, and need to be qualified, by a NIS
domain or NT domain, or some other means.

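As a rough illustration of qualified names, here is a small Python sketch. It assumes a "name@domain" string form and an admin-supplied table of equivalences; the class, its methods, and every name in the usage example are hypothetical, not part of any existing protocol.

```python
from typing import List, Optional, Set, Tuple


def split_principal(principal: str) -> Tuple[str, Optional[str]]:
    """Split 'name@domain' into (name, domain); domain is None if unqualified."""
    name, sep, domain = principal.rpartition("@")
    if not sep:
        return principal, None
    return name, domain


class NameMap:
    """Admin-supplied equivalences between qualified names (hypothetical)."""

    def __init__(self) -> None:
        self._classes: List[Set[str]] = []  # disjoint sets of equivalent names

    def equate(self, *principals: str) -> None:
        """Declare all the given principals to be the same user."""
        merged = set(principals)
        keep = []
        for cls in self._classes:
            if cls & merged:
                merged |= cls  # fold any overlapping class into the new one
            else:
                keep.append(cls)
        keep.append(merged)
        self._classes = keep

    def translate(self, principal: str, target_domain: str) -> Optional[str]:
        """Return the equivalent name in target_domain, or None if unmapped."""
        for cls in self._classes:
            if principal in cls:
                for other in sorted(cls):
                    if split_principal(other)[1] == target_domain:
                        return other
        return None
```

Usage might look like this, with made-up principals:

```python
names = NameMap()
names.equate("alice@ASIAPAC", "asmith@unix.example")
names.translate("alice@ASIAPAC", "unix.example")
```

What the receiver does when translate() finds nothing (refuse, fall back to a numeric ID, ask the admin) is a policy question this sketch leaves open.
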
I want to be able to say:

  map "MAPOOL2@ASIAPAC" <-> "[EMAIL PROTECTED]" <-> "[EMAIL PROTECTED]"

when transferring across machines.

We probably cannot assume UIDs are any particular length; on NT they
correspond to SIDs, which are variable-length things, typically
represented by strings like

  S1-212-123-2323-232323

So on the whole I think I would suggest following NFSv4 and just using
strings, with the interpretation of them up to the implementation,
possibly with guidance from the admin.

> When textual names are used a special chunk in the
> datastream would specify a "node+ID -> name"
> equivalency immediately before the first use of that
> number.

It seems like in general there is a need to have a way of "interning"
strings (users, files, ...?) to shorter representations. On the other
hand, perhaps this is an overoptimization and just using compression,
at least at first, would be more sensible.

> In the tree scan stage the first time we hit a given
> inode with st_nlink > 1 we add it to a hardlink list and
> decrement st_nlink. Each time we find another path
> that references the inode we indicate it is a link
> in the datastream and decrement st_nlink of the one
> in our list. When the entry in the list has
> st_nlink == 0 we remove it from the list.

Yes, that's the right algorithm. It may need some refinement to be
safe with filesystems changing underneath us.

-- 
Martin
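
P.S. For what it's worth, here is a minimal Python sketch of the hardlink bookkeeping quoted above. The function names and the record layout are made up for illustration. One assumption of mine that is not in the quoted text: entries are keyed on (st_dev, st_ino) rather than the inode number alone, since inode numbers are only unique within one filesystem. It does nothing about the tree changing underneath the scan.

```python
import os
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical record for one scanned file: (path, st_dev, st_ino, st_nlink).
StatEntry = Tuple[str, int, int, int]


def stat_entries(root: str) -> Iterator[StatEntry]:
    """Walk root and yield an lstat-based record for every non-directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            yield path, st.st_dev, st.st_ino, st.st_nlink


def mark_hardlinks(
    entries: Iterable[StatEntry],
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (path, link_target).

    link_target is the first path seen for the same inode, or None if this
    path should be sent as ordinary data.
    """
    # (dev, ino) -> [first path seen, links not yet accounted for]
    pending: Dict[Tuple[int, int], List] = {}
    for path, dev, ino, nlink in entries:
        if nlink <= 1:
            yield path, None
            continue
        key = (dev, ino)
        seen = pending.get(key)
        if seen is None:
            # First sighting: remember it and count this path as one link.
            seen = pending[key] = [path, nlink - 1]
            yield path, None
        else:
            seen[1] -= 1
            yield path, seen[0]
        if seen[1] <= 0:
            # Every link has been found; the entry is no longer needed.
            del pending[key]
```

Links that live outside the scanned tree never bring the count to zero, so those entries stay in the table until the scan ends.
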