Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-05 Thread Craig Barratt
> > Is it possible to tell rsync to update the blocks of the target file
> > 'in-place' without creating the temp file (the 'dot file')?  I can
> > guarantee that no other operations are being performed on the file at
> > the same time.  The docs don't seem to indicate such an option.
>
> No, it's not possible, and making it possible would require a deep
> and fundamental redesign and re-implementation of rsync; the result
> wouldn't resemble the current program much.

I disagree.  An --inplace option wouldn't be too hard to implement.
The trick is that when --inplace is specified the block matching
algorithm (on the sender) would only match blocks at or after that
block's location (on the receiver).  No protocol change is required.
The receiver can then operate in-place since no matching blocks are
earlier in the file.  This could be relaxed to allow a fixed number
of earlier blocks, based on the knowledge the receiver will buffer
reads.  But that is more risky.  Caveat user: if you specify --inplace
and the source file has a single byte added to the beginning then the
entire file will be sent as literal data.
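
A rough sketch of the rule described above, in Python rather than rsync's C. It is not rsync code: the op format and the names make_inplace_safe and apply_inplace are made up for illustration, and it filters an already computed delta instead of changing the matcher itself (a real matcher could also look for a later copy of the same block).

```python
def make_inplace_safe(ops, new, old_len, bs):
    """Downgrade copies that an in-place, front-to-back receiver could not honour.

    ops: list of ("copy", old_block_index) or ("literal", bytes), in output order.
    new: the sender's copy of the new file, used to turn unsafe copies into literals.
    """
    safe, pos = [], 0
    for kind, arg in ops:
        if kind == "literal":
            safe.append((kind, arg))
            pos += len(arg)
            continue
        n = min(bs, old_len - arg * bs)      # the last old block may be short
        if arg * bs >= pos:                  # source block not yet overwritten
            safe.append((kind, arg))
        else:                                # source already clobbered: send the bytes
            safe.append(("literal", new[pos:pos + n]))
        pos += n
    return safe


def apply_inplace(buf, ops, bs):
    """Receiver side: rebuild the new file inside buf (a bytearray), front to back."""
    old_len, pos = len(buf), 0
    for kind, arg in ops:
        if kind == "copy":
            data = bytes(buf[arg * bs:min(arg * bs + bs, old_len)])
        else:
            data = arg
        buf[pos:pos + len(data)] = data
        pos += len(data)
    del buf[pos:]                            # truncate if the new file is shorter
    return buf


BS = 4
old = b"AAAABBBBCCCCDDDD"

# Data removed from the front: every copy moves a block towards the start,
# so all of them survive the filter.
shifted = [("copy", 1), ("copy", 2), ("copy", 3), ("literal", b"EEEE")]
assert make_inplace_safe(shifted, b"BBBBCCCCDDDDEEEE", len(old), BS) == shifted

# One byte added at the front: every copy would read a block that has just
# been overwritten, so the whole file goes out as literal data.
new = b"x" + old
prepended = [("literal", b"x")] + [("copy", i) for i in range(4)]
safe = make_inplace_safe(prepended, new, len(old), BS)
assert all(kind == "literal" for kind, _ in safe)
assert bytes(apply_inplace(bytearray(old), safe, BS)) == new
```

The second case is the caveat from the paragraph above: a one-byte insertion at the start defeats every match.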

Of course, a major issue with --inplace is that the file will be
in an intermediate state if rsync is killed mid-transfer.  Rsync
currently ensures that every file is either the original or new.

Another independent optimization would be to do lazy writes.  Currently,
if you specify -I (--ignore-times) the output file is written (to a tmp
file and then renamed) even if the contents are identical.  Instead,
creation of the tmp file could be delayed until the output file is
known to be different.  This is detected either by an out-of-sequence
block number from the sender, or any literal data.  If the file contains
only in-sequence block numbers and no literal data, then there is no
need to write anything.
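
The lazy-write test could be sketched like this (Python, illustrative only; the op format and the name needs_rewrite are assumptions, not rsync internals). The final length check is an addition to the rule above: a file that is a shortened prefix of the old one produces only in-sequence block numbers, yet still differs.

```python
def needs_rewrite(ops, old_block_count):
    """Return True as soon as the delta shows the output differs from the old file.

    ops: iterable of ("copy", old_block_index) or ("literal", bytes) in output order.
    """
    expected = 0
    for kind, arg in ops:
        if kind == "literal":
            if arg:                  # any literal data means new content
                return True
        elif arg != expected:        # out-of-sequence block number
            return True
        else:
            expected += 1
    # Assumption beyond the text above: fewer blocks than the old file had
    # means the file was truncated, so it still has to be written.
    return expected != old_block_count


assert needs_rewrite([("copy", 0), ("copy", 1), ("copy", 2)], 3) is False
assert needs_rewrite([("copy", 0), ("literal", b"x"), ("copy", 1)], 2) is True
assert needs_rewrite([("copy", 1), ("copy", 0)], 2) is True
```

In a streaming receiver the tmp file would be created at the first point where this test fires, after copying the blocks already seen.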

Craig
-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html



Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-05 Thread Ben Escoto
 CB == Craig Barratt [EMAIL PROTECTED]
 wrote the following on Wed, 05 Feb 2003 04:41:22 -0800

  CB> Of course, a major issue with --inplace is that the file will be
  CB> in an intermediate state if rsync is killed mid-transfer.  Rsync
  CB> currently ensures that every file is either the original or new.

I'm curious, how does it ensure this?


-- 
Ben Escoto






Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-05 Thread Paul Haas
On Wed, 5 Feb 2003, Craig Barratt wrote:

> Of course, a major issue with --inplace is that the file will be
> in an intermediate state if rsync is killed mid-transfer.  Rsync
> currently ensures that every file is either the original or new.

I hate silent corruption.  Much better to have things either work, or fail
in an obvious way.

I don't have a need for --inplace, so someone who does have a need
should say whether my reasoning makes sense in the real world.

For databases, I'd want the --inplace option to rename the file before it
starts changing it, and then rename it back.  With that mode, rsync would
ensure that every file is in one of three states:

1. original
2. new
3. gone (but not forgotten, since the intermediate state file
 would still be there, just with a temporary name)
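
A minimal sketch of that rename, modify, rename-back sequence, in Python. The function name, the modify callback and the ".rsync-inplace" suffix are all hypothetical; nothing here is an existing rsync option.

```python
import os


def update_with_rename(path, modify, suffix=".rsync-inplace"):
    """Rename the file out of the way, change it in place, then rename it back.

    modify is a callable that receives the open file object and performs the
    in-place writes. If it raises, the half-updated file keeps its temporary
    name, so nothing will open it under the real name by mistake.
    """
    work = path + suffix
    os.rename(path, work)        # state 3: the real name is gone from here on
    with open(work, "r+b") as f:
        modify(f)                # in-place writes; file holds mixed contents meanwhile
    os.rename(work, path)        # state 2: only reached if modify() completed
```

A caller that finds the temporary name left behind knows the transfer was interrupted and can rerun it.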

You wouldn't want the rename in a bunch of situations:
  on systems where renames are expensive
  where it doesn't hurt to have a mixture of old and new contents in a file
  for raw devices, unless you know how /dev really works on your system

This means that there should either be two --inplace options, like
--inplace-rename and --inplace-norename, or maybe --inplace-dangerous and
--inplace-more-dangerous, or a modifier option.  In many cases, writing
the documentation is more work than writing the code.

--
Paul Haas [EMAIL PROTECTED]





Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-05 Thread Mike Rubel

On Wed, 5 Feb 2003, Ben Escoto wrote:

> CB == Craig Barratt [EMAIL PROTECTED]
> wrote the following on Wed, 05 Feb 2003 04:41:22 -0800
>
>   CB> Of course, a major issue with --inplace is that the file will be
>   CB> in an intermediate state if rsync is killed mid-transfer.  Rsync
>   CB> currently ensures that every file is either the original or new.
>
> I'm curious, how does it ensure this?

During the copy, rsync writes to a temporary file in the same directory
(the temp file is hidden; it starts with a .).  Then, once the transfer
is done, it mv's that temp file over the original.  My understanding is
that mv is atomic under unix, so this action either happens in its
entirety or not at all.
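
The pattern looks roughly like this in Python (a sketch of the idea, not rsync's code; replace_atomically is a made-up name). os.replace maps to rename(2) on POSIX systems, which is what makes the switch atomic for other processes looking up the name.

```python
import os
import tempfile


def replace_atomically(path, data):
    """Write data to a hidden temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory, so the rename never has to cross filesystems.
    fd, tmp = tempfile.mkstemp(prefix=".", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)    # readers see the old file or the new one, never a mix
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)       # leave the original untouched on failure
        raise
```

Note that mkstemp creates the file with mode 0600; a real tool would also copy permissions and timestamps before the rename. The atomicity here concerns what other processes see, not what survives a system crash.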

Mike




Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-05 Thread Bennett Todd
2003-02-05T07:41:22 Craig Barratt:
> The trick is that when --inplace is specified the block matching
> algorithm (on the sender) would only match blocks at or after that
> block's location (on the receiver).

... and only when the source block in question remains unchanged in
the new file?

> No protocol change is required.

But some serious cleverness, resulting in lost opportunities for
reusing blocks (where the block is no longer available by the time
the receiver gets to that point in the file).

-Bennett






Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-05 Thread Ben Escoto
On Wed, Feb 05, 2003 at 09:17:03AM -0800, Mike Rubel wrote:
> > CB> Of course, a major issue with --inplace is that the file will be
> > CB> in an intermediate state if rsync is killed mid-transfer.  Rsync
> > CB> currently ensures that every file is either the original or new.
> >
> > I'm curious, how does it ensure this?
>
> During the copy, rsync writes to a temporary file in the same directory
> (the temp file is hidden; it starts with a .).  Then, once the transfer
> is done, it mv's that temp file over the original.  My understanding is
> that mv is atomic under unix, so this action either happens in its
> entirety or not at all.

Even if mv is atomic, how does rsync make sure that the move doesn't
happen before the last of the data is written to the tempfile?  Does
it explicitly fsync the tempfile, or is there some other way it knows?


-- 
Ben Escoto






Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-05 Thread jw schultz
On Wed, Feb 05, 2003 at 10:52:45AM -0800, Ben Escoto wrote:
> On Wed, Feb 05, 2003 at 09:17:03AM -0800, Mike Rubel wrote:
> > > CB> Of course, a major issue with --inplace is that the file will be
> > > CB> in an intermediate state if rsync is killed mid-transfer.  Rsync
> > > CB> currently ensures that every file is either the original or new.
> > >
> > > I'm curious, how does it ensure this?
> >
> > During the copy, rsync writes to a temporary file in the same directory
> > (the temp file is hidden; it starts with a .).  Then, once the transfer
> > is done, it mv's that temp file over the original.  My understanding is
> > that mv is atomic under unix, so this action either happens in its
> > entirety or not at all.
>
> Even if mv is atomic, how does rsync make sure that the move doesn't
> happen before the last of the data is written to the tempfile?  Does
> it explicitly fsync the tempfile, or is there some other way it knows?

Rsync neither fsyncs the files nor opens them with O_SYNC, so if
your system crashes before the data is flushed it is possible
that the file could be in a corrupted state on reboot.  If your
system crashes in the middle of an rsync you would probably want
to rerun the rsync anyway.  After all, unless you are only
syncing one file you would probably have an inconsistent
tree (some files up to date, some not).

Adding fsync or O_SYNC would not exactly help performance
much either.
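
For comparison, this is roughly what the extra work would be, sketched in Python under the assumption of a POSIX system such as Linux. durable_rename is a hypothetical helper, not something rsync provides. Each fsync call blocks until the kernel reports the data written out, which is where the performance cost comes from.

```python
import os


def durable_rename(tmp_path, final_path):
    """Flush a finished temp file to disk, rename it, then flush the directory."""
    fd = os.open(tmp_path, os.O_RDWR)
    try:
        os.fsync(fd)             # file contents reach stable storage first
    finally:
        os.close(fd)
    os.replace(tmp_path, final_path)
    # Syncing the containing directory makes the new name itself durable.
    # Opening a directory this way is a POSIX assumption; it fails on Windows.
    dir_fd = os.open(os.path.dirname(os.path.abspath(final_path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```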



-- 

J.W. Schultz            Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt



rsync in-place (was Re: rsync 1tb+ each day)

2003-02-04 Thread Bennett Todd
2003-02-04T14:29:48 Kenny Gorman:
> Is it possible to tell rsync to update the blocks of the target file
> 'in-place' without creating the temp file (the 'dot file')?  I can
> guarantee that no other operations are being performed on the file at
> the same time.  The docs don't seem to indicate such an option.

No, it's not possible, and making it possible would require a deep
and fundamental redesign and re-implementation of rsync; the result
wouldn't resemble the current program much.

Here's a sketch of the heart of the rsync algorithm (for finer
details, see the tech report available from [1]).

Let's call the two endpoints the sender (who has the newer version
of the file), and the receiver (who wants to update its older local
copy to match that on the sender).

The receiver computes checksums on each block of the destination
file, and streams them to the sender.

The sender finds all instances of any of those blocks in the source
file. Then the sender transmits instructions to the receiver,
describing how to build a spiffy new copy of the newer source file,
using a mixture of actual chunks of new contents, and blocks taken
from the older version of the file. The receiver follows these
instructions, copying blocks as needed from the old version and
combining them with the new bits to construct the new file. It's
then moved into place.

This algorithm by nature expects that the old version of the
destination file is used as a source for taking blocks, in building
the new version. Adjusting this algorithm to work in-place is
non-trivial.
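
A toy version of the three steps above, in Python. This is a simplified illustration, not the real implementation: it uses one MD5 digest per block where rsync pairs a cheap rolling checksum with a strong one, it ignores the short final block when matching, and the function names and op format are made up.

```python
import hashlib


def signatures(old, bs):
    """Receiver: digest of every full block of the old file, mapped to its index."""
    return {hashlib.md5(old[i:i + bs]).digest(): i // bs
            for i in range(0, len(old) - bs + 1, bs)}


def delta(new, sigs, bs):
    """Sender: describe the new file as copies of old blocks plus literal bytes."""
    ops, lit, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + bs]
        idx = sigs.get(hashlib.md5(chunk).digest()) if len(chunk) == bs else None
        if idx is None:
            lit.append(new[pos])         # no match at this offset: slide one byte
            pos += 1
        else:
            if lit:
                ops.append(("literal", bytes(lit)))
                lit = bytearray()
            ops.append(("copy", idx))    # the block may come from anywhere in old
            pos += bs
    if lit:
        ops.append(("literal", bytes(lit)))
    return ops


def patch(old, ops, bs):
    """Receiver: build the new file in a separate buffer, reading blocks from old."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * bs:(arg + 1) * bs] if kind == "copy" else arg
    return bytes(out)


old = b"the quick brown fox jumps over the lazy dog"
new = b"XX" + old
ops = delta(new, signatures(old, 4), 4)
assert patch(old, ops, 4) == new
```

patch() reads from old while writing to a separate buffer, and the copies may refer to any block of old in any order. That is the property an in-place update has to give up.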

-Bennett

[1] http://samba.anu.edu.au/rsync/tech_report/






Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-04 Thread jw schultz
On Tue, Feb 04, 2003 at 02:37:26PM -0500, Bennett Todd wrote:
> 2003-02-04T14:29:48 Kenny Gorman:
> > Is it possible to tell rsync to update the blocks of the target file
> > 'in-place' without creating the temp file (the 'dot file')?  I can
> > guarantee that no other operations are being performed on the file at
> > the same time.  The docs don't seem to indicate such an option.
>
> No, it's not possible, and making it possible would require a deep
> and fundamental redesign and re-implementation of rsync; the result
> wouldn't resemble the current program much.
>
> Here's a sketch of the heart of the rsync algorithm (for finer
> details, see the tech report available from [1]).
>
> Let's call the two endpoints the sender (who has the newer version
> of the file), and the receiver (who wants to update its older local
> copy to match that on the sender).
>
> The receiver computes checksums on each block of the destination
> file, and streams them to the sender.
>
> The sender finds all instances of any of those blocks in the source
> file. Then the sender transmits instructions to the receiver,
> describing how to build a spiffy new copy of the newer source file,
> using a mixture of actual chunks of new contents, and blocks taken
> from the older version of the file. The receiver follows these
> instructions, copying blocks as needed from the old version and
> combining them with the new bits to construct the new file. It's
> then moved into place.
>
> This algorithm by nature expects that the old version of the
> destination file is used as a source for taking blocks, in building
> the new version. Adjusting this algorithm to work in-place is
> non-trivial.

The reason why in-place updating is difficult is that
rsync expects the unchanged blocks in the old file may be
relocated.  Data inserted into or removed from the file does
not require the rest of the file to be retransmitted.
Unchanged blocks will be copied from the old locations in
the old file to new locations in the new file.

In-place updating requires that blocks not relocate.
It may be possible by disallowing matches having differing
offsets.  That would require deeper investigation.


-- 

J.W. Schultz            Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt



Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-04 Thread Martin Pool
On  4 Feb 2003, jw schultz [EMAIL PROTECTED] wrote:

> The reason why in-place updating is difficult is that
> rsync expects the unchanged blocks in the old file may be
> relocated.  Data inserted into or removed from the file does
> not require the rest of the file to be retransmitted.
> Unchanged blocks will be copied from the old locations in
> the old file to new locations in the new file.
>
> In-place updating requires that blocks not relocate.
> It may be possible by disallowing matches having differing
> offsets.  That would require deeper investigation.

Of course the other place where people want this is for transfers of
block devices, where the rename is just not possible.

I looked a little at doing this in librsync.  The naive solution is to
merely prohibit the delta from referring to blocks that have been
already overwritten.  I will probably eventually add at least this
option.

You might try this in rsync.  A lot of other code to do with
e.g. setting permissions makes the assumption of the rename model,
though.  It would take a fair amount of testing.

Of course this model really falls down in some cases.  Consider the
case of one block inserted at the beginning.  Then with the naive no
backreferences approach every block will be overwritten just before
it's needed. :( 

You can imagine a smarter algorithm that does non-sequential writes to
the output so as to avoid writing over blocks that will be needed
later.  Alternatively, if you assume some amount of temporary storage,
then it might be possible to still produce output as a stream.

Really for your problem the practical solution is just to dump the
whole file, perhaps allowing for sparse blocks.  As other people have
observed, by design rsync does a lot more disk IO than network.

-- 
Martin 



Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-04 Thread jw schultz
On Wed, Feb 05, 2003 at 12:47:49PM +1100, Martin Pool wrote:
> On  4 Feb 2003, jw schultz [EMAIL PROTECTED] wrote:
>
> > The reason why in-place updating is difficult is that
> > rsync expects the unchanged blocks in the old file may be
> > relocated.  Data inserted into or removed from the file does
> > not require the rest of the file to be retransmitted.
> > Unchanged blocks will be copied from the old locations in
> > the old file to new locations in the new file.
> >
> > In-place updating requires that blocks not relocate.
> > It may be possible by disallowing matches having differing
> > offsets.  That would require deeper investigation.
>
> Of course the other place where people want this is for transfers of
> block devices, where the rename is just not possible.
>
> I looked a little at doing this in librsync.  The naive solution is to
> merely prohibit the delta from referring to blocks that have been
> already overwritten.  I will probably eventually add at least this
> option.
>
> You might try this in rsync.  A lot of other code to do with
> e.g. setting permissions makes the assumption of the rename model,
> though.  It would take a fair amount of testing.

I certainly am not interested in coding it.  Too small a
target use.

> Of course this model really falls down in some cases.  Consider the
> case of one block inserted at the beginning.  Then with the naive no
> backreferences approach every block will be overwritten just before
> it's needed. :(

I was thinking more in terms of no block relocation at all.
Checksums only match if at the same offset.  The receiver simply
discards (or never gets) info about blocks that are
unchanged.  It would just lseek and write with a possible
truncate at the end.
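
The same-offset scheme described above might look roughly like this in Python. It is a local sketch with made-up names: the digest list stands in for what the receiver would send over the wire, MD5 is an arbitrary choice of checksum, and no such mode is claimed to exist in rsync.

```python
import hashlib
import os


def block_digests(path, bs):
    """Receiver: one digest per block of the existing file, in order."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(bs)
            if not block:
                break
            digests.append(hashlib.md5(block).digest())
    return digests


def same_offset_delta(src_path, dst_digests, bs):
    """Sender: compare block i only with the receiver's block i; keep the mismatches."""
    changes = []
    with open(src_path, "rb") as f:
        i = 0
        while True:
            block = f.read(bs)
            if not block:
                break
            if i >= len(dst_digests) or hashlib.md5(block).digest() != dst_digests[i]:
                changes.append((i * bs, block))
            i += 1
    return changes, os.path.getsize(src_path)


def apply_same_offset(dst_path, changes, final_size):
    """Receiver: seek and write each changed block, then truncate to the new length."""
    with open(dst_path, "r+b") as f:
        for offset, data in changes:
            f.seek(offset)
            f.write(data)
        f.truncate(final_size)
```

Unchanged blocks are never read back or rewritten on the receiver, and no temp file is needed. The price is that any insertion or deletion that shifts data makes every later block mismatch.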

> You can imagine a smarter algorithm that does non-sequential writes to
> the output so as to avoid writing over blocks that will be needed
> later.  Alternatively, if you assume some amount of temporary storage,
> then it might be possible to still produce output as a stream.

I really doubt it is worthwhile doing to rsync.  This
principally applies to block-oriented files such as devices
and database files.  For the most part rsync handles these
fine.

If someone really does feel they must have this, I'd suggest
creating a different tool just for this job.  It could
operate on just one file at a time and either be smart
enough to suss the optimal block size by knowing the file
type or accept a block size from the command-line.  Block
sizes for this would clearly be of the power-of-two variety.

-- 

J.W. Schultz            Pegasystems Technologies
email address:  [EMAIL PROTECTED]

Remember Cernan and Schmitt



Re: rsync in-place (was Re: rsync 1tb+ each day)

2003-02-04 Thread Eric Whiting
jw schultz wrote:
>
> I was thinking more in terms of no block relocation at all.
> Checksums only match if at the same offset.  The receiver simply
> discards (or never gets) info about blocks that are
> unchanged.  It would just lseek and write with a possible
> truncate at the end.

This would seem to help a lot on larger database files. Why look at a
700-byte block of data from a source file and try to find a matching
block by fully scanning block checksums at all offsets in an 8G
destination datafile? And then do it again for every 700 bytes? (I
read the rsync technical paper -- but I might be confused)

In the case of Oracle data files the only place a meaningful/syncable
delta will occur is at the same offset. Yes this is a special case --
but it has the potential to really help in rsyncing oracle datafiles
during a hotbackup or when syncing from a snapshot to nearstore storage.
This approach should be faster than the -W option for very large Oracle
datafiles (which often have only a small number of changed blocks). It
should also be faster than deleting the destination files and resending
(-W), as has been suggested.

> > You can imagine a smarter algorithm that does non-sequential writes to
> > the output so as to avoid writing over blocks that will be needed
> > later.  Alternatively, if you assume some amount of temporary storage,
> > then it might be possible to still produce output as a stream.
>
> I really doubt it is worthwhile doing to rsync.  This
> principally applies to block-oriented files such as devices
> and database files.  For the most part rsync handles these
> fine.

agreed. 

The original post still raises an interesting issue -- it should not be
faster to remove destination files before running rsync. That is counter
to one of the main purposes of rsync  -- efficiently detect and send
only the deltas. 

eric