Here is how I think an algorithm could work. It requires no changes
on the remote side, so it should work with the plain rsync protocol,
but it will likely require some changes to rsyncp.

Let:
  B           be the BackupPC_dump process
  P           be the peer (PC to be backed up)
  R           be the remote rsync process
  F           be a file on P that backuppc wants to save

  Gc1, Gc2    be two compressed files in the same chain in the server pool
  G1, G2      be their uncompressed data
  F[x], G1[x] be the xth block of F or G1 respectively

  HR(b)     be the rsync rdelta-hash of block b
  HL(b)     be backuppc's hash-for-pooling of block b



0. B has already contacted P, and R has been spawned; some files may
   already have been transferred. B now arrives at F and determines
   that it needs to transfer the file (i.e. it's new, or its
   attributes have changed).

1. Since we need the first 8 blocks to determine the pool chain, we
   just receive the first 8 blocks. Depending on the protocol, this
   possibly requires us to ask for the hashes first, and to pretend
   that each hash simply does not match our (fake) local hashes:

    newfile = [];   // array of blocks

    for( x=0 ; x < 8 ; ++x ) {

      tmphash = recv( HR( F[x] ));
      if tmphash == IMPOSSIBLE_HASH { explode(); }
      newfile += recv( F[x] );
    }

2. Okay, now we have the first 8 blocks and can determine the pool
   chain, if there is one:

    localfiles = get_pool_files( newfile[0], newfile[7] )
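As a concrete illustration, the pool lookup in step 2 might be
sketched in Python as below. Note this is a hypothetical sketch:
BackupPC's real pool digest also mixes in the file length, and the
pool path layout and chain-suffix naming here are assumptions, not
the exact on-disk format. POOL_DIR and both function names are
invented for this example.

```python
import hashlib
import os

POOL_DIR = "/var/lib/backuppc/cpool"  # hypothetical pool location

def pool_digest(first_blocks):
    """Digest the concatenated first blocks to select a pool chain.
    (The real BackupPC hash also incorporates the file length; this
    sketch uses the block data alone.)"""
    return hashlib.md5(b"".join(first_blocks)).hexdigest()

def get_pool_files(first_blocks):
    """Return the paths of every file in the matching pool chain,
    assuming collisions are stored as <digest>, <digest>_0, ..."""
    d = pool_digest(first_blocks)
    base = os.path.join(POOL_DIR, d[0:2], d[2:4], d)
    chain, n = [], 0
    while True:
        path = base if n == 0 else "%s_%d" % (base, n - 1)
        if not os.path.exists(path):
            break
        chain.append(path)
        n += 1
    return chain
```

The returned list plays the role of localfiles (G1, G2, ...) in the
pseudocode above, after decompression.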

3. And now iteratively ask for hashsums from the remote, compare
   them to the corresponding blocks in the pool files (which may be
   cached!), and assemble the new file:

    x = 8
    while !F.eof() {

      tmphash = recv( HR( F[x] ));
      tmpblock = "";

      for G in localfiles {  // iterate G1, G2, …
        if tmphash == HR( G[x] ) {
          tmpblock = G[x];
          break;
          /* possible optimisation: replace localfiles with just
           * localfile so that for successive blocks, not all local
           * files have to be compared.
           */
        }
      }

      if !tmpblock {
        tmpblock = recv( F[x] );
      }

      newfile += tmpblock;
      ++x;

    }
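The matching loop of step 3, including the candidate-narrowing
optimisation from the comment, could be sketched in Python roughly as
follows. The recv_hash/recv_block callbacks and the hr function are
stand-ins for the rsyncp protocol and the rdelta hash; all names here
are invented for the sketch, and candidates is assumed to already
hold the uncompressed pool files as lists of blocks.

```python
def assemble(recv_hash, recv_block, hr, candidates, nblocks, start=8):
    """Assemble blocks start..nblocks-1 of F: take each block from a
    matching local candidate when its hash matches the remote hash,
    otherwise fetch the block itself from the remote.
    candidates maps file name -> list of blocks."""
    out = []
    for x in range(start, nblocks):
        want = recv_hash(x)              # HR(F[x]) from the remote
        block = None
        for name, blocks in list(candidates.items()):
            if x < len(blocks) and hr(blocks[x]) == want:
                block = blocks[x]
                # optimisation from the pseudocode comment: keep only
                # the file that matched, so later blocks are compared
                # against a single candidate
                candidates = {name: blocks}
                break
        if block is None:
            block = recv_block(x)        # receive F[x] over the wire
        out.append(block)
    return out
```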

4. At the very end, determine whether the new data is identical to
   existing pool data and, if so, hardlink to it. This can either be
   done with a full-file hashsum (to be safe), or the above algorithm
   can keep track of it while running.
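The full-file-hashsum variant of step 4 might look like this minimal
Python sketch (function names are invented; a real implementation
would also need to honour BackupPC's pool compression and link-count
limits):

```python
import hashlib
import os

def file_md5(path):
    """Full-file hashsum, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def link_or_keep(newfile, poolfile):
    """If newfile's content matches poolfile's, replace newfile with
    a hardlink into the pool and return True; otherwise keep the new
    copy and return False."""
    if os.path.exists(poolfile) and file_md5(newfile) == file_md5(poolfile):
        os.unlink(newfile)
        os.link(poolfile, newfile)
        return True
    return False
```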



If this logic cannot be implemented inside of or atop rsyncp, then
maybe the following logic would work:

1. Create a buffer in memory and fill it with zeroes;

2. Pass this buffer to rsyncp and ask it to synchronise F into it;

3. After 8×128k blocks have been determined:

   a. determine if there is a corresponding pool file;

   b. if yes, uncompress it and write it into the buffer;

4. Let rsyncp continue, with a potentially modified buffer.
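The buffer-priming fallback could be sketched like this, with a
hypothetical pool_lookup callback standing in for steps 3a/3b and the
on_blocks_known hook standing in for whatever notification rsyncp
could provide after the first 8x128k blocks; every name here is an
assumption, not rsyncp API:

```python
BLOCK = 128 * 1024  # 128k blocks, as in step 3 above

def primed_buffer(filesize, pool_lookup):
    """Allocate a zero-filled buffer for F (step 1). Once enough of F
    is known to identify a pool file, pre-fill the buffer with that
    file's uncompressed data (step 3b) so the delta algorithm finds
    matching blocks locally instead of fetching them."""
    buf = bytearray(filesize)            # step 1: zero-filled buffer
    def on_blocks_known(first8):         # called after 8 x BLOCK bytes
        path = pool_lookup(first8)       # step 3a: find a pool file
        if path is not None:
            with open(path, "rb") as f:  # step 3b: copy it in
                data = f.read(filesize)
            buf[:len(data)] = data
    return buf, on_blocks_known
```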


Thoughts?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/
 
"i've got a bike,
 you can ride it if you like,
 it's got a basket. a bell that rings, and things to make it look good.
 i'd give it to you if i could,
 but i borrowed it."
                                                  -- syd barrett, 1967
 
spamtraps: madduck.bo...@madduck.net


_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
