Hi,

Les Mikesell wrote on 16.05.2007 at 13:55:04 [Re: [BackupPC-users] Advice on 
BackupPC]:
> Vetch wrote:
> > I have a two site network [...] Our bandwidth is limited [...]
> > I want to backup my data from one site to the other...
> > In order to assess whether that would be do-able, I went to an 
> > exhibition of backup technologies.
> > One that caught my eye was a company called Data Domain, who claimed to 
> > de-duplicate data at the block level of 16KB chunks...
> > Apparently, all they send are the changed chunks and the schema to 
> > retrieve the data.
> 
> Backuppc can use rsync to transfer the data.  Rsync works by reading 
> through the file at both ends, exchanging block checksums to find the 
> changed parts.

the important part about this is that rsync compares a file with the version
in the reference backup (the last incremental of a lower level, or the full
backup). Consequently, a new file will be transferred in full even if an
identical file already exists in the pool. De-duplication happens at the file
level, after the transfer.
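
To picture the pooling part: something along the lines of the following Python
sketch (not BackupPC's actual code - BackupPC is written in Perl and its pool
uses its own digest scheme and collision handling - but the principle is the
same):

import hashlib, os

POOL_DIR = "/var/lib/backuppc/pool"   # assumed pool location, just for this sketch

def pool_file(path):
    # Hash the full file contents; identical files get identical digests.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    pool_path = os.path.join(POOL_DIR, h.hexdigest())

    if os.path.exists(pool_path):
        # Same content is already in the pool: keep only one copy on disk.
        os.unlink(path)
        os.link(pool_path, path)
    else:
        # New content: add this file to the pool.
        os.link(path, pool_path)

Note that this runs after the transfer, which is exactly why a renamed but
otherwise identical file still crosses the wire once.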

As far as I know, rsync uses 2KB chunks of the file, so in some cases you may
need to transfer less data than with 16KB chunks. On the other hand, more
checksums need to be transferred in the general case. rsync incremental
backups take file attributes into account (modification time, permissions
etc.) and only transfer apparently changed files, using block checksums just
as full backups do.
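
In case it helps to picture it, the block comparison works roughly like this
(a simplified Python sketch: the 2KB block size is just my assumption from
above, and real rsync additionally uses a rolling weak checksum so that
matches can start at any byte offset, not only at block boundaries):

import hashlib

BLOCK = 2048  # assumed rsync-like block size; real rsync chooses it per file

def block_digests(path, block=BLOCK):
    # Strong checksum of every block of the reference (old) file.
    sums = set()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            sums.add(hashlib.md5(chunk).digest())
    return sums

def estimate_transfer(old_path, new_path, block=BLOCK):
    # Count how much of the new file consists of blocks already present in
    # the old file; the rest would have to be sent as literal data.
    old_sums = block_digests(old_path, block)
    matched = literal = 0
    with open(new_path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            if hashlib.md5(chunk).digest() in old_sums:
                matched += len(chunk)
            else:
                literal += len(chunk)
    return matched, literal

# e.g. matched, literal = estimate_transfer("dump_monday.bak", "dump_tuesday.bak")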

> > Does it send the changed data down the line and then check to see if it 
> > already has a copy, or does it check then send?

In general, it sends the data and then checks (on the fly, without creating a
temporary copy for existing files). With rsync, it is possible to cut down
bandwidth requirements by comparing against the previous version of the
respective file.

> > The other thing is, can BackupPC de-duplicate at the block level or is 
> > it just file level?
> > I'm thinking that block level might save considerable amounts of 
> > traffic, because we will need to send file dumps of Exchange databases 
> > over the wire...
> > ... Which I assume will mean that we've got about 16GB at least to copy 
> > everyday, since it'll be creating a new file daily...

File level. That means you'll have a new file every day. Unless you happen
to have other files with identical contents, pooling won't gain you anything
for these files, though compression might.

> > On the other hand, would 16KB blocks be duplicated that regularly - I 
> > imagine there is a fair amount of variability in 16KB of ones and zeros, 
> > and the chances of them randomly reoccurring without being part of the 
> > same file, I would say are slim...

Well, for your database dumps the blocks wouldn't be recurring randomly, would
they? If you've got multiple copies of a 16GB database file and each differs
from the previous one only by a few MB, that leaves a lot of identical blocks.
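
A quick back-of-the-envelope calculation (the 100MB of daily change is, of
course, just a guess):

dump_size  = 16 * 2**30   # 16GB dump
block_size = 16 * 2**10   # 16KB blocks
changed    = 100 * 2**20  # assume ~100MB of scattered changes per day

total_blocks   = dump_size // block_size        # 1,048,576 blocks
changed_blocks = 2 * (changed // block_size)    # ~12,800, counting boundary blocks twice
print("identical blocks: %.1f%%" % (100.0 * (1 - changed_blocks / total_blocks)))
# -> identical blocks: 98.8%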

Considering we're talking about a Micro$oft product, I wouldn't bet on the
dump format being especially convenient, though. They've probably got a
variable-length header format just for the sake of defeating block-level
de-duplication strategies :-).

> > What do you think?
> 
> I think rsync will do it as well as it can be done.

For the transfer: yes - if the database dumps are always stored in the same
file. If you have a new file name each day (including the date, for
instance), then rsync won't help you at all.
For storage, the transfer method is irrelevant.

> You can test the transfer efficiency locally first to get an idea of how 
> well the common blocks are handled.

Correct. You can do this for single files (the database dumps) or for the
whole file tree you want to back up. For your database dumps, rsync should
also give you a hint as to how much block-level de-duplication could save you:
if rsync can't speed up the transfer, de-duplication likely won't save any
disk space either.
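
If you want to script that test, something like the following works (just a
convenience wrapper around rsync's --stats output; the file names are
placeholders): copy yesterday's dump to a scratch location, then let rsync
update that copy from today's dump and see how much literal data it actually
had to send.

import subprocess

def rsync_delta_stats(old_copy, new_file):
    # --no-whole-file forces the delta algorithm even for a local transfer
    # (rsync normally skips it when source and destination are on the same host).
    out = subprocess.run(
        ["rsync", "-a", "--no-whole-file", "--stats", new_file, old_copy],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.startswith(("Literal data", "Matched data", "Total bytes sent")):
            print(line)

# hypothetical example:
# rsync_delta_stats("/tmp/exchange_dump_yesterday.bak", "/backups/exchange_dump_today.bak")

"Matched data" is roughly what block-level de-duplication could save you;
"Literal data" has to be sent (and stored) in any case.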


BackupPC is not difficult to set up. You could simply test how well it works
for you before deciding to spend money on a commercial product. BackupPC has
its limits, which may make a commercial product the better choice for you.
But then, the commercial product probably has its limits too, and the question
is whether they are as well documented. If it's only about the block-level
de-duplication, disk space might be cheaper than software.

Regards,
Holger

P.S.: With LVM snapshots, the problem is likewise that de-duplication takes
      place at the file level.
