David Brown writes:

> I've been using backuppc for several days, and I really like the concept
> behind it.  The web interface is very helpful.  However, I'm having a very
> hard time figuring out what to store the backup filesystem on.
> 
> I've tried both XFS and ReiserFS, and both have utterly abysmal performance
> in the backup tree.  The problem has to do with the hardlinked trees.
> 
>   - Most filesystems optimize directories by using inodes that are stored
>     near one another for files in the same directory.  This allows access
>     to files in the same directory to be localized on the disk.
> 
>   - BackupPC creates the files in the backup directory, and then hardlinks
>     them, by hash, into the pool.  This means that each of the entries in
>     a pool directory has an inode (and data) on a diverse part of the disk.
>     Just statting the files in a pool directory is very slow.  'du' of the
>     pool directory takes several hours on any filesystem I've tried it on.
> 
>   - Other than the first backup directory, the backup directories aren't
>     much better, since most of the files are hardlinks back to the pool.

You're exactly right.  A major performance limitation of BackupPC
is that backup directories tend to have widely dispersed inodes.
Yes, just stat()ing files in a single directory involves lots of
disk seeks.

A custom BackupPCd client is being developed, and once it is
ready I'm curious to see whether sorting the readdir results by
inode number on the server will improve performance.
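
Roughly, the idea is: read the whole directory first, sort the
entries by inode number, and only then stat() them, so the inode
table is visited in something close to disk order.  A quick sketch
of that idea (Python here rather than BackupPC's Perl, purely
illustrative, and assuming ascending inode numbers roughly track
disk layout):

    import os, sys

    def stat_sorted_by_inode(dirpath):
        # scandir() reports each entry's inode number straight from the
        # directory itself, without touching the far-away inode blocks.
        entries = sorted(os.scandir(dirpath), key=lambda e: e.inode())
        results = []
        for e in entries:
            # Now hit the inodes in roughly ascending disk order.
            st = os.lstat(e.path)
            results.append((e.name, st.st_ino, st.st_nlink, st.st_size))
        return results

    if __name__ == "__main__":
        for name, ino, nlink, size in stat_sorted_by_inode(sys.argv[1]):
            print(ino, nlink, size, name)

Whether inode order really matches disk order depends on the
filesystem, so this needs benchmarking rather than taking it on
faith.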

> So my question is twofold:
> 
>   - Is anyone aware of a Linux filesystem that can handle this kind of
>     usage behavior without massive thrashing?
> 
>   - How difficult would it be to change the way that backups are done?
>     Instead of hardlinking everything, keep the backup trees as a virtual
>     concept.  The result could be stored either in some kind of database,
>     or even just a series of indexed flat files.  If properly built and
>     indexed, these should be searchable just as easily as the tree.  In
>     fact, the browser for restore can't look at the trees exclusively,
>     anyway, because of incremental backups.
> 
>     If the pool files were created in the proper place initially, the
>     per-backup tree wouldn't need to exist on disk at all.  (Which, BTW,
>     means that they can't be created first and then moved into place; the
>     checksum has to be known before the file is even created.)
> 
> I guess I'll spend some time studying the code to see if this kind of
> concept is even plausible with the current code.

The biggest issue is maintaining accurate reference counts so you
know when unused pool files can be deleted.  The hardlink structure
uses the file system itself to maintain those reference counts.
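
In other words, st_nlink *is* the reference count.  A minimal sketch
of what the cleanup amounts to (Python for illustration only; the
real cleanup is Perl and has more to handle, such as hash collision
chains):

    import os

    def prune_pool(pool_dir):
        # st_nlink is the number of directory entries pointing at an inode.
        # Once it drops to 1, only the pool itself still refers to the
        # file, so no backup needs it and it can be removed.
        removed = 0
        for root, dirs, files in os.walk(pool_dir):
            for name in files:
                path = os.path.join(root, name)
                if os.lstat(path).st_nlink == 1:
                    os.unlink(path)
                    removed += 1
        return removed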

There has been some consideration of using an RDBMS to maintain
the reference counts, but no benchmarking has been done.  The
tables would grow very large, and it's hard to imagine the RDBMS
needing any fewer disk seeks, since they certainly won't fit in
memory.
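
For reference, what that would look like is just an explicit count
per pool hash, e.g. (a hypothetical sqlite sketch, not anything
BackupPC actually does):

    import sqlite3

    db = sqlite3.connect("poolrefs.db")
    db.execute("""CREATE TABLE IF NOT EXISTS pool (
                      hash     TEXT PRIMARY KEY,
                      refcount INTEGER NOT NULL DEFAULT 0
                  )""")

    def add_reference(file_hash):
        # Called when a backup stores a file with this content hash.
        db.execute("INSERT OR IGNORE INTO pool (hash) VALUES (?)",
                   (file_hash,))
        db.execute("UPDATE pool SET refcount = refcount + 1 WHERE hash = ?",
                   (file_hash,))

    def drop_reference(file_hash):
        # Called when a backup containing this file expires.
        db.execute("UPDATE pool SET refcount = refcount - 1 WHERE hash = ?",
                   (file_hash,))

    def deletable_hashes():
        # Pool files no backup references any more.
        return [row[0] for row in
                db.execute("SELECT hash FROM pool WHERE refcount <= 0")]

A single full backup of a large host means millions of such row
updates, which is why it's hard to see the database coming out
ahead on seeks.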

But the concept is worthy of further consideration.

Craig

