Thanks for the replies.

On Fri, Aug 14, 2009 at 5:05 PM, Andrew
Sackville-West<and...@farwestbilliards.com> wrote:
>>
>> du worked pretty well with rdiff-backup, but is very problematic with
>> a large number of hardlink-based snapshots, which each have a complete
>> "copy" of a massive filesystem (rather than just info on which files
>> changed).
>
> but they're not copies, they're hardlinks. I guess I don't understand
> the problem.
> [...]

I understand that (actually it's the whole point of using a
hardlink-based snapshot system :-)). What I meant by "massive
filesystem" was the huge number of additional filesystem entries that
get created for every snapshot. That causes major problems for
utilities like du and locate, which need to walk the entire
filesystem. That's why I've been writing scripts to "prune" files, so
that there are fewer filesystem entries for those tools to walk.
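For illustration, the pruning is roughly along these lines (the
layout, retention count and paths here are made up for the example;
the real scripts are messier):

    #!/bin/sh
    # Simplified sketch: assumes snapshots live under
    # /backup/<server>/<YYYY-MM-DD>/ and keeps only the newest 30 per
    # server (the negative count needs GNU head).
    for server in /backup/*/; do
        ls -1d "${server}"*/ | sort | head -n -30 | \
        while read -r old; do
            # snapshots are hardlinked, so this only frees blocks
            # unique to the deleted snapshot
            rm -rf "$old"
        done
    done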

>
> If you are using hardlinks, and nice discrete directories for each
> machine, then a machine that has infrequent changes will not use a lot
> of space because the files don't change.
> [...]

Thanks, I understand that too. My problem, however, is that all of
the backups (for all the servers) live on a single LVM volume. When
that volume fills up, I need to run a tool like 'du' to see where
space can be reclaimed, and that no longer works nicely (it takes days
to finish and produces huge, multi-GB output files). From experience I
know which servers tend to have the most disk usage "churn", and
lately I've been removing those servers' oldest snapshots to recover
space, but I'd like to be able to run 'du' effectively rather than
relying on hunches.
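
One thing that might at least make the per-server runs more
informative (just a sketch, assuming a hypothetical
/backup/<server>/<date>/ layout): within a single invocation GNU du
counts each inode only once, so listing all of one server's snapshots
in one command shows the oldest snapshot at full size and each later
one at roughly the space unique to it:

    du -sch /backup/fileserver1/*/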

>
> you should be able to look at the difference between disk usage over
> different time periods and figure out your "burn rate".
> [...]

Thanks for those ideas as well (I had considered this too). However,
it still doesn't let me use tools like du and locate nicely, because
of the huge number of filesystem entries (and I really want to be able
to use those tools, or at least du, to check where hard drive space is
actually being used). Again, that's why I'm having to consider
pruning-type approaches (which feel like an awful hack, but I don't
know of a better method at this time).
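
(That said, logging the volume's usage over time is cheap enough to do
regardless; something like this line in root's crontab is the sort of
thing I have in mind, with a hypothetical mount point and log file:)

    # log the backup volume's usage daily; the growth between lines
    # gives an approximate burn rate
    0 6 * * * df -P /backup >> /var/log/backup-usage.log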

> I suspect I'm telling you stuff you already know and I apologize if I
> appear condescending. The odds are you probably know more about
> backups than I do. hth.

Nah, it's fine :-). Better too many ideas than assuming I'm already
aware of all the possible options and leaving out something that might
have been useful.

On Fri, Aug 14, 2009 at 5:49 PM, Alan
Chandler<a...@chandlerfamily.org.uk> wrote:
> Andrew Sackville-West wrote:
> I'm not sure I understood what you are after either.  Admittedly on a rather
> small home server, I use the cp -alf command to have only changed files kept
> for a long time

Thanks for those cron and script entries. I could probably use
something like that too (keep X dailies, Y weeklies, Z monthlies,
etc.), and it would save more hard drive space. But managing
generations of backups to conserve hard drive space (while still
keeping some really old backups) isn't really the problem. The problem
is:

1) The source servers (fileservers, etc.) have millions of files
each, and there are a couple of servers like this.

2) The hardlink snapshot process duplicates the entire filesystem
structure (for each of those servers) on the backup server every time
it runs.

3) The backup server therefore ends up with vastly more filesystem
entries (roughly the number of snapshots times the number of source
files) than any of the servers actually being backed up.

The filesystem itself can cope with (3), but utilities like 'du' and
'updatedb' become almost unusable. That's the main problem.
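
To put rough (purely illustrative) numbers on it: a server with 2
million files and 90 retained snapshots leaves around 180 million
directory entries to walk for that one server alone. A quick sanity
check of the scale, again assuming a hypothetical
/backup/<server>/<date>/ layout, would be something like:

    # entries in a single snapshot
    find /backup/fileserver1/2009-08-14 | wc -l

    # number of snapshots currently kept for that server
    ls -1d /backup/fileserver1/*/ | wc -l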

Basically, the problem isn't that I don't know how to use rsync, cp,
etc. to make the backups, manage generations, and so on. The problem
is the sheer size of the resulting filesystem (mostly in hardlinks,
and to a lesser extent actual directories) that the hardlink snapshot
approach produces, as opposed to something like rdiff-backup, which
only stores the differences between generations.
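
(For the updatedb side of it, one workaround I'm considering is simply
excluding the backup tree from locate's database, along these lines in
/etc/updatedb.conf; the exact variable depends on which locate
implementation is installed, and /backup is a hypothetical path:)

    PRUNEPATHS="/tmp /var/spool /media /backup"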

On Sat, Aug 15, 2009 at 4:35 AM, Rob Owens<row...@ptd.net> wrote:
>>
> You might want to check out BackupPC.  It uses hardlinks and
> compression.  It's got a web-based GUI which makes it pretty easy to
> find statistics on disk space used, on a per-server basis.
>

I've been researching the various other, more integrated backup
solutions (amanda, bacula, etc), but I have two main problems with
them:

1) They are too over-engineered/complex for my liking, and the docs
are hard to follow. I prefer simple command-line tools like rsync that
I can script, and I don't really want to have to install their
tool-specific services everywhere on the network if I can avoid it.

2) I can't find much information on how most of them actually store
the backed-up data, so they could well have the same problem, or other
issues that I wouldn't be able to work around if I committed to one of
them.

Thanks for the BackupPC suggestion. I'd heard of it before (and
passed it over in favor of rdiff-backup at the time), but I'll give
it a closer look this time. The features page looks promising:

http://backuppc.sourceforge.net/info.html

-David

