Problems with making hardlink-based backups

2009-08-13 Thread David
Hi list.

Until recently I was using rdiff-backup for backing up our servers.
But after a certain point, the rdiff-backup process would take days,
use up huge amounts of CPU and RAM, and the debug logs were very
opaque. And the rdiff-backup mailing lists weren't very helpful
either.

So, for servers where that's a problem, I changed over to a hardlink
snapshot-based approach.

That's been working fine for the past few weeks. But now the backup
server is running out of harddrive space, and we need to check which
backup histories are taking up the most space, so we can prune the
oldest week or two from them.

The problem now is that 'du' takes days to run and generates 4-5 GB
text files. This is of course caused by du now having to process
dozens of additional snapshot directories, many of them with a large
number of files.

What I've been doing is writing helper scripts to prune the
earlier directories. Something like this:

1) Compare files under snapshot1 & snapshot2. If any files under
snapshot1 are hardlinks to the same files under snapshot2, then remove
them from snapshot1, and add an entry to a text file (for possible
later regeneration).

2) Remove empty directories (and add their details to a text file)

3) 7zip-compress the text files containing recovery info.

4) Possibly later (before empty directories): Remove hardlinks and
symlinks from snapshot dirs, and add them to compressed text files.

There are also scripts to reverse the above operations. Of course, it
takes a while to get a complete snapshot of a given server from a few
weeks back, but at least it's possible.
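
Roughly, step 1 is something like the sketch below (the directory and
log-file names here are just placeholders, not the real script):

#!/bin/bash
# Rough sketch of step 1 (names are placeholders): remove files in
# snapshot1 that are hardlinks to the same inode as their counterpart
# in snapshot2, logging each removal so it could be re-linked later.
OLD=snapshot1
NEW=snapshot2
LOG=pruned-from-snapshot1.txt

cd "$OLD" || exit 1
find . -type f -print | while IFS= read -r f; do
    # -ef is true when both paths refer to the same device and inode
    if [ "$f" -ef "../$NEW/$f" ]; then
        printf '%s\n' "$f" >> "../$LOG"
        rm "$f"
    fi
done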

I'm thinking that this is a lot of work, and there must be better ways
of handling this kind of problem. I don't like reinventing the wheel.

Given the above, are there any suggestions? eg:

1) Another tool similar to rdiff-backup, which has easier-to-understand logs.
2) A quicker way of running DU for my use case (huge number of
hardlinks in directories)
3) Existing tools for managing hardlink-based snapshot directories
etc.

Thanks in advance.

David.


-- 
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Re: Problems with making hardlink-based backups

2009-08-13 Thread Andrew Sackville-West
On Thu, Aug 13, 2009 at 09:20:17AM +0200, David wrote:
> Hi list.
[...]
> 3) Existing tools for managing hardlink-based snapshot directories
> etc.

maybe rsnapshot is what you're after. It does hardlinked snapshots
with automagical deletion of older backups and configurable frequency
etc. I quite like it, though I'm not using it for high-volume stuff.

One little caveat that always seems to get me: the daily won't run
until you've completed enough hourlies, the weekly won't run until
you've completed a week's worth of dailies, etc. Very disconcerting
the first few days of use.

A




Re: Problems with making hardlink-based backups

2009-08-13 Thread David
Thanks for your suggestion, and I have heard of rsnapshot.

Although, actually removing older snapshot directories isn't really the problem.

The problem is, if you have a large number of such backups (perhaps
one per server), then finding out where harddrive space is actually
being used, is problematic (when your backup server starts running low
on disk space).

du worked pretty well with rdiff-backup, but is very problematic with
a large number of hardlink-based snapshots, which each have a complete
"copy" of a massive filesystem (rather than just info on which files
changed).

I guess I could do something like removing the oldest snapshot
directories from *all* the backups, until there is enough free space.
But that's kind of wasteful. Like, if I have one server that didn't
change much over 2 years, then I can only keep eg the last 2-3 weeks
of backups, because there is another server that has a huge amount of
file changes in the same period. And not being able to use "du" is
kind of annoying (actually, "locate" is also having major problems, so
I disabled it on the backup server).

That's why I started working on a set of pruning/unpruning scripts,
which basically "move" redundant info (the vast majority) over into
compressed files (with ability to move out again later). Kind of like
moving the snapshot-based approach closer to how rdiff-backup works
(but, not chewing up huge amounts of ram and being hard to diagnose).
That way admins can in theory more easily check where space is being
used (but at the cost of not having quick access to earlier complete
server snapshots).

But I assume there must be better existing ways of handling this kind
of problem, since backups aren't exactly something new.

On Thu, Aug 13, 2009 at 5:48 PM, Andrew
Sackville-West wrote:
> On Thu, Aug 13, 2009 at 09:20:17AM +0200, David wrote:
>> Hi list.
> [...]
>> 3) Existing tools for managing hardlink-based snapshot directories
>> etc.
>
> maybe rsnapshot is what you're after. It does hardlinked snapshots
> with automagical deletion of older backups and configurable frequency
> etc. I quite like it, though I'm not using it for high-volume stuff.
>
> One little caveat that always seems to get me: the daily won't run
> until you've completed enough hourlies, the weekly won't run until
> you've completed a week's worth of dailies, etc. Very disconcerting
> the first few days of use.
>
> A
>





Re: Problems with making hardlink-based backups

2009-08-13 Thread David
Btw, sorry for top-posting. I don't use mailing lists very often and
forgot about the convention.





Re: Problems with making hardlink-based backups

2009-08-14 Thread Andrew Sackville-West
On Fri, Aug 14, 2009 at 08:43:32AM +0200, David wrote:
> Thanks for your suggestion, and I have heard of rsnapshot.
> 
> Although, actually removing older snapshot directories isn't really the 
> problem.
> 
> The problem is, if you have a large number of such backups (perhaps
> one per server), then finding out where harddrive space is actually
> being used, is problematic (when your backup server starts running low
> on disk space).

keep each server's backup in a distinctly separate location. That
should make it clear which machines are burning up space.

> 
> du worked pretty well with rdiff-backup, but is very problematic with
> a large number of hardlink-based snapshots, which each have a complete
> "copy" of a massive filesystem (rather than just info on which files
> changed).

but they're not copies, they're hardlinks. I guess I don't understand
the problem. In a scheme like that used by rsnapshot, a file is only
*copied* once. If it remains unchanged then the subsequent backup
directories only carry a hardlink to the file. When older backups are
deleted, the hardlinks keep the file around, but no extra room is
used. There are only *pointers* to the file lying around. Then when
the file changes, a new copy will be made and subsequent backups will
hardlink to the new file. Now you'll be using the space of two files
with different sets of hardlinks pointing to them. (I'm sure you know
this, just making sure we are on common ground).
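
Roughly, the mechanism can be built with rsync's --link-dest option
(rsnapshot does the equivalent internally); a sketch with made-up
paths:

# Sketch only -- paths are made up. Unchanged files in the new
# snapshot become extra hardlinks to the copies already sitting in
# the previous snapshot; only changed files are actually transferred.
rsync -a --delete \
      --link-dest=/backup/server1/snapshot.1 \
      server1:/data/ \
      /backup/server1/snapshot.0/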

> 
> I guess I could do something like removing the oldest snapshot
> directories from *all* the backups, until there is enough free space.
> But that's kind of wasteful. Like, if I have one server that didn't
> change much over 2 years, then I can only keep eg the last 2-3 weeks
> of backups, because there is another server that has a huge amount of
> file changes in the same period. And not being able to use "du" is
> kind of annoying (actually, "locate" is also having major problems, so
> I disabled it on the backup server).

If you are using hardlinks, and nice discrete directories for each
machine, then a machine that has infrequent changes will not use a lot
of space because the files don't change. Other than the minimal space
used by the hardlinks themselves, you could save a *lot* of "backups"
of an unchanged file and use the same space as the one file because
there is only one actual copy of the file.

That said, the more often you backup rapidly changing data, the bigger
the backup gets because you store complete copies for each change. You
have to balance the needs of each machine (and probably have a
different scheme for each machine). How important is it to have access
to a specific change in a file? And for how long do you need access to
that specific change? These sorts of questions should help with these
decisions.


> 
> That's why I started working on a set of pruning/unpruning scripts,
> which basically "move" redundant info (the vast majority) over into
> compressed files (with ability to move out again later). Kind of like
> moving the snapshot-based approach closer to how rdiff-backup works
> (but, not chewing up huge amounts of ram and being hard to diagnose).
> That way admins can in theory more easily check where space is being
> used (but at the cost of not having quick access to earlier complete
> server snapshots).

you should be able to look at the difference between disk usage over
different time periods and figure out your "burn rate".  And using a
hardlink approach, you can easily archive older backups and then
remove them without laborious pruning. This is because if you delete a
file that has multiple hard links to it, the file will still exist
until *all* the hardlinks are gone. So to remove a snapshot from
lastweek that contains files that haven't changed, you just remove
it. The files that you still need will still be there because you're
hardlinked to them. 
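
For what it's worth, GNU du counts a file with multiple hardlinks only
once per invocation and charges it to the first path it meets it
under, so listing one server's snapshots oldest-first gives a rough
per-snapshot "burn rate" (directory names made up):

# Each later snapshot is charged only for data not already seen in an
# earlier one, because du skips inodes it has already counted.
du -csh /backup/server1/snapshot.6 \
        /backup/server1/snapshot.5 \
        /backup/server1/snapshot.4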

> 
> But I assume there must be better existing ways of handling this kind
> of problem, since backups aren't exactly something new.

I suspect I'm telling you stuff you already know and I apologize if I
appear condescending. The odds are you probably know more about
backups than I do. hth.

A

> 
> On Thu, Aug 13, 2009 at 5:48 PM, Andrew
> Sackville-West wrote:
> > On Thu, Aug 13, 2009 at 09:20:17AM +0200, David wrote:
> >> Hi list.
> > [...]
> >> 3) Existing tools for managing hardlink-based snapshot directories
> >> etc.
> >
> > maybe rsnapshot is what you're after. It does hardlinked snapshots
> > with automagical deletion of older backups and configurable frequency
> > etc. I quite like it, though I'm not using it for high-volume stuff.
> >
> > One little caveat that always seems to get me: the daily won't run
> > until you've completed enough hourlies, the weekly won't run until
> > you've completed a week's worth of dailies, etc. Very disconcerting
> > the first few days of use.
> >
> > A
> >

Re: Problems with making hardlink-based backups

2009-08-14 Thread Alan Chandler

Andrew Sackville-West wrote:

> On Fri, Aug 14, 2009 at 08:43:32AM +0200, David wrote:
>
>> Thanks for your suggestion, and I have heard of rsnapshot.
>>
>> Although, actually removing older snapshot directories isn't really the problem.
>>
>> The problem is, if you have a large number of such backups (perhaps
>> one per server), then finding out where harddrive space is actually
>> being used, is problematic (when your backup server starts running low
>> on disk space).
>
> keep each server's backup in a distinctly separate location. That
> should make it clear which machines are burning up space.
>
>> du worked pretty well with rdiff-backup, but is very problematic with
>> a large number of hardlink-based snapshots, which each have a complete
>> "copy" of a massive filesystem (rather than just info on which files
>> changed).
>
> but they're not copies, they're hardlinks. I guess I don't understand
> the problem. In a scheme like that used by rsnapshot, a file is only
> *copied* once. If it remains unchanged then the subsequent backup
> directories only carry a hardlink to the file. When older backups are
> deleted, the hardlinks keep the file around, but no extra room is
> used. There are only *pointers* to the file lying around. Then when
> the file changes, a new copy will be made and subsequent backups will
> hardlink to the new file. Now you'll be using the space of two files
> with different sets of hardlinks pointing to them. (I'm sure you know
> this, just making sure we are on common ground).



I'm not sure I understood what you are after either.  Admittedly on a 
rather small home server, I use the cp -alf command to have only changed 
files kept for a long time


This is the cron.daily backup - I have similar cron.weekly and
cron.monthly scripts:



if [ -d $ARCH/daily.6 ] ; then
    if [ ! -d $ARCH/weekly.1 ] ; then mkdir -p $ARCH/weekly.1 ; fi
    # Now merge in stuff here with what might already be there using hard links
    cp -alf $ARCH/daily.6/* $ARCH/weekly.1
    # Finally lose the rest
    rm -rf $ARCH/daily.6
fi

# Shift along snapshots
if [ -d $ARCH/daily.5 ] ; then mv $ARCH/daily.5 $ARCH/daily.6 ; fi
if [ -d $ARCH/daily.4 ] ; then mv $ARCH/daily.4 $ARCH/daily.5 ; fi
if [ -d $ARCH/daily.3 ] ; then mv $ARCH/daily.3 $ARCH/daily.4 ; fi
if [ -d $ARCH/daily.2 ] ; then mv $ARCH/daily.2 $ARCH/daily.3 ; fi
if [ -d $ARCH/daily.1 ] ; then mv $ARCH/daily.1 $ARCH/daily.2 ; fi
if [ -d $ARCH/snap ] ; then mv $ARCH/snap $ARCH/daily.1 ; fi

# Collect new snapshot archive stuff doing daily backup on the way
mkdir -p $ARCH/snap


This leads to daily backups for a week, weekly backups for a month,
and monthly backups until I archive them into a long-term store
(writing a DVD - although, hearing stories about issues with even
those, it might be easier to just leave everything on the disk).




CDARCH=/bak/archive/CDarch-`date +%Y`

if [ -d $ARCH/monthly.6 ] ; then
    if [ ! -d $CDARCH ] ; then mkdir -p $CDARCH ; fi
    cp -alf $ARCH/monthly.6/* $CDARCH
    rm -rf $ARCH/monthly.6
fi


The backup process uses something like the following to keep an
initial backup and save any changed files into this long-term storage.
This is just one part of the backup - other machines and other file
systems use a similar mechanism with just the parameters changed.


rsync -aHqz --delete --backup --backup-dir=$ARCH/snap/freeswitch/ \
    $MACH::freeswitch/ /bak/freeswitch/



--
Alan Chandler
http://www.chandlerfamily.org.uk






Re: Problems with making hardlink-based backups

2009-08-14 Thread Rob Owens
On Thu, Aug 13, 2009 at 09:20:17AM +0200, David wrote:
> Hi list.
> 
> Until recently I was using rdiff-backup for backing up our servers.
> But after a certain point, the rdiff-backup process would take days,
> use up huge amounts of CPU and RAM, and the debug logs were very
> opaque. And the rdiff-backup mailing lists weren't very helpful
> either.
> 
> So, for servers where that's a problem, I changed over to a hardlink
> snapshot-based approach.
> 
> That's been working fine for the past few weeks. But now, the backup
> server is running out of harddrive space and we need to check which
> backup histories are taking up the most space, so we can prune the
> oldest week or two from them.
> 
You might want to check out BackupPC.  It uses hardlinks and
compression.  It's got a web-based GUI which makes it pretty easy to
find statistics on disk space used, on a per-server basis.

-Rob





Re: Problems with making hardlink-based backups

2009-08-17 Thread David
Thanks for the replies.

On Fri, Aug 14, 2009 at 5:05 PM, Andrew
Sackville-West wrote:
>>
>> du worked pretty well with rdiff-backup, but is very problematic with
>> a large number of hardlink-based snapshots, which each have a complete
>> "copy" of a massive filesystem (rather than just info on which files
>> changed).
>
> but they're not copies, they're hardlinks. I guess I don't understand
> the problem.
> [...]

I understand that (actually it's the whole point of using a
hardlink-based snapshot system :-)). What I meant by "massive
filesystem" was the huge number of additional file entries that get
created for every snapshot. That causes major problems for utilities
like du and locate that need to walk the entire filesystem. That's why
I've been making scripts to "prune" files, so that there are fewer
filesystem entries to be walked by those tools.

>
> If you are using hardlinks, and nice discrete directories for each
> machine, then a machine that has infrequent changes will not use a lot
> of space because the files don't change.
> [...]

Thanks, I also understand that. My problem, however, is that all of the
backups (for all the servers) are on a single LVM partition. So when
the LVM is full, I need to run a tool like 'du' to check where
space can be reclaimed. That's no longer working nicely (it takes days
to finish, and makes huge, multi-GB output files). From experience I
know which servers are more likely to have the most disk usage
"churn", and I've been removing their oldest entries recently to
recover space, but I'd like to also be able to run 'du' effectively,
rather than relying on hunches.

>
> you should be able to look at the difference between disk usage over
> different time periods and figure out your "burn rate".
> [...]

Thanks for those ideas too (I had also considered this). However, this
still doesn't let me use tools like du and locate nicely, due to the
huge number of filesystem entries (and I really want to be able to use
those tools, or at least du, to actually check where harddrive space
is being used). Again, that's why I'm having to consider pruning-type
approaches (which seem like an awful hack, but I'm not sure of a
better method at this time).

> I suspect I'm telling you stuff you already know and I apologize if I
> appear condescending. The odds are you probably know more about
> backups than I do. hth.

Nah, it's fine :-). Better too many ideas than assuming I'm aware of
all the possible options and leaving out something that might have
been useful.

On Fri, Aug 14, 2009 at 5:49 PM, Alan
Chandler wrote:
> Andrew Sackville-West wrote:
> I'm not sure I understood what you are after either.  Admittedly on a rather
> small home server, I use the cp -alf command to have only changed files kept
> for a long time

Thanks for those cron and script entries. I guess I could use
something like that too (have X dailies, Y weeklies, Z monthlies,
etc), and it would save more harddrive space. But actually, managing
generations of backups to conserve harddrive space (and still have
some really old backups) isn't really the problem. The problem is:

1) Source servers (file servers, etc.) have millions of files or more.
There are a couple of servers like this.

2) Hardlink snapshot process makes a duplicate of the filesystem
structure (for each of the above servers) each time on the backup
server.

3) The backup server ends up with an enormous number of file entries
(roughly the number of snapshots times the number of files per
server), compared to any of the servers actually being backed up.

The filesystem can support (3), but utilities like 'du' and 'updatedb'
become almost unusable. That's the main problem.

Basically, the problem isn't that I don't know how to use rsync, cp,
etc to make the backups, manage generations,  etc... the problem is an
incredibly large filesystem (as in number of hardlinks, and actual
directories, to a smaller extent), resulting from the hardlink
snapshot-based approach (as opposed to something like rdiff-backup,
which only stores the differences between the generations).

On Sat, Aug 15, 2009 at 4:35 AM, Rob Owens wrote:
>>
> You might want to check out BackupPC.  It uses hardlinks and
> compression.  It's got a web-based GUI which makes it pretty easy to
> find statistics on disk space used, on a per-server basis.
>

I've been researching the various other, more integrated backup
solutions (amanda, bacula, etc), but I have two main problems with
them:

1) They are too over-engineered/complex for my liking, and the docs
are hard to understand. I prefer simple command-line tools like rsync,
etc., which I can script. I also don't really want to have to install
their backup-tool-specific services everywhere on the network if I can
avoid it.

2) I can't find information on how most of them actually store their
backed-up data. So they could very well have either the same problem,
or other issues that I'd be unable to work around if I want to use
that tool.

Thanks for your backuppc suggestion. I have h

Re: Problems with making hardlink-based backups

2009-08-17 Thread David
On Sat, Aug 15, 2009 at 4:35 AM, Rob Owens wrote:
> You might want to check out BackupPC.  It uses hardlinks and
> compression.  It's got a web-based GUI which makes it pretty easy to
> find statistics on disk space used, on a per-server basis.
>

I've been researching backuppc, and it seems like it wants to store
everything in a pool, including the latest backup. Is there a way to
keep the latest backup outside the pool area?

The reason being that, while the pool is very space-efficient, the
layout is somewhat opaque, and afaict it's not very straightforward to
get to the actual backed-up files (for scripts, admin users, etc.,
logged into the backup server).

Places where I'm foreseeing problems:

1) Offsite-backups.

My current scripts use rsync to update the latest snapshots (for each
user, server, etc.) over to a set of external drives. With backuppc,
I'll probably have to find the correct backuppc script incantation (or
hack something together) to restore the latest backup to a temporary
location on the backup server, before copying it over to the external
drive.

Problems:

a. Complicated

b. Going to be slow (slower than if there was an existing directory)

c. Going to use up a lot of extra harddrive space on the backup
server, to store the restored snapshot (e.g. for backed-up file
servers). Unless I work out something ugly whereby uncompressed
backuppc hardlinks are linked to a new structure... (this is
incredibly ugly).

d. Inefficient - if only a few files have changed on a huge backed-up
filesystem, you still need to restore the entire snapshot out of the
backuppc pool.

2) Admin-friendly.

It's simpler for admins to browse through files in a directory
structure on the backup server, on a command line (or with winscp or
samba or whatever), rather than having to go through a web frontend.
99% of the time they're looking for stuff from the latest snapshot, so
it's acceptable for them (or myself) to have to run special commands
to get to the older versions. But the latest snapshot I do actually
want to be present on the harddrive (rather than hidden away in a
pool).

3) Utility-friendly.

With a directory structure, I can run du and determine which files are
huge, or use other unixy things. Without it, I (and scripts, admins,
etc.) have to go through the backuppc-approved channels... an
unnecessary complication imo.

---

I guess one way to do this is to use the regular rsync-based backup
methods to make/update the latest snapshot, and then back that up with
backuppc. But that has the following disadvantages:

1) Lots more disk usage.

Backuppc would be making an independent copy of all the data. It
won't be, e.g., making hardlinks against the latest snapshot, or
reverse incrementals, or anything like that.

2) Redundant and complicated.

Backuppc is meant to be a "one stop", automated thing. If I'm already
handling scheduling and the actual transports, etc. from my scripts,
then it's redundant. All it's being used for is its pooled approach,
which still has the above problems.

---

Basically, what I would require from backuppc is a way to tell it to
preserve a local copy of the latest snapshots (in easy-to-find
locations on the backup server, so admins or scripts can use them
directly), and to only move older versions to the pool... while at the
same time taking advantage of the latest snapshot to conserve backup
server harddrive space (reverse incrementals, hardlinks to it, etc.).

Does anyone who is familiar with backuppc know if the above is possible?

(Although I kind of doubt it at this point. My use cases seem to break
the backuppc design ^^; )

I should probably post about this to the backuppc mailing lists too..
their users would have a lot more relevant experience. In the
meanwhile, I'll probably continue to use a pruned hardlinks approach.

David.





Re: Problems with making hardlink-based backups

2009-08-17 Thread David
Sorry for spamming the list..

I think I didn't read the docs correctly before posting the above. It
seems that backuppc actually does keep recent snapshots that aren't in
the pool... so scripts, admins, etc. can get to them easily without
going through backuppc scripts.

It looks like backuppc actually maintains a hardlinked version of the
backed-up server, outside the pool.

Specifically, the section about "__TOPDIR__/pool" in the docs:

http://backuppc.sourceforge.net/faq/BackupPC.html

I should read the docs more carefully :-)

David.





Re: Problems with making hardlink-based backups

2009-08-17 Thread David
Err.. and another post on backuppc, sorry.

I think that backuppc is actually going to have the same problem (with
massive filesystems causing du and locate, etc to become next to
unusable for the backup storage directories). The reason for this:

"Therefore, every file in the pool will have at least 2 hard links
(one for the pool file and one for the backup file below
__TOPDIR__/pc). Identical files from different backups or PCs will all
be linked to the same file. When old backups are deleted, some files
in the pool might only have one link. BackupPC_nightly checks the
entire pool and removes all files that have only a single link,
thereby recovering the storage for that file."

i.e., there are actually hardlinks for every file, for every server,
for every backup generation. du and locate are still going to have a
bazillion files to go through, even if they are stored in a nice pool
system.

BackupPC has some nice features, but it's not going to fix my problem :-(.

Ideally I would have kept using rdiff-backup, but for now I'm going to
go with hardlink snapshots & pruning (with the restore details kept in
text files).

Is my use case really that unusual? (Wanting to run 'du' and 'locate'
on a backup server, which has a lot of generations of data from other
servers that contain a huge number of files themselves.)

Going to ask about this general problem over at the backuppc mailing
list, maybe people there have more ideas :-)

David.





Re: Problems with making hardlink-based backups

2009-08-17 Thread Andrew Sackville-West
On Mon, Aug 17, 2009 at 10:59:20AM +0200, David wrote:
> Thanks for the replies.
[...]
> 
> Basically, the problem isn't that I don't know how to use rsync, cp,
> etc to make the backups, manage generations,  etc... the problem is an
> incredibly large filesystem (as in number of hardlinks, and actual
> directories, to a smaller extent), resulting from the hardlink
> snapshot-based approach (as opposed to something like rdiff-backup,
> which only stores the differences between the generations).
[...]

ah. well, that is a problem isn't it. I can see why you'd like to
stick with a diff-based backup then. Is there some way you can control
the number of files by tarring up sections of the filesystem prior to
backup? If you have a lot of high-churn files, then you'll likely be
duplicating them anyway, so tarring up the whole lot might make
sense. Then you back up the tarballs instead.
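
Something along these lines on the source machine, with made-up paths:

# Hypothetical example: roll a high-churn, many-small-files directory
# into a single dated tarball, then point the backup at the tarball
# instead of at the individual files.
tar -czf /var/backups/thumbnails-$(date +%F).tar.gz -C /srv/images thumbnails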

Here's another question: what is stored in all these millions of
files? And what is their purpose? Is it a case of using a filesystem
when a database might be a better option? Perhaps the whole problem
you're facing on the backend could be better solved by looking at the
front end. Of course, you'll want to avoid the tail wagging the
dog... 

just a couple of thoughts.

good luck.

A




Re: Problems with making hardlink-based backups

2009-08-18 Thread David
On Mon, Aug 17, 2009 at 6:26 PM, Andrew
Sackville-West wrote:
> Here's another question: what is stored in all these millions of
> files?
> [...]

Basically, the lion's share would be tonnes of user-generated files,
for example huge numbers of image files (and thumbnails) that get
stored in directory structures on one of the file servers. Other
examples would be extensive music & sound libraries, several
debian/ubuntu/etc mirrors, and so on.

About tarring before backing up: yeah, that's possible too (for some
types of data/directory layouts). But then something on the file
server side needs to check whether the tars are still up to date. And
also those tars will take up a lot of precious harddrive space on the
file server :-(. Unless you mean removing the original data... which
is problematic in a few ways. And of course, storing different
versions of those tars (e.g. when users move files around at the
source) is also problematic.

Basically... as you say it would be like tail wagging the dog. Things
would get a lot more complicated & fragile, and in exchange I get a
lot of other, more serious backup problems, which are harder to work
around than the current issues.

About moving to a database: well, the filesystem is already a database
:-). And then trying to keep backups of that (multi-TB) database
itself is a major problem. Not to mention, users and software would
now have to go through some other software to get to their files...
don't want to go there... my head hurts ^^;

The file servers themselves do have a large number of files... that
isn't really the problem. The problem is actually in the backup
software, which struggles to handle history for those backups (either
using massive amounts of memory/CPU, or creating massive numbers of
hardlinks, and so on).

Basically, rdiff-backup was perfect for a while. But then we upgraded
the server to Lenny. And then it stopped working T_T. I think that
rdiff-backup's author must have changed something, which now causes
huge ram usage for large file lists, or other per-file data of some
kind. imo that's unnecessary (it could just use something like a set
of Python iterators in a clever way, or work with incremental file
lists like rsync), but I didn't get any useful replies on their
mailing list when I mentioned my problem and gave a few ideas.

So for now, a combination of ugly hacks with hardlink-type pruning for
history snapshots, and blindly deleting older backup generations to
get space back when needed. At least until I find a better solution.

Anyway, thanks for your ideas :-)

David.





Re: Problems with making hardlink-based backups

2009-08-18 Thread Andrew Sackville-West
On Tue, Aug 18, 2009 at 03:11:47PM +0200, David wrote:
[...]
> Basically, rdiff-backup was perfect for a while. But then we upgraded
> the server to Lenny. And then it stopped working T_T. I think that
> rdiff-backup's author must have changed something, which now causes
> huge ram usage for large file lists, or other per-file data of some
> kind. imo that's unnecessary (it could just use something like a set
> of Python iterators in a clever way, or work with incremental file
> lists like rsync), but I didn't get any useful replies on their
> mailing list when I mentioned my problem and gave a few ideas.
> 

I've had rdiff-backup fail because of mis-matched versions. Again, not
to belabor the obvious, but do you have compatible versions of
rdiff-backup on each machine? If you have compatible (i.e., the same)
versions on both ends and still have problems, then perhaps you should
file a bug report.

A




Re: Problems with making hardlink-based backups

2009-08-19 Thread David
On Tue, Aug 18, 2009 at 5:11 PM, Andrew
Sackville-West wrote:
> I've had rdiff-backup fail because of mis-matched versions. Again, not
> to belabor the obvious, but do you have compatible versions of
> rdiff-backup on each machine? If you have compatible (i.e., the same)
> versions on both ends and still have problems, then perhaps you should
> file a bug report.
>

I'm well aware of those problems, so I don't even bother to use
rdiff-backup over the network. That mode is next to useless unless you
can guarantee the same versions, and tbh, it sucks compared to rsync
for network transfers.

How I use rdiff-backup is like this:

1) Make a temporary snapshot copy of the rdiff-backup repo (minus the
rdiff-backup-data directory), using hardlinks.

2) rsync from the source server over to the temporary copy

(this should be safe, since rsync doesn't overwrite files in place
unless you tell it to)

3) Run rdiff-backup to push the latest temporary copy onto the
rdiff-backup history.

The above works fairly well for me, although rdiff-backup sometimes
gets confused about the hardlinks.
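
In outline it's something like this (the hostnames and paths below are
placeholders, not my real scripts):

#!/bin/bash
# Rough outline of the three steps above; all names are placeholders.
REPO=/backup/server1-rdiff     # the rdiff-backup repository
TMP=/backup/server1-tmp        # temporary hardlinked working copy

# 1) hardlink snapshot of the repo, minus rdiff-backup's own metadata
cp -al "$REPO" "$TMP"
rm -rf "$TMP/rdiff-backup-data"

# 2) rsync from the source server into the working copy; rsync writes
#    a changed file to a new inode and renames it into place, so the
#    repo's own copy (still linked from $REPO) is left untouched
rsync -aH --delete server1:/data/ "$TMP"/

# 3) push the updated working copy onto the local rdiff-backup history
rdiff-backup "$TMP" "$REPO"

# the temporary copy is no longer needed
rm -rf "$TMP"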

About filing a bug report, not sure how much that's going to help,
since the mailing list wasn't very informative. I get the idea that
the main developer has either abandoned the project, or is taking a
break for a few months (as evidenced by the recent bug tracker
activity, and the lack of useful replies in the mailing list).

David.

