On Sun, Mar 14, 2010 at 03:31:13PM +0100, Maarten Bezemer wrote:

> On Sat, 13 Mar 2010, Whit Blauvelt wrote:
>
> > On Sat, Mar 13, 2010 at 11:58:42PM +0100, Jernej Simončič wrote:
> >
> > > I'd say this is expected behaviour - the destination saw the file on
> > > previous run, but didn't see it on current run (because the source
> > > likely doesn't inform it about files it skips), so it treats the
> > > file as deleted on source.
> >
> > Probably so. A corner case then. Even though it would be easy for the
> > source to inform it about files skipped and avoid this, it's probably
> > not worth the coding effort.
>
> I don't think this is even a corner case. If you want to exclude
> large files, then a file that is larger than the limit you specify
> (something you explicitly and deliberately do!) should not be in the
> backup. Also, it should not _remain_ in the 'current' backup tree,
> because it would no longer match the original in the source tree.
> Since rdiff-backup keeps history of the backups, there is no other
> way than to treat it as 'deleted from the source'. That's the only
> way to keep the history intact AND have a proper 'current' backup
> tree.
Here's how the corner case occurs: on earlier runs you have included
larger files. On a later run you decide to set a lower threshold for the
maximum file size to include. But rdiff-backup, because it doesn't check
whether a larger file has actually been deleted on the original, applies
a blanket rule of "treat as deleted on original" to any file larger than
the new threshold. This can mean spending significant system resources
gzipping very large files - over what can be hours if there are several
of them - on the logic that they were deleted on the original. But they
weren't.

Now, what you're proposing is in effect a rule that says "Once a backup
is done with a certain size threshold, no subsequent backup of the same
tree should be done with a lower size threshold." It would be possible
to enforce that in the software - not that I'd recommend it - by storing
the last threshold used (if any) and refusing to run if it's lowered. Or
it would be possible to enforce the rule by explaining it in the user
documentation. Have I missed this being documented? Expecting the user
to just intuitively know about a design decision buried deep in the code
is _not_ the best way to enforce such a rule.

As far as intact history goes, that's a side issue here, isn't it?
Rdiff-backup is keeping the larger file, which may now be dated, in
either case. The difference is whether it gzips it. Our difference of
opinion is whether that extra work of gzipping should be triggered by
this event, when it wouldn't otherwise be, just because the lowered
threshold puts the file into the "treat as deleted on original" category
when it really hasn't been deleted there.

Or to be clearer: rdiff-backup could add a new classification to its
database. If you're right, the history data currently conflates two
facts: whether a file existed on the original system at a certain time,
and whether rdiff-backup has preserved a copy of the file as of then.
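To sketch what I mean by a new classification (purely illustrative - the
field names here are my own invention, not rdiff-backup's actual metadata
format; I use Python only because that's what rdiff-backup is written in):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-file, per-time-slice record separating the two facts
# the current history conflates: "existed on the source at this time"
# versus "rdiff-backup retained a copy of its contents".
@dataclass
class FileRecord:
    path: str
    timestamp: float          # time slice of the backup run
    existed_on_source: bool   # file was present in the source tree
    copy_retained: bool       # a copy/diff of its contents was kept
    skip_reason: Optional[str] = None  # e.g. "exceeds size threshold"

# A file skipped for size would then be recorded as still present on the
# source, just not copied - instead of being treated as deleted there:
rec = FileRecord("/data/huge.iso", 1268575873.0,
                 existed_on_source=True, copy_retained=False,
                 skip_reason="exceeds size threshold")
```

With a record like that, lowering the size threshold would simply flip
copy_retained for the affected files, rather than triggering the
"deleted on original" machinery.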
It would be more accurate to keep a complete record of all files that
were in the target directory tree on the original system at each time
slice, and then have a separate field in which rdiff-backup stores
whether or not it has kept a copy of the file itself as it existed then.
That would not just avoid treating a file as deleted on the original
when it hasn't been; it could support things like running rdiff-backup
at regular intervals during working hours against only the smaller
files, while running a daily backup of even the large stuff every night,
without having to establish two redundant backup spaces to accommodate
this.

> > Another question comes up though. If gzip'ing a huge file can cause
> > a reasonably fast machine to tie up considerable resources for > 30
> > minutes because its logic tells it it's time to gzip a 16GB file, it
> > would be good if there's a way to ask it not to do that.
>
> (BTW, spending 30 minutes on a 16GB file, I don't think that would
> be so strange. Even md5sum-ing a 4.7GB iso image can take a few
> minutes on a busy system with lots of disk i/o.)

Not strange. Just perhaps not necessary or desired here.

> > I see that compression can be turned off for all files, but not how
> > to turn compression off just for the largest files. Is there some
> > trick that would accomplish that? Basically, compression on smaller
> > files is always good; compression on the very largest files almost
> > always bad; and somewhere in between - depending on system resources
> > - it gets iffy. It would be useful to have a flag to set a file-size
> > threshold where only files below that would compress.
>
> These are quite strong claims without any proof or supporting theory.
> Compressing a 7KB file might indeed make it considerably smaller,
> suppose it would be 4.1K when zipped. But on file systems with 4KB
> blocks, that would not even save 1 block. And filesystems supporting
> multiple 16GB files tend to have larger block sizes...
> Larger files on the other hand can often be compressed with much
> larger space-savings. As always, it all depends on the type of data
> in the files, so YMMV.

Good points. But let me rephrase the claims more clearly. (Language can
be too broad a brush for technical discussions.)

If the user's goal is to compromise between (1) having each run finish
in a reasonable time and (2) having the backups take up less-than-
maximum space, then a reasonable heuristic is to gzip files small enough
for the gzip'ing to happen rapidly (it's very fast on small files), and
not to gzip files large enough that gzip'ing becomes a process of many
tens of minutes. Yes, some smaller files are worthless to gzip; yes,
some larger files gzip down tremendously (if slowly), while others
hardly compress at all. Still, a size-based threshold will keep the
rdiff-backup run time reasonably quick, while also getting some degree
of storage savings from gzip.

As it is, the only way to assure fast runs when large files are included
is to turn off gzip'ing entirely. But we could have nearly-as-fast runs
with some space saving by enabling gzip'ing only for files smaller than
the threshold at which a given system can gzip them rapidly. That's the
feature I'm suggesting. It's not based on much experience with
rdiff-backup, but I've been gzipping for many years, so I'm confident
this would work and be useful - at least if there are other users who
share my goal of a compromise between minimum run times and storage
efficiency that favors the run times without totally giving up on the
efficiency.

Your own suggestions are clever but harder to implement. The switch I'm
suggesting would only be a few lines of code: something like
"--leave-uncompressed-if-size-greater-than", which passes a size
variable that the compression routine checks against the file size
before commencing, or not. Trivial, really.
But if I'm the only user who'd benefit from it, perhaps it's not
required.

Best,
Whit

_______________________________________________
rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki