On Sun, Mar 14, 2010 at 03:31:13PM +0100, Maarten Bezemer wrote:

> On Sat, 13 Mar 2010, Whit Blauvelt wrote:
>
> > On Sat, Mar 13, 2010 at 11:58:42PM +0100, Jernej Simončič wrote:
> >
> > > I'd say this is expected behaviour - the destination saw the file on
> > > previous run, but didn't see it on current run (because the source
> > > likely doesn't inform it about files it skips), so it treats the
> > > file as deleted on source.
> >
> > Probably so. A corner case then. Even though it would be easy for the
> > source to inform it about files skipped and avoid this, it's probably
> > not worth the coding effort.
>
> I don't think this is even a corner case. If you want to exclude
> large files, then a file that is larger than the limit you specify
> (something you explicitly and deliberately do!) should not be in the
> backup. Also, it should not _remain_ in the 'current' backup tree,
> because it would no longer match the original in the source tree.
> Since rdiff-backup keeps history of the backups, there is no other
> way than to treat it as 'deleted from the source'. That's the only
> way to keep the history intact AND have a proper 'current' backup
> tree.
Here's how the corner case occurs: on earlier runs you have included
larger files. On a later run you decide to set a lower threshold for the
maximum file size to include. But rdiff-backup, because it doesn't check
whether a larger file has actually been deleted on the original, applies
a blanket rule of "treat as deleted on original" to any file larger than
the new threshold. This can mean spending significant system resources
gzipping very large files - over what can be hours if there are several
of them - on the logic that they were deleted on the original. But they
weren't.

Now, what you're proposing is in effect a rule that says "Once a backup
is done with a certain size threshold, no subsequent backup of the same
tree should be done with a lower size threshold." It would be possible
to enforce that in the software - not that I'd recommend it - by storing
the last threshold used (if any) and refusing to run if it's lowered. Or
it would be possible to enforce the rule by explaining it in the user
documentation. Have I missed this being documented? Expecting the user
to just intuitively know about a design decision buried deep in the code
is _not_ the best way to enforce such a rule.

As far as intact history goes, that's a side issue here, isn't it?
Rdiff-backup is keeping the larger file, which may now be dated, in
either case. The difference is whether it gzips it. Our difference of
opinion is whether that extra work of gzipping should be triggered by
this event, when it wouldn't otherwise be, just because the lowered
threshold puts the file into the "treat as deleted on original" category
when it really hasn't been deleted there.

Or to be clearer: rdiff-backup could add a new classification to its
database. If you're right, the history data currently conflates two
facts: whether a file existed on the original system at a certain time,
and whether rdiff-backup has preserved a copy of the file as of then.
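To sketch what I mean by a new classification (purely illustrative - the
field names here are my own invention, not rdiff-backup's actual metadata
format; I use Python only because that's what rdiff-backup is written in):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-file, per-time-slice record separating the two facts
# the current history conflates: "existed on the source at this time"
# versus "rdiff-backup retained a copy of its contents".
@dataclass
class FileRecord:
    path: str
    timestamp: float          # time slice of the backup run
    existed_on_source: bool   # file was present in the source tree
    copy_retained: bool       # a copy/diff of its contents was kept
    skip_reason: Optional[str] = None  # e.g. "exceeds size threshold"

# A file skipped for size would then be recorded as still present on the
# source, just not copied - instead of being treated as deleted there:
rec = FileRecord("/data/huge.iso", 1268575873.0,
                 existed_on_source=True, copy_retained=False,
                 skip_reason="exceeds size threshold")
```

With a record like that, lowering the size threshold would simply flip
copy_retained for the affected files, rather than triggering the
"deleted on original" machinery.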
It would be more accurate to keep a complete record of all files that
were in the target directory tree on the original system at each time
slice, and then have a separate field in which rdiff-backup stores
whether or not it has kept a copy of the file itself as it existed then.
That would not just avoid treating a file as deleted on the original
when it hasn't been; it could support things like running rdiff-backup
at regular intervals during working hours against only the smaller
files, while running a daily backup of even the large stuff every night,
without having to establish two redundant backup spaces to accommodate
this.

> > Another question comes up though. If gzip'ing a huge file can cause
> > a reasonably fast machine to tie up considerable resources for > 30
> > minutes because its logic tells it it's time to gzip a 16GB file, it
> > would be good if there's a way to ask it not to do that.
>
> (BTW, spending 30 minutes on a 16GB file, I don't think that would
> be so strange. Even md5sum-ing a 4.7GB iso image can take a few
> minutes on a busy system with lots of disk i/o.)

Not strange. Just perhaps not necessary or desired here.

> > I see that compression can be turned off for all files, but not how
> > to turn compression off just for the largest files. Is there some
> > trick that would accomplish that? Basically, compression on smaller
> > files is always good; compression on the very largest files almost
> > always bad; and somewhere in between - depending on system resources
> > - it gets iffy. It would be useful to have a flag to set a file-size
> > threshold where only files below that would compress.
>
> These are quite strong claims without any proof or supporting theory.
> Compressing a 7KB file might indeed make it considerably smaller,
> suppose it would be 4.1K when zipped. But on file systems with 4KB
> blocks, that would not even save 1 block. And filesystems supporting
> multiple 16GB files tend to have larger block sizes...
> Larger files on the other hand can often be compressed with much
> larger space-savings. As always, it all depends on the type of data
> in the files, so YMMV.

Good points. But let me rephrase the claims more clearly. (Language can
be too broad a brush for technical discussions.)

If the user's goal is to compromise between (1) having each run finish
in a reasonable time and (2) having the backups take up less-than-
maximum space, then a reasonable heuristic is to gzip files small enough
for the gzip'ing to happen rapidly (it's very fast on small files), and
not to gzip files large enough that gzip'ing becomes a process of many
tens of minutes. Yes, some smaller files are worthless to gzip; yes,
some larger files gzip down tremendously (if slowly), while others
hardly compress at all. Still, a size-based threshold will keep the
rdiff-backup run time reasonably quick, while also getting some degree
of storage savings from gzip.

As it is, the only way to assure fast runs when large files are included
is to turn off gzip'ing entirely. But we could have nearly-as-fast runs
with some space saving by enabling gzip'ing only for files smaller than
the threshold at which a given system can gzip them rapidly. That's the
feature I'm suggesting. It's not based on much experience with
rdiff-backup, but I've been gzipping for many years, so I'm confident
this would work and be useful - at least if there are other users who
share my goal of a compromise between minimum run times and storage
efficiency that favors the run times without totally giving up on the
efficiency.

Your own suggestions are clever but harder to implement. The switch I'm
suggesting would only be a few lines of code: something like
"--leave-uncompressed-if-size-greater-than", which passes a size
variable that the compression routine checks against the file size
before commencing, or not. Trivial, really.
But if I'm the only user who'd benefit from it, perhaps it's not
required.

Best,
Whit

_______________________________________________
rdiff-backup-users mailing list at rdiff-backup-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/rdiff-backup-users
Wiki URL: http://rdiff-backup.solutionsfirst.com.au/index.php/RdiffBackupWiki