> > On Wed, Sep 29, 2021 at 4:27 AM Peter Humphrey <pe...@prh.myzen.co.uk> 
> > wrote:
> >> Thanks Laurence. I've looked at borg before, wondering whether I 
> >> needed a more sophisticated tool than just tar, but it looked like 
> >> too much work for little gain. I didn't know about duplicity, but I'm 
> >> used to my weekly routine and it seems reliable, so I'll stick with 
> >> it pro tem. I've been keeping a daily KMail archive since the bad old 
> >> days, and five weekly backups of the whole system, together with 12 
> >> monthly backups and, recently, an annual backup. That last may be overkill, 
> >> I dare say.
> > I think Restic might be gaining some ground on duplicity.  I use 
> > duplicity and it is fine, so I haven't had much need to look at 
> > anything else.  Big advantages of duplicity over tar are:
> >
> > 1. It will do all the compression/encryption/etc stuff for you - all 
> > controlled via options.
> > 2. It uses librsync, which means if one byte in the middle of a 10GB 
> > file changes, you end up with a few bytes in your archive and not 10GB 
> > (pre-compression).
> > 3. It has a ton of cloud/remote backends, so it is really easy to store 
> > the data on AWS/Google/whatever.  When operating this way it can keep 
> > local copies of the metadata, and if for some reason those are lost it 
> > can just pull only that back down from the cloud to resync, without a 
> > huge bill.
> > 4. It can do all the backup rotation logic (fulls, incrementals, 
> > retention, etc).
> > 5. It can prefix files so that on something like AWS you can have the 
> > big data archive files go to glacier (cheap to store, expensive to 
> > restore), and the small metadata stays in a data class that is cheap 
> > to access.
> > 6. By default local metadata is kept unencrypted, and anything on the 
> > cloud is encrypted.  This means that you can just keep a public key in 
> > your keyring for completely unattended backups, without fear of access 
> > to the private key.  Obviously if you need to restore your metadata 
> > from the cloud you'll need the private key for that.
> >
> > If you like the more tar-like process another tool you might want to 
> > look at is dar.  It basically is a near-drop-in replacement for tar 
> > but it stores indexes at the end of every file, which means that you 
> > can view archive contents/etc or restore individual files without 
> > scanning the whole archive.  tar was really designed for tape where 
> > random access is not possible.
> >
> 
> 
> Curious question here.  As you may recall, I back up to an external hard 
> drive.  Would it make sense to use that software for an external hard drive?  
> Right now, I'm just doing file updates with rsync, and the drive is encrypted.  
> Thing is, I'm going to have to split across three drives soon.  So, compressing 
> may help.  Since it is video files, it may not help much, but I'm not sure 
> about that.  Just curious. 
> 
> Dale
> 
> :-)  :-) 
> 
> 

If I understand correctly, you're using rsync+tar and then keeping a set of 
copies of various ages.

If you lose a single file that you want to restore and have to go hunting for 
it, tar can only list the files in an archive by reading through the entire 
thing, and can only extract by scanning from the beginning until it stumbles 
across the matching filename.  So with large archives to hunt through, that 
could take...  a while...
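
For example (the archive name and paths are just made up for illustration), 
both listing and pulling a single file out of a compressed tar archive mean 
decompressing and scanning the whole thing:

    # list everything and grep for the file -- reads the entire archive
    tar -tzf /mnt/backup/weekly.tar.gz | grep 'Documents/notes.txt'

    # extract just that one member -- still scans through the archive
    tar -xzf /mnt/backup/weekly.tar.gz home/dale/Documents/notes.txt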

dar works much the same way as tar on the command line (pretty sure it's 
billed as a near-drop-in replacement; I'd have to look again, but I remember 
that being one of its main selling points), but it stores a catalogue at the 
end of the archive, allowing you to list the contents and jump straight to 
particular files without having to read the entire thing.  It won't help with 
your space shortage, but it will make searching and single-file restores much 
faster.
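
Roughly like this, from memory (check the man page before trusting the exact 
switches; the paths here are placeholders):

    # create an archive of /home/dale; dar writes weekly.1.dar
    dar -c /mnt/backup/weekly -R /home/dale

    # list the contents -- only the catalogue at the end gets read
    dar -l /mnt/backup/weekly

    # restore a single file, relative to the -R root used at creation
    dar -x /mnt/backup/weekly -g Documents/notes.txt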

Duplicity and similar tools have the indices, and additionally a 
full+incremental scheme.  So searching is reasonably quick, and restoring 
likewise doesn't have to grovel over all the data.  Restores can be slower 
than tar or dar though, because it has to restore from the full first and 
then walk through however many incrementals are necessary to get the version 
you want.  That comes with a substantial space saving though, as each set of 
archive files after the full contains only the pieces that actually changed.  
Coupled with compression, that might solve your space issues for a while 
longer.
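
A rough sketch of what that looks like (target URL and paths are placeholders; 
duplicity will also want GnuPG set up unless you pass --no-encryption):

    # occasional fulls, incrementals in between, all against the same target
    duplicity full /home/dale file:///mnt/backup/duplicity
    duplicity incremental /home/dale file:///mnt/backup/duplicity

    # pull one file back out of the chain
    duplicity restore --file-to-restore Documents/notes.txt \
        file:///mnt/backup/duplicity /tmp/notes.txt

    # expire chains older than two months
    duplicity remove-older-than 2M --force file:///mnt/backup/duplicity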

Borg and similar break the files into variable-size chunks and store each chunk 
indexed by its content hash.  So each chunk gets stored exactly once regardless 
of how many times it may occur in the data set.  Backups then become simply 
lists of file attributes and what chunks they contain.  This results both in 
storing only changes between backup runs and in deduplication of 
commonly-occurring data chunks across the entire backup.  The database-like 
structure also means that all backups can be searched and restored from in 
roughly equal amounts of time and that backup sets can be deleted in any order. 
 Many of them (Borg included) also allow mounting backup sets via FUSE.  The 
disadvantage is that restore requires a compatible version of the backup tool 
rather than just a generic utility.
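
For comparison, a minimal Borg session might look something like this (the 
repository path and archive naming are just examples):

    # one-time setup of a repository on the external drive
    borg init --encryption=repokey /mnt/backup/borg

    # each run stores only chunks the repository hasn't seen before
    borg create --compression lz4 /mnt/backup/borg::{hostname}-{now} /home/dale

    # list archives, or mount the repo to browse backups like a filesystem
    borg list /mnt/backup/borg
    borg mount /mnt/backup/borg /mnt/restore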

LMP
