On 12/22/2018 08:48 PM, Marek Marczykowski-Górecki wrote:

On Fri, Dec 21, 2018 at 08:39:43AM -0500, Chris Laprise wrote:
On 12/20/2018 09:40 PM, Marek Marczykowski-Górecki wrote:
Thanks for doing this!

I haven't really looked at the code, but I have a more general comment:

The idea of small, frequent snapshots to collect modified blocks bitmaps
is neat. But I'd really, really like to avoid inventing yet another
backup archive format. The current qubes backup format has its own
limitations, and while I have some ideas[1] for how to plug incremental
backups in there, I don't think there is a future in that. On the other
hand, there are already solutions using a very similar approach to
handling incremental backup (basically, do not differentiate between
"full", "incremental" and "differential" backups, but split the data set
into chunks and send only those not already present in the backup archive).
And those already have established formats, including encryption and
integrity protection. Specifically, I'm looking into two of them:
   - duplicity
   - BorgBackup


I think it's about time someone in open source created an analog to the Time
Machine sparsebundle format, just because it's so effective and _simple_:
fixed-size chunks of the volume stored as files with filenames representing
addresses, and manifests with sha256 hashes. There's scarcely anything more
to it than that, and it's simple enough to be processed by shell commands
like find, cp and zcat (see 'spbk-assemble' for a functional example).
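
To illustrate the idea, here is a rough sketch in Python -- the chunk
size, naming and manifest layout are only illustrative, not sparsebak's
actual on-disk format:

    # Rough sketch of the chunk + manifest idea; names, paths and the
    # 128kB chunk size are illustrative.
    import hashlib, os, sys

    CHUNK = 128 * 1024  # fixed-size chunks, addressed by volume offset

    def chunkify(volume_path, dest_dir):
        os.makedirs(dest_dir, exist_ok=True)
        manifest = []
        with open(volume_path, "rb") as vol:
            addr = 0
            while True:
                data = vol.read(CHUNK)
                if not data:
                    break
                name = "x%016x" % addr   # filename encodes the chunk address
                with open(os.path.join(dest_dir, name), "wb") as out:
                    out.write(data)
                manifest.append("%s  %s" % (hashlib.sha256(data).hexdigest(), name))
                addr += CHUNK
        with open(os.path.join(dest_dir, "manifest"), "w") as m:
            m.write("\n".join(manifest) + "\n")

    if __name__ == "__main__":
        chunkify(sys.argv[1], sys.argv[2])

Reassembly is then just concatenating the chunk files in address order,
which is why plain shell tools like find and cat are enough.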

This already works on millions of Mac systems where people expect it to
provide hourly backups without noticeably affecting system resources. This
class of format I don't mind creating; I think Apple chose well.

Can you point at the documentation of the encryption scheme used by
Time Machine backups?

This is the weakest part of my effort so far: scant planning for encryption (and IANAC). It's the one struggle I see coming with this project.

I recognize the problem as one involving block-based ciphers/modes and the level of resistance they offer against any spy who can view successive chunk updates. My understanding of the Time Machine method is that it's similar, if not identical, to encryption of a normal disk volume (or a 'normal' loop dev that happens to be in chunks). If so, I may try to implement something close to it using Python AES, but I will still seek input from a cryptographer, which should be done in any case.

FWIW, I've considered trying a modified version of the scrypt method used in qvm-backup. Sparsebak can be used in a tarfile mode, for instance, which makes this practical but has the side effect of removing the ability to prune.
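
To make the direction concrete, here is a minimal sketch assuming the
Python 'cryptography' package (scrypt for key derivation, AES-GCM per
chunk). This is only a sketch for discussion -- not a vetted design and
not current sparsebak behavior:

    # Sketch only -- parameters and the per-chunk AES-GCM approach would
    # still need review by a cryptographer.
    import os
    from cryptography.hazmat.primitives.kdf.scrypt import Scrypt
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def derive_key(passphrase: bytes, salt: bytes) -> bytes:
        # scrypt KDF, loosely in the spirit of qvm-backup's use of scrypt
        return Scrypt(salt=salt, length=32, n=2**17, r=8, p=1).derive(passphrase)

    def encrypt_chunk(key: bytes, chunk_name: str, data: bytes) -> bytes:
        # Fresh random nonce per chunk; the chunk name is bound in as
        # associated data so a ciphertext can't be silently renamed.
        nonce = os.urandom(12)
        return nonce + AESGCM(key).encrypt(nonce, data, chunk_name.encode())

    def decrypt_chunk(key: bytes, chunk_name: str, blob: bytes) -> bytes:
        return AESGCM(key).decrypt(blob[:12], blob[12:], chunk_name.encode())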


Also note that we'd like to have at least some level of metadata hiding
- like VM names (leaked through file names).

I have an idea for a relatively simple obfuscation layer that could even re-order the transmission of chunks in addition to concealing filenames. It would use an additional index with randomized names and a shuffled order. Implementing this, I surmise, could also improve the robustness of the encryption.
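
As a rough illustration of the kind of index I have in mind (purely
hypothetical, not implemented; the index itself would of course have to
travel encrypted):

    # Hypothetical sketch: map real chunk names to random ones and
    # shuffle the order in which chunks are sent.
    import os, random

    def build_obfuscation_index(chunk_names):
        index = {name: os.urandom(16).hex() for name in chunk_names}
        send_order = list(chunk_names)
        random.SystemRandom().shuffle(send_order)  # randomized send order
        return index, send_order

The receiving side would only ever see the random names; the mapping
back to volume addresses stays in the encrypted index.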


As for borg, I'm not sure a heavy emphasis on deduplication is appropriate
for many PC applications. It's a resource drain that leads to complex archive
formats on the back end. And my initial testing suggests the dedup efficacy
is oversold: Sparsebak can sometimes produce smaller multi-generation
archives even without dedup.

Not arguing with this. I think borg could be good enough in our case
with fixed-size chunks, using your way of detecting what has changed.
Deduplication here would be mostly about re-using old chunks (already in
the backup archive) for a new backup - so, the "incremental" part.
I just want to avoid re-inventing a compressed and encrypted archive
format (a mistake we've made before). Borg already has an established
format for that.

Yes - keeping in mind the chunk size I'm currently using is 128kB with fixed boundaries. I've experimented with simple retroactive dedup based on sorting the manifest hashes, and that can save a little space at almost no time/power cost. This could be done at send time to save bandwidth, but that savings may not be worth it. OTOH, if we expect some users to back up related cloned VMs (common with templates), the potential savings become very significant even with this simple method.
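
A sketch of that simple retroactive approach (a hypothetical helper, not
sparsebak's actual code), assuming manifest lines of the form
"<sha256>  <chunk filename>":

    import os

    def dedup_by_manifest(dest_dir):
        with open(os.path.join(dest_dir, "manifest")) as m:
            entries = [line.split() for line in m if line.strip()]
        entries.sort()                     # identical hashes become adjacent
        prev_hash, prev_name = None, None
        for h, name in entries:
            if h == prev_hash:
                # Same content: drop the duplicate and hardlink it to the
                # first occurrence (a manifest reference would also work).
                dup = os.path.join(dest_dir, name)
                os.remove(dup)
                os.link(os.path.join(dest_dir, prev_name), dup)
            else:
                prev_hash, prev_name = h, name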

To be sure, borg gets better dedup with arbitrary data input, but even so that looks to be around 2-4%. Would I work days to add that efficiency to sparsebak's COW awareness? Probably. Months? Probably not. If borg were to be integrated at all, it would need modification to accept named objects (sparsebak chunks) streamed into it, plus some way of indicating an incremental backup so there is namespace integration between successive backup sessions.

I've also done another test that should be a better indicator of relative speed. It uses a new 'qubes-ssh' protocol option I added so that loopdev + fuse layers aren't a factor. With this, sparsebak is consistently faster than borg over local 802.11n wifi for both initial and incremental backups. An assumption here is that adding encryption will not have a large impact -- keeping in mind, too, that sparsebak has no multiprocessing or optimizations as of yet.


Actually this is one of Sparsebak's strong points... very low interactivity
during remote operations.

But as far as I understand, to get the most out of it, you need
hardlink-compatible storage, which for example excludes most cloud
services...

Hardlinks are currently used for housekeeping operations (i.e. merging during pruning), but I didn't follow Apple's example to the extent that each incremental session must look like a whole volume (where they really lean on hardlinks). Instead, manifests are quickly assembled on an as-needed basis to create a meta-index for a complete volume view. So the question is: can session merging use a method without hardlinks? I think it could use 'move' -- and possibly gain encfs as an encryption option in the process.
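
A rough sketch of what a move-based merge might look like (hypothetical,
assuming one directory of chunk files per session; manifest handling
omitted):

    import os

    def merge_sessions(old_dir, new_dir):
        # Fold an older session into the one that supersedes it.
        for name in os.listdir(old_dir):
            src = os.path.join(old_dir, name)
            dst = os.path.join(new_dir, name)
            if os.path.exists(dst):
                os.remove(src)             # newer copy wins
            else:
                os.rename(src, dst)        # carry unsuperseded chunks forward
        os.rmdir(old_dir)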

As for targeting cloud storage, that will take some time on my part as it's not something I normally use. Although it's becoming less necessary, my original concept for the storage end was a Unix environment (GNU + Python currently required) with contemporary mainstream filesystems that can handle and rapidly process large numbers of files, accessed via a qube or a protocol like ssh. Maybe cloud storage APIs could be targeted, but they might not be practical for volumes over a certain size.

--

Chris Laprise, tas...@posteo.net
https://github.com/tasket
https://twitter.com/ttaskett
PGP: BEE2 20C5 356E 764A 73EB  4AB3 1DC4 D106 F07F 1886
