On 12/22/2018 08:48 PM, Marek Marczykowski-Górecki wrote:

On Fri, Dec 21, 2018 at 08:39:43AM -0500, Chris Laprise wrote:
On 12/20/2018 09:40 PM, Marek Marczykowski-Górecki wrote:
Thanks for doing this!

I haven't really looked at the code, but I have a more general comment:

The idea of small, frequent snapshots to collect modified blocks bitmaps
is neat. But I'd really, really like to avoid inventing yet another
backup archive format. The current qubes backup format has its own
limitations, and while I have some ideas[1] for how to plug incremental
backups in there, I don't think there is a future in that. On the other
hand, there are already solutions using a very similar approach to
handling incremental backup (basically, do not differentiate between
"full", "incremental" and "differential" backups, but split the data set
into chunks and send only those not already present in the backup archive).
And those already have established formats, including encryption and
integrity protection. Specifically, I'm looking into two of them:
   - duplicity
   - BorgBackup


I think it's about time someone in open source created an analog to the Time
Machine sparsebundle format, just because it's so effective and _simple_:
fixed-size chunks of the volume stored as files with filenames representing
addresses, and manifests with sha256 hashes. There's scarcely anything more
to it than that, and it's simple enough to be processed by shell commands
like find, cp and zcat (see 'spbk-assemble' for a functional example).
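
To illustrate the idea, here is a rough sketch in Python -- the chunk
size, naming and manifest layout are only illustrative, not sparsebak's
actual on-disk format:

    # Rough sketch of the chunk + manifest idea; names, paths and the
    # 128kB chunk size are illustrative.
    import hashlib, os, sys

    CHUNK = 128 * 1024  # fixed-size chunks, addressed by volume offset

    def chunkify(volume_path, dest_dir):
        os.makedirs(dest_dir, exist_ok=True)
        manifest = []
        with open(volume_path, "rb") as vol:
            addr = 0
            while True:
                data = vol.read(CHUNK)
                if not data:
                    break
                name = "x%016x" % addr   # filename encodes the chunk address
                with open(os.path.join(dest_dir, name), "wb") as out:
                    out.write(data)
                manifest.append("%s  %s" % (hashlib.sha256(data).hexdigest(), name))
                addr += CHUNK
        with open(os.path.join(dest_dir, "manifest"), "w") as m:
            m.write("\n".join(manifest) + "\n")

    if __name__ == "__main__":
        chunkify(sys.argv[1], sys.argv[2])

Reassembly is then just concatenating the chunk files in address order,
which is why plain shell tools like find and cat are enough.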

This already works on millions of Mac systems where people expect it to
provide hourly backups without noticeably affecting system resources. This
class of format I don't mind creating; I think Apple chose well.

Can you point at the documentation of the encryption scheme used by
Time Machine backups?

This is the weakest part of my effort so far: scant planning for encryption (and IANAC). It's the one struggle I see coming with this project.

I recognize the problem as one involving block-based ciphers/modes and the level of resistance they offer against any spy who can view successive chunk updates. My understanding of the Time Machine method is that it's similar, if not identical, to encryption of a normal disk volume (or a 'normal' loop dev that happens to be in chunks). If so, I may try to implement something close to it using Python AES, but I will still seek input from a cryptographer, which should be done in any case.

FWIW, I've considered trying a modified version of the scrypt method used in qvm-backup. Sparsebak can be used in a tarfile mode, for instance, which makes this practical but has the side effect of removing the ability to prune.
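
To make the direction concrete, here is a minimal sketch assuming the
Python 'cryptography' package (scrypt for key derivation, AES-GCM per
chunk). This is only a sketch for discussion -- not a vetted design and
not current sparsebak behavior:

    # Sketch only -- parameters and the per-chunk AES-GCM approach would
    # still need review by a cryptographer.
    import os
    from cryptography.hazmat.primitives.kdf.scrypt import Scrypt
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def derive_key(passphrase: bytes, salt: bytes) -> bytes:
        # scrypt KDF, loosely in the spirit of qvm-backup's use of scrypt
        return Scrypt(salt=salt, length=32, n=2**17, r=8, p=1).derive(passphrase)

    def encrypt_chunk(key: bytes, chunk_name: str, data: bytes) -> bytes:
        # Fresh random nonce per chunk; the chunk name is bound in as
        # associated data so a ciphertext can't be silently renamed.
        nonce = os.urandom(12)
        return nonce + AESGCM(key).encrypt(nonce, data, chunk_name.encode())

    def decrypt_chunk(key: bytes, chunk_name: str, blob: bytes) -> bytes:
        return AESGCM(key).decrypt(blob[:12], blob[12:], chunk_name.encode())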


Also note that we'd like to have at least some level of metadata hiding
- like VM names (leaked through file names).

I have an idea for a relatively simple obfuscation layer that could even re-order the transmission of chunks in addition to concealing filenames. It would use an additional index with randomized names and a shuffled order. Implementing this, I surmise, could also improve the robustness of the encryption.
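
As a rough illustration of the kind of index I have in mind (purely
hypothetical, not implemented; the index itself would of course have to
travel encrypted):

    # Hypothetical sketch: map real chunk names to random ones and
    # shuffle the order in which chunks are sent.
    import os, random

    def build_obfuscation_index(chunk_names):
        index = {name: os.urandom(16).hex() for name in chunk_names}
        send_order = list(chunk_names)
        random.SystemRandom().shuffle(send_order)  # randomized send order
        return index, send_order

The receiving side would only ever see the random names; the mapping
back to volume addresses stays in the encrypted index.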


As for borg, I'm not sure a heavy emphasis on deduplication is appropriate
for many PC applications. It's a resource drain that leads to complex archive
formats on the back end. And my initial testing suggests the dedup efficacy
is oversold: Sparsebak can sometimes produce smaller multi-generation
archives even without dedup.

Not arguing with this. I think borg could be good enough in our case
with fixed-size chunks, using your way of detecting what has changed.
Deduplication here would be mostly about re-using old chunks (already in
the backup archive) for a new backup - so, the "incremental" part.
I just want to avoid re-inventing a compressed and encrypted archive
format (a mistake we've made before). Borg already has an established
format for that.

Yes - keeping in mind the chunk size I'm currently using is 128kB with fixed boundaries. I've experimented with simple retroactive dedup based on sorting the manifest hashes, and that can save a little space at almost no time/power cost. This could be done at send time to save bandwidth, but that savings may not be worth it. OTOH, if we expect some users to back up related cloned VMs (common with templates), the potential savings become very significant even with this simple method.
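
A sketch of that simple retroactive approach (a hypothetical helper, not
sparsebak's actual code), assuming manifest lines of the form
"<sha256>  <chunk filename>":

    import os

    def dedup_by_manifest(dest_dir):
        with open(os.path.join(dest_dir, "manifest")) as m:
            entries = [line.split() for line in m if line.strip()]
        entries.sort()                     # identical hashes become adjacent
        prev_hash, prev_name = None, None
        for h, name in entries:
            if h == prev_hash:
                # Same content: drop the duplicate and hardlink it to the
                # first occurrence (a manifest reference would also work).
                dup = os.path.join(dest_dir, name)
                os.remove(dup)
                os.link(os.path.join(dest_dir, prev_name), dup)
            else:
                prev_hash, prev_name = h, name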

To be sure, borg gets better dedup with arbitrary data input, but even so that looks to be around 2-4%. Would I work days to add that efficiency to sparsebak's COW awareness? Probably. Months? Probably not. If borg were to be integrated at all, it would need modification to accept named objects (sparsebak chunks) streamed into it, plus some way of indicating an incremental backup so there is namespace integration between successive backup sessions.

I've also done another test that should be a better indicator of relative speed. It uses a new 'qubes-ssh' protocol option I added so that loopdev + fuse layers aren't a factor. With this, sparsebak is consistently faster than borg over local 802.11n wifi for both initial and incremental backups. An assumption here is that adding encryption will not have a large impact -- keeping in mind, too, that sparsebak has no multiprocessing or optimizations as of yet.


Actually this is one of Sparsebak's strong points... very low interactivity
during remote operations.

But as far as I understand, to get the most out of it, you need
hardlink-compatible storage, which for example excludes most cloud
services...

Hardlinks are currently used for housekeeping operations (i.e. merging during pruning), but I didn't follow Apple's example to the extent that each incremental session must look like a whole volume (where they really lean on hardlinks). Instead, manifests are quickly assembled on an as-needed basis to create a meta-index for a complete volume view. So the question is: can session merging use a method without hardlinks? I think it could use 'move' -- and possibly gain encfs as an encryption option in the process.
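
A rough sketch of what a move-based merge might look like (hypothetical,
assuming one directory of chunk files per session; manifest handling
omitted):

    import os

    def merge_sessions(old_dir, new_dir):
        # Fold an older session into the one that supersedes it.
        for name in os.listdir(old_dir):
            src = os.path.join(old_dir, name)
            dst = os.path.join(new_dir, name)
            if os.path.exists(dst):
                os.remove(src)             # newer copy wins
            else:
                os.rename(src, dst)        # carry unsuperseded chunks forward
        os.rmdir(old_dir)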

As for targeting cloud storage, that will take some time on my part as it's not something I normally use. Although it's becoming less necessary, my original concept for the storage end was a Unix environment (GNU + Python currently required) with contemporary mainstream filesystems that can handle and rapidly process large numbers of files, accessed via a qube or a protocol like ssh. Maybe cloud storage APIs could be targeted, but they might not be practical for volumes over a certain size.

--

Chris Laprise, tas...@posteo.net
https://github.com/tasket
https://twitter.com/ttaskett
PGP: BEE2 20C5 356E 764A 73EB  4AB3 1DC4 D106 F07F 1886
