On 12/22/2018 08:48 PM, Marek Marczykowski-Górecki wrote:
On Fri, Dec 21, 2018 at 08:39:43AM -0500, Chris Laprise wrote:
On 12/20/2018 09:40 PM, Marek Marczykowski-Górecki wrote:
Thanks for doing this!
I haven't really looked at the code, but I have more generic comment:
The idea of small, frequent snapshots to collect modified blocks bitmaps
is neat. But I'd really, really like to avoid inventing yet another
backup archive format. The current qubes backup format has its own
limitations, and while I have some ideas[1] for how to plug incremental
backups into it, I don't think there is a future in that. On the other
hand, there are already solutions using very similar approach for
handling incremental backup (basically, do not differentiate between
"full", "incremental" and "differential" backups, but split data set
into chunks and send only those not already present in backup archive).
And those already have established formats, including encryption and
integrity protection. Specifically, I'm looking into two of them:
- duplicity
- BorgBackup
I think it's about time someone in open source created an analog to the Time
Machine sparsebundle format, just because it's so effective and _simple_:
fixed-size chunks of the volume stored as files with filenames representing
addresses, plus manifests with sha256 hashes. There's scarcely anything more
to it than that, and it's simple enough to be processed by shell commands
like find, cp and zcat (see 'spbk-assemble' for a functional example).
This already works on millions of Mac systems where people expect it to
provide hourly backups without noticeably affecting system resources. This
class of format I don't mind creating; I think Apple chose well.
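To make that concrete, the whole format boils down to something like this
(an illustrative Python sketch of the general idea, not sparsebak's or
Apple's actual code; the names and layout are made up):

    # Sparsebundle-style layout sketch: fixed-size chunks named by their
    # byte address, plus a manifest of sha256 hashes. Illustrative only.
    import hashlib, os

    CHUNK_SIZE = 128 * 1024   # 128kB, same chunk size discussed below

    def split_volume(volume_path, archive_dir):
        os.makedirs(archive_dir, exist_ok=True)
        manifest = []
        with open(volume_path, "rb") as vol:
            addr = 0
            while True:
                chunk = vol.read(CHUNK_SIZE)
                if not chunk:
                    break
                name = format(addr, "016x")   # filename encodes the address
                with open(os.path.join(archive_dir, name), "wb") as out:
                    out.write(chunk)
                manifest.append(hashlib.sha256(chunk).hexdigest() + " " + name)
                addr += CHUNK_SIZE
        with open(os.path.join(archive_dir, "manifest"), "w") as m:
            m.write("\n".join(manifest) + "\n")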
Can you point at the documentation of encryption scheme used by
Time Machine backups?
This is the weakest part of my effort so far: scant planning for
encryption (and IANAC). It's the one struggle I see coming with this project.
I recognize the problem as one involving block-based ciphers/modes and
the level of resistance they offer against any spy who can view
successive chunk updates. My understanding of the Time Machine method is
that it's similar if not identical to encryption of a normal disk volume
(or a 'normal' loop dev that happens to be in chunks). If so, I may try
to implement something close to it using python aes, but will still seek
input from a cryptographer, which should be done in any case.
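For discussion's sake, a naive per-chunk scheme could look like the sketch
below (using the pyca 'cryptography' package). This is only to make the
problem concrete, not a vetted design -- nonce handling and key management
are exactly the parts that need a cryptographer's review:

    # Naive per-chunk AEAD sketch (NOT a reviewed design): each chunk is
    # sealed with AES-GCM, and the chunk address is bound in as associated
    # data so chunks can't be silently swapped around.
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt_chunk(key, addr, chunk):
        nonce = os.urandom(12)                  # fresh 96-bit nonce per write
        aad = format(addr, "016x").encode()     # tie ciphertext to its address
        return nonce + AESGCM(key).encrypt(nonce, chunk, aad)

    def decrypt_chunk(key, addr, blob):
        nonce, ct = blob[:12], blob[12:]
        aad = format(addr, "016x").encode()
        return AESGCM(key).decrypt(nonce, ct, aad)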
FWIW, I've considered trying a modified version of the scrypt method
used in qvm-backup. Sparsebak can be used in a tarfile mode, for
instance, which makes this practical but has the side-effect of removing
pruning ability.
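Just to show the key-derivation piece in isolation (Python's hashlib.scrypt;
the n/r/p parameters below are placeholders I picked for the sketch, not
what qvm-backup actually uses):

    import hashlib, os

    def derive_key(passphrase, salt=None):
        # scrypt KDF sketch; cost parameters are illustrative placeholders
        salt = salt or os.urandom(16)
        key = hashlib.scrypt(passphrase.encode(), salt=salt,
                             n=2**14, r=8, p=1, dklen=32)
        return key, salt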
Also note that we'd like to have at least some level of hiding metadata
- like VM names (leaked through file names).
I have an idea for a relatively simple obfuscation layer that could even
re-order the transmission of chunks in addition to concealing filenames.
It would use an additional index in which the chunk names are randomized
and the ordering shuffled. Implementing this, I surmise, could improve the
robustness of the encryption.
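Roughly what I have in mind (a hypothetical sketch -- the index structure
and names here are invented for illustration, not a settled design):

    # Obfuscation-layer sketch: map real chunk names to random ones and
    # shuffle the transmission order; the mapping itself would be stored
    # as a separate, encrypted index on the backup destination.
    import os, random

    def obfuscate(chunk_names):
        mapping = {name: os.urandom(16).hex() for name in chunk_names}
        send_order = list(mapping.values())
        random.SystemRandom().shuffle(send_order)   # conceal address ordering
        return mapping, send_order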
As for borg, I'm not sure a heavy emphasis on deduplication is appropriate
for many PC applications. It's a resource drain that leads to complex archive
formats on the back end. And my initial testing suggests the dedup efficacy
is oversold: Sparsebak can sometimes produce smaller multi-generation
archives even without dedup.
Not arguing with this. I think borg could be good enough in our case
with fixed-size chunks, using your way of detecting what has changed.
Deduplication here would be mostly about re-using old chunks (already in
backup archive) for new backup - so, the "incremental" part.
I just want to avoid re-inventing a compressed and encrypted archive
format (a mistake we've made before). Borg already has an established
format for that.
Yes, keeping in mind the chunk size I'm using currently is 128kB with
fixed boundaries. I've experimented with simple retroactive dedup based
on sorting the manifest hashes and that can save a little space with
almost no time/power cost. This could be done at send time to save
bandwidth, but that savings may not be worth it. OTOH, if we expect some
users to back up related cloned VMs (common with templates), the potential
savings then become very significant even with this simple method.
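The retroactive pass is about as simple as it sounds -- something along
these lines (an illustrative sketch assuming a manifest of (hash, filename)
pairs, not sparsebak's actual code):

    # Retroactive dedup sketch: sort manifest entries by hash and point any
    # duplicate chunk at the first occurrence of the same data.
    def dedup_manifest(entries):
        seen = {}        # hash -> first chunk filename with that content
        dedup_map = {}   # duplicate filename -> canonical filename
        for digest, name in sorted(entries):
            if digest in seen:
                dedup_map[name] = seen[digest]
            else:
                seen[digest] = name
        return dedup_map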
To be sure, borg gets better dedup with arbitrary data input, but even
so that looks to be around 2-4%. Would I work days to add that
efficiency to sparsebak's COW awareness? Probably. Months? Probably not.
If borg were to be integrated at all, it would need modification to
accept named objects (sparsebak chunks) streamed into it and have some
way of indicating an incremental backup so there is namespace integration
between successive backup sessions.
I've also done another test that should be a better indicator of
relative speed. It uses a new 'qubes-ssh' protocol option I added so
that loopdev + fuse layers aren't a factor. With this, sparsebak is
consistently faster than borg over local 802.11n wifi for both initial
and incremental backups. An assumption here is that adding encryption
will not have a large impact -- but also keeping in mind sparsebak has
no multiprocessing or optimizations as of yet.
Actually this is one of Sparsebak's strong points... very low interactivity
during remote operations.
But as far as I understand, to get the most out of it, you need
hardlink-compatible storage, which for example excludes most cloud
services...
Hardlinks are currently used for housekeeping operations (i.e. merge
during pruning), but I didn't follow Apple's example to the extent that
each incremental session must look like a whole volume (where they
really lean on hardlinks). Instead, manifests are quickly assembled on
an as-needed basis to create a meta-index for a complete volume view. So
the question is: can session merging use a method without hardlinks? I
think it could use 'move' -- and possibly gain encfs as an encryption
option in the process.
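The meta-index assembly is essentially this (a simplified sketch of the
idea; the data structures are invented for illustration):

    # Assemble a complete-volume view from per-session manifests: walk the
    # sessions newest-first and keep the newest chunk seen for each address.
    # A hardlink-free merge could then 'move' only the surviving chunks.
    def assemble_view(session_manifests):
        # session_manifests: list of {address: chunk_path}, newest first
        view = {}
        for manifest in session_manifests:
            for addr, path in manifest.items():
                view.setdefault(addr, path)   # newest session wins
        return view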
As for targeting cloud storage, that will take some time on my part as
it's not something I normally use. Although it's becoming less necessary,
my original concept for the storage end was a Unix environment (GNU +
python currently required) with contemporary mainstream filesystems that
can handle and rapidly process large numbers of files. This is accessed
via a qube or a protocol like ssh. Maybe cloud storage APIs could be
targeted, but they might not be practical for volumes over a certain size.
--
Chris Laprise, tas...@posteo.net
https://github.com/tasket
https://twitter.com/ttaskett
PGP: BEE2 20C5 356E 764A 73EB 4AB3 1DC4 D106 F07F 1886