On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
> >I have been thinking a lot about de-duplication for a backup application
> >I am writing. I wrote a little script to figure out how much it would
> >save me. For my laptop home directory, about 100 GiB of data, it was a
> >couple of percent, depending a bit on the size of the chunks. With 4 KiB
> >chunks, I would save about two gigabytes. (That's assuming no MD5 hash
> >collisions.) I don't have VM images, but I do have a fair bit of saved
> >e-mail. So, for backups, I concluded it was worth it to provide an
> >option to do this. I have no opinion on whether it is worthwhile to do
> >in btrfs.
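
The estimate described above could be sketched roughly as follows. This
is not the script mentioned in the quoted text, only a minimal Python
illustration; the function name estimate_dedup_savings and its
parameters are made up for the example. It hashes fixed-size chunks with
MD5 and counts every chunk whose hash was already seen as saveable,
which assumes no hash collisions.

```python
import hashlib
import os


def estimate_dedup_savings(root, chunk_size=4096):
    """Return (total_bytes, duplicate_bytes) for regular files under root.

    Hypothetical helper: a chunk counts as a duplicate when a chunk with
    the same MD5 digest was already seen (assumes no hash collisions).
    """
    seen = set()      # digests of every distinct chunk seen so far
    total = 0         # bytes read in total
    duplicate = 0     # bytes that fixed-size chunk dedup could save
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, FIFOs and other special files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        total += len(chunk)
                        digest = hashlib.md5(chunk).digest()
                        if digest in seen:
                            duplicate += len(chunk)
                        else:
                            seen.add(digest)
            except OSError:
                # Unreadable files are simply left out of the estimate.
                continue
    return total, duplicate
```

Calling estimate_dedup_savings on a directory gives the two byte counts,
and duplicate / total is the fraction that could be saved. The set of
digests grows with the amount of distinct data, so a large tree needs a
fair amount of memory.
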
> 
> Online deduplication is very useful for backups of big,
> multi-gigabyte files which change constantly.
> Some mail servers store mail this way, as do some MUAs; databases
> also commonly pack everything into big files which tend to change
> here and there almost all the time.
> 
> Multi-gigabyte files which have only a few megabytes changed can't
> be hardlinked; simple maths shows that even compressing multiple
> files which have only a few differences will lead to greater space
> usage than a few megabytes extra in each (because everything else
> is deduplicated).
> 
> And I don't even want to think about the IO needed to dedup a
> multi-terabyte storage offline (1 TB disks and bigger are becoming
> standard nowadays), e.g. daily, especially when the storage is
> already heavily used in IO terms.
> 
> 
> Now, one popular tool which can deal with small changes in files is
> rsync. It can be used to copy files over the network - so that if
> you want to copy/update a multi-gigabyte file which only has a few
> changes, rsync would need to transfer just a few megabytes.
> 
> On disk however, rsync creates a "temporary copy" of the original
> file, where it packs unchanged contents together with any changes
> made. For example, while it copies/updates a file, we will have:
> 
> original_file.bin
> .temporary_random_name
> 
> Later, original_file.bin would be removed, and
> .temporary_random_name would be renamed to original_file.bin. At
> that point any deduplication we had so far is lost, and we have to
> start the IO over again.

Sounds like all you need is cp --reflink=always and rsync --inplace.

Haven't tested if that works well, though.
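
To make the suggestion concrete, here is a small sketch of how the two
commands might be combined, written in Python only so that the command
lines can be checked without touching a filesystem. The helper names and
the file names in the usage note are hypothetical, and the ordering
(reflink the current file first, then update it in place) is one reading
of the idea, not a tested recipe. The intent is that the reflink copy
keeps sharing the unchanged blocks with the live file, while rsync
--inplace rewrites the live file directly instead of building a
temporary copy.

```python
import subprocess


def reflink_snapshot_cmd(current, snapshot):
    # cp --reflink=always fails instead of silently doing a full copy
    # when the filesystem cannot share the data blocks.
    return ["cp", "--reflink=always", current, snapshot]


def inplace_update_cmd(source, current):
    # rsync --inplace writes updated data directly into the destination
    # file rather than into a temporary file that is renamed afterwards.
    return ["rsync", "--inplace", source, current]


def snapshot_then_update(source, current, snapshot,
                         run=subprocess.check_call):
    """Keep the old version as a reflink copy, then update in place.

    Hypothetical helper; `run` is injectable so the sequence can be
    inspected without executing anything.
    """
    for cmd in (reflink_snapshot_cmd(current, snapshot),
                inplace_update_cmd(source, current)):
        run(cmd)
```

For example, snapshot_then_update("host:/data/big.bin", "big.bin",
"big.bin.prev") would run cp and then rsync with those (made-up) names.
Note that with --inplace an interrupted transfer leaves the destination
file partially updated.
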

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
