On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
> > I have been thinking a lot about de-duplication for a backup application
> > I am writing. I wrote a little script to figure out how much it would
> > save me. For my laptop home directory, about 100 GiB of data, it was a
> > couple of percent, depending a bit on the size of the chunks. With 4 KiB
> > chunks, I would save about two gigabytes. (That's assuming no MD5 hash
> > collisions.) I don't have VM images, but I do have a fair bit of saved
> > e-mail. So, for backups, I concluded it was worth it to provide an
> > option to do this. I have no opinion on whether it is worthwhile to do
> > in btrfs.
>
> Online deduplication is very useful for backups of big, multi-gigabyte
> files which change constantly. Some mail servers store files this way;
> some MUAs store files like this; databases also commonly pack everything
> into big files which tend to change here and there almost all the time.
>
> Multi-gigabyte files with only a few megabytes changed can't be
> hardlinked; simple maths shows that even compressing multiple files
> which have few differences will lead to greater space usage than a few
> megabytes extra in each (because everything else is deduplicated).
>
> And I don't even want to think about the IO needed to offline-dedup
> multi-terabyte storage (1 TB disks and bigger are becoming standard
> nowadays), e.g. daily, especially when the storage is already heavily
> used in IO terms.
>
> Now, one popular tool which can deal with small changes in files is
> rsync. It can be used to copy files over the network, so that if you
> want to copy/update a multi-gigabyte file which only has a few changes,
> rsync would need to transfer just a few megabytes.
>
> On disk, however, rsync creates a "temporary copy" of the original file,
> where it packs unchanged contents together with any changes made.
> For example, while it copies/updates a file, we will have:
>
>     original_file.bin
>     .temporary_random_name
>
> Later, original_file.bin would be removed, and .temporary_random_name
> would be renamed to original_file.bin. There goes any deduplication we
> had so far; we have to start the IO over again.
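The write-temp-then-rename sequence quoted above is exactly what an in-place update avoids. A minimal sketch of the in-place idea (a hypothetical helper, not rsync's actual implementation): compare the destination file to the new content chunk by chunk and rewrite only the chunks that differ, so the untouched extents on disk stay shared with any deduplicated or reflinked copies.

```python
import os

CHUNK = 4096  # 4 KiB, a typical dedup/extent granularity


def update_in_place(dst_path, new_data):
    """Overwrite dst_path with new_data, rewriting only the changed chunks.

    Unchanged chunks are never written, so extents shared via reflinks or
    deduplication survive the update. A write-temp-then-rename update, by
    contrast, rewrites every byte into a brand-new file.
    Returns the number of chunks actually rewritten.
    """
    rewritten = 0
    with open(dst_path, "r+b") as dst:
        offset = 0
        while offset < len(new_data):
            new_chunk = new_data[offset:offset + CHUNK]
            dst.seek(offset)
            old_chunk = dst.read(len(new_chunk))
            if old_chunk != new_chunk:
                dst.seek(offset)
                dst.write(new_chunk)
                rewritten += 1
            offset += CHUNK
        dst.truncate(len(new_data))  # handle the file shrinking
    return rewritten
```

With a 3 GiB file where one 4 KiB chunk changed, this touches a single chunk on disk; the rename-based update would touch all three gigabytes.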
Sounds like all you need is cp --reflink=always and rsync --inplace. I
haven't tested whether that works well, though.

Mike
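For reference, the savings estimate described at the top of the thread ("a little script to figure out how much it would save me") can be sketched roughly like this. This is an assumed reconstruction, not the poster's actual script: hash every fixed-size chunk under a directory and count how many bytes belong to chunks already seen.

```python
import hashlib
import os


def dedup_savings(root, chunk_size=4096):
    """Return (total_bytes, duplicate_bytes) for chunk-level dedup under root.

    duplicate_bytes is the space a naive chunk-based deduplicator would
    save, assuming no MD5 collisions (as the original poster did).
    """
    seen = set()
    total = duplicates = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                f = open(path, "rb")
            except OSError:
                continue  # unreadable file: skip it
            with f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
                    digest = hashlib.md5(chunk).digest()
                    if digest in seen:
                        duplicates += len(chunk)  # chunk seen before: dedupable
                    else:
                        seen.add(digest)
    return total, duplicates
```

Note this counts duplicates both across and within files, and smaller chunk sizes generally find more duplication at the cost of more hash-table entries.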