On Thursday 06 of January 2011 10:51:04 Mike Hommey wrote:
> On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
> > > I have been thinking a lot about de-duplication for a backup
> > > application I am writing. I wrote a little script to figure out how
> > > much it would save me. For my laptop home directory, about 100 GiB
> > > of data, it was a couple of percent, depending a bit on the size of
> > > the chunks. With 4 KiB chunks, I would save about two gigabytes.
> > > (That's assuming no MD5 hash collisions.) I don't have VM images,
> > > but I do have a fair bit of saved e-mail. So, for backups, I
> > > concluded it was worth it to provide an option to do this. I have
> > > no opinion on whether it is worthwhile to do in btrfs.
> >
> > Online deduplication is very useful for backups of big,
> > multi-gigabyte files which change constantly. Some mail servers
> > store files this way; some MUAs store files like this; databases
> > also commonly pack everything into big files which tend to change
> > here and there almost all the time.
> >
> > Multi-gigabyte files which have only a few megabytes changed can't
> > be hardlinked; simple maths shows that even compressing multiple
> > files which have few differences will lead to greater space usage
> > than a few extra megabytes in each (because everything else is
> > deduplicated).
> >
> > And I don't even want to think about the IO needed to offline-dedup
> > multi-terabyte storage (1 TB disks and bigger are becoming standard
> > nowadays), e.g. daily, especially when the storage is already
> > heavily used in IO terms.
> >
> > Now, one popular tool which can deal with small changes in files is
> > rsync. It can be used to copy files over the network - so that if
> > you want to copy/update a multi-gigabyte file which has only a few
> > changes, rsync would need to transfer just a few megabytes.
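[The "little script" idea from the first message above could look roughly
like this - a hypothetical sketch, not the author's actual script: split a
file into fixed 4 KiB chunks, MD5 each chunk, and count duplicates.
Requires GNU split for --filter; the output format is made up.]

```shell
#!/bin/sh
# Hypothetical sketch: estimate fixed-chunk dedup savings for one file.
# Usage: dedup-estimate.sh FILE
f="$1"
chunk=4096
# Hash every 4 KiB chunk; md5sum reads each chunk on stdin via --filter.
hashes=$(split -b "$chunk" --filter='md5sum' "$f" | awk '{print $1}')
total=$(printf '%s\n' "$hashes" | wc -l)
unique=$(printf '%s\n' "$hashes" | sort -u | wc -l)
# Every duplicate chunk is one chunk of storage saved.
echo "$total chunks, $unique unique, ~$(( (total - unique) * chunk )) bytes saved"
```

Extending this to a whole home directory is just a `find ... -type f`
loop feeding the same per-chunk hashing into one combined `sort -u`.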
> >
> > On disk, however, rsync creates a "temporary copy" of the original
> > file, into which it packs the unchanged contents together with any
> > changes made. For example, while it copies/updates a file, we will
> > have:
> >
> > original_file.bin
> > .temporary_random_name
> >
> > Later, original_file.bin would be removed, and .temporary_random_name
> > would be renamed to original_file.bin. There goes any deduplication we
> > had so far; we have to start the IO over again.
>
> Sounds like all you need is cp --reflink=always and rsync --inplace.
>
> Haven't tested whether that works well, though.
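[Mike's suggestion might be sketched as follows - hypothetical paths and
host, untested as he says. The reflink copy is a cheap CoW clone that
shares all extents with the original; the in-place rsync then rewrites
only the changed blocks, so unchanged extents stay shared.]

```shell
# CoW clone of the previous backup; shares every extent on btrfs:
cp --reflink=always /backup/prev/original_file.bin \
                    /backup/work/original_file.bin

# Update it in place; --no-whole-file forces rsync's delta algorithm
# even for local copies, so only changed blocks are written:
rsync --inplace --no-whole-file server:/data/original_file.bin \
      /backup/work/original_file.bin
```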
It works very well: btrfs with snapshots, compression, and rsync
--inplace has better storage utilisation than lessfs at around 10-15
snapshots with around 600 GB of test data in small files.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html