On Thursday 06 of January 2011 10:51:04 Mike Hommey wrote:
> On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
> > >I have been thinking a lot about de-duplication for a backup application
> > >I am writing. I wrote a little script to figure out how much it would
> > >save me. For my laptop home directory, about 100 GiB of data, it was a
> > >couple of percent, depending a bit on the size of the chunks. With 4 KiB
> > >chunks, I would save about two gigabytes. (That's assuming no MD5 hash
> > >collisions.) I don't have VM images, but I do have a fair bit of saved
> > >e-mail. So, for backups, I concluded it was worth it to provide an
> > >option to do this. I have no opinion on whether it is worthwhile to do
> > >in btrfs.
> > 
> > Online deduplication is very useful for backups of big,
> > multi-gigabyte files which change constantly.
> > Some mail servers store their mail in such files, as do some MUAs;
> > databases also commonly pack everything into big files which tend
> > to change here and there almost all the time.
> > 
> > Multi-gigabyte files with only a few megabytes changed can't be
> > hardlinked; and simple maths shows that even compressing multiple
> > nearly identical files will use more space than deduplication,
> > which costs only a few extra megabytes per copy (because
> > everything else is shared).
> > 
> > And I don't even want to think about the IO needed to offline-dedup
> > multi-terabyte storage (1 TB disks and bigger are becoming standard
> > nowadays), e.g. daily, especially when the storage is already under
> > heavy IO load.
> > 
> > 
> > Now, one popular tool which can deal with small changes in files is
> > rsync. It can be used to copy files over the network - so that if
> > you want to copy/update a multi-gigabyte file which only has a few
> > changes, rsync would need to transfer just a few megabytes.
> > 
> > On disk however, rsync creates a "temporary copy" of the original
> > file, where it packs unchanged contents together with any changes
> > made. For example, while it copies/updates a file, we will have:
> > 
> > original_file.bin
> > .temporary_random_name
> > 
> > Later, original_file.bin is removed, and
> > .temporary_random_name is renamed to original_file.bin. This
> > throws away any extent sharing we had so far, and the
> > deduplication IO has to start over.
> 
> Sounds like all you need is cp --reflink=always and rsync --inplace.
> 
> Haven't tested if that works well, though.

It works very well: btrfs with snapshots, compression and rsync --inplace has 
better storage utilisation than lessfs at around 10-15 snapshots of around 
600 GB of test data in small files.
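For reference, a minimal sketch of that approach (paths and snapshot naming are examples, not anything from the thread; assumes /backup is a btrfs subvolume):

```shell
#!/bin/sh
# Sketch: snapshot-based backups that preserve extent sharing.
# Assumes /backup/current is a btrfs subvolume holding the latest backup.

# 1. Take a cheap, read-only snapshot of the previous backup; unchanged
#    extents remain shared between the snapshot and /backup/current.
btrfs subvolume snapshot -r /backup/current "/backup/$(date +%F)"

# 2. Update the current backup in place. --inplace makes rsync write
#    into the existing file instead of creating a temporary copy and
#    renaming it, so extents that didn't change stay shared with the
#    snapshots; --no-whole-file forces the delta algorithm even for
#    local copies.
rsync -a --inplace --no-whole-file /source/ /backup/current/
```

Without --inplace, rsync's temporary-file-plus-rename behaviour described above would allocate entirely new extents and break sharing with every snapshot.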

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl