On Fri, May 13, 2016 at 08:14:10AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-12 16:54, Mark Fasheh wrote:
> >Now ask yourself the question - would you accept a write cache which is
> >expensive to fill and would only have a hit rate of less than 5%?
> In-band deduplication is a feature that is not used by typical desktop users
> or even many developers because it's computationally expensive, but it's
> used _all the time_ by big data-centers and similar places where processor
> time is cheap and storage efficiency is paramount. Deduplication is more
> useful in general the more data you have.  5% of 1 TB is 20 GB, which is not
> much.  5% of 1 PB is 20 TB, which is at least 3-5 disks, which can then be
> used for storing more data, or providing better resiliency against failures.

There's also a big cost difference between a 5-drive and 6-drive array
when all your servers are built with 5-drive cages.  Delay that expansion
for six months, and you can buy 5 larger drives for the same price.

I have laptops with filesystems that are _just_ over 512GB before dedup.
SSDs seem to come in 512GB and 1TB sizes with nothing in between, so
saving even a few dozen MB can translate into hundreds of dollars per
user (not to mention the bureaucratic hardware provisioning side-effects
which easily triple that cost).

Working on big software integration projects can generate 60% duplication
rates.  Think checkout of entire OS + disk images + embedded media,
build + install trees with copied data, long build times so there are
multiple working directories, multiple active branches of the same
project, and oh yeah it's all in SVN, which stores a complete second
copy of everything on disk just because wasting space is cool.

> To look at it another way, deduplicating an individual's home directory will
> almost never get you decent space savings; the majority of the shared data is
> usually file headers and nothing more, which can't be deduplicated
> efficiently because of block-size requirements. Deduplicating all the home
> directories on a terminal server with 500 users usually will get you decent
> space savings, as there very likely are a number of files that multiple
> people have exact copies of, but most of them are probably not big files.
> Deduplicating the entirety of a multi-petabyte file server used for storing
> VM disk images will probably save you a very significant amount of space,
> because the probability of having data that can be deduplicated goes up as
> you store more data, and there is likely to be a lot of data shared between
> the disk images.
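
The block-size point is easy to demonstrate.  Here's a minimal sketch
(hypothetical fixed 4 KiB blocks, nothing to do with the actual btrfs
dedup code) showing why a shared header smaller than a block produces
no whole-block matches:

```python
# Fixed-block dedup sketch: hash every 4 KiB block and count how many
# blocks repeat a hash already seen.  A 64-byte header shared between
# two otherwise-different files never fills a whole block, so it
# contributes zero deduplicatable blocks.
import hashlib
import os

BLOCK = 4096

def block_hashes(data):
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def duplicate_blocks(files):
    seen, dups = set(), 0
    for data in files:
        for h in block_hashes(data):
            if h in seen:
                dups += 1
            else:
                seen.add(h)
    return dups

header = b"shared file header".ljust(64, b"\0")
a = header + os.urandom(1 << 20)   # same header, different 1 MiB bodies
b = header + os.urandom(1 << 20)
print(duplicate_blocks([a, b]))    # 0: no whole 4 KiB block is shared
print(duplicate_blocks([a, a]))    # 257: identical files dedup fully
```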

Mail stores benefit from this too.  For some reason, Microsoft Office
users seem to enjoy emailing multi-megabyte runs of MIME-encoded zeros
to each other.  ;)
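Incidentally, that's also why those attachments dedup so well: base64 of
a run of zero bytes is just a run of 'A' characters, so (assuming fixed
4 KiB blocks, same sketch as above) every block of the attachment body is
byte-for-byte identical and collapses to a single physical block:

```python
# MIME-encoding 3 MiB of zeros yields 4 MiB of the letter 'A'; every
# 4 KiB block of that body hashes identically, so block dedup keeps one.
import base64
import hashlib

BLOCK = 4096
attachment = base64.b64encode(b"\0" * (3 << 20))
blocks = {hashlib.sha256(attachment[i:i + BLOCK]).digest()
          for i in range(0, len(attachment), BLOCK)}
print(len(attachment) // BLOCK, "blocks,", len(blocks), "unique")
# 1024 blocks, 1 unique
```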
