On Fri, May 13, 2016 at 08:14:10AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-12 16:54, Mark Fasheh wrote:
> > Now ask yourself the question - would you accept a write cache which
> > is expensive to fill and would only have a hit rate of less than 5%?
> In-band deduplication is a feature that is not used by typical desktop
> users or even many developers because it's computationally expensive,
> but it's used _all the time_ by big data-centers and similar places
> where processor time is cheap and storage efficiency is paramount.
> Deduplication is more useful in general the more data you have. 5% of
> 1 TB is 50 GB, which is not much. 5% of 1 PB is 50 TB, which is at
> least half a dozen disks, which can then be used for storing more
> data, or for providing better resiliency against failures.
There's also a big cost difference between a 5-drive and a 6-drive array
when all your servers are built with 5-drive cages. Delay that expansion
for six months, and you can buy 5 larger drives for the same price.

I have laptops with filesystems that are _just_ over 512GB before dedup.
SSDs seem to come in 512GB and 1TB sizes with nothing in between, so
saving even a few dozen MB can translate into hundreds of dollars per
user (not to mention the bureaucratic hardware-provisioning side
effects, which easily triple that cost).

Working on big software integration projects can produce duplication
rates of 60%. Think a checkout of an entire OS, plus disk images, plus
embedded media; build and install trees full of copied data; build
times long enough that there are multiple working directories; multiple
active branches of the same project; and, oh yeah, it's all in SVN,
which stores a complete second copy of everything on disk just because
wasting space is cool.

> To look at it another way, deduplicating an individual's home
> directory will almost never get you decent space savings; the
> majority of shared data is usually file headers and nothing more,
> which can't be deduplicated efficiently because of block-size
> requirements. Deduplicating all the home directories on a terminal
> server with 500 users usually will get you decent space savings, as
> there are very likely a number of files that multiple people have
> exact copies of, but most of them are probably not big files.
> Deduplicating the entirety of a multi-petabyte file server used for
> storing VM disk images will probably save you a very significant
> amount of space, because the probability of having data that can be
> deduplicated goes up as you store more data, and there is likely to
> be a lot of data shared between the disk images.

Mail stores benefit from this too. For some reason, Microsoft Office
users seem to enjoy emailing multi-megabyte runs of MIME-encoded zeros
to each other. ;)
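The block-size point can be made concrete. Below is a minimal Python
sketch (not how btrfs actually implements dedup; `duplicate_bytes` and
the 4 KiB block size are illustrative assumptions) showing why only
whole, aligned, identical blocks are reclaimable, so a small shared
file header saves nothing while long runs of zeros dedup very well:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block granularity for this sketch


def duplicate_bytes(data: bytes, block_size: int = BLOCK_SIZE) -> int:
    """Count bytes a block-level deduplicator could reclaim.

    Only whole aligned blocks with identical content count; shared
    content smaller than one block, or not block-aligned, saves nothing.
    """
    seen = set()
    saved = 0
    for off in range(0, len(data) - block_size + 1, block_size):
        digest = hashlib.sha256(data[off:off + block_size]).digest()
        if digest in seen:
            saved += block_size  # duplicate block: reclaimable
        else:
            seen.add(digest)     # first occurrence must stay on disk
    return saved


# Eight blocks of zeros (think MIME-encoded zero runs in mail stores):
# the first block is kept, the other seven are reclaimable.
zeros = b"\x00" * (8 * BLOCK_SIZE)
print(duplicate_bytes(zeros))  # 28672, i.e. 7 * 4096

# Two one-block "files" sharing only a 96-byte header: the blocks
# differ as a whole, so nothing is reclaimable.
hdr = b"\x89HDRDATA" * 12
f1 = hdr + b"A" * 4000
f2 = hdr + b"B" * 4000
print(duplicate_bytes(f1 + f2))  # 0
```

This is the asymmetry in the quoted paragraph: home directories share
mostly sub-block headers (the second case), while VM images and mail
stores share whole runs of identical blocks (the first).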