Re: Data De-duplication

2008-12-14 Thread Omen Wild
Quoting Oliver Mattos on Thu, Dec 11 00:18: > > > It would be interesting to see how many duplicate *blocks* there are > > across the filesystem, agnostic to files... Here is my contribution. It's a perl script that goes through every block (of various block sizes) on a device and prints out sum…
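The perl script itself is not reproduced in this archive; as a rough illustration of the idea it describes (hash every fixed-size block on a device or file and count collisions), a minimal Python sketch might look like the following. The function name and block size are illustrative, not from the original script.

```python
import hashlib
from collections import Counter

def duplicate_block_stats(path, block_size=4096):
    """Hash every fixed-size block in a file (or raw device node) and
    report how many blocks duplicate an earlier block's contents."""
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            counts[hashlib.sha256(block).hexdigest()] += 1
    total = sum(counts.values())
    # every block beyond the first copy of each distinct hash is a duplicate
    duplicates = total - len(counts)
    return total, duplicates
```

Run against a block device (e.g. `/dev/sdb1`) this gives the same kind of filesystem-agnostic duplicate count the thread is after, at the cost of one full read of the device per block size tried.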

Re: Data De-duplication

2008-12-14 Thread Chris Samuel
On Thu, 11 Dec 2008 8:19:03 am Ray Van Dolson wrote: > I'm not sure why this hasn't caught on, but as soon as a solid and fast > implementation of it exists in the Linux world I really think it can > catch on for VM datastores I know we've hollered at Sun as to why > they haven't rolled it out…

Re: Data De-duplication

2008-12-11 Thread Chris Mason
On Wed, 2008-12-10 at 17:53 +, Oliver Mattos wrote: > > > 2) Keep a tree of checksums for data blocks, so that a bit of data can > > > be located by its checksum. Whenever a data block is about to be > > > written check if the block matches any known block, and if it does then > > > don't bo…
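The checksum-tree scheme quoted here can be modeled compactly: index physical blocks by checksum, and turn a write whose checksum matches an existing block into a reference instead of a second copy. The toy class below is a sketch of that scheme, not btrfs code; note the byte-for-byte comparison before sharing, since a checksum match alone is not proof of identical contents (the verification concern raised elsewhere in this thread).

```python
import hashlib

class DedupStore:
    """Toy model of dedup-on-write: physical blocks are indexed by
    checksum, and a write whose checksum AND contents match an
    existing block is stored as a reference, not a second copy."""

    def __init__(self):
        self.blocks = []     # physical block storage
        self.by_digest = {}  # checksum -> physical block index
        self.refs = []       # logical block -> physical block index

    def write(self, data):
        d = hashlib.sha256(data).digest()
        idx = self.by_digest.get(d)
        # verify contents too: a hash collision must not corrupt data
        if idx is not None and self.blocks[idx] == data:
            self.refs.append(idx)
        else:
            self.blocks.append(data)
            self.by_digest[d] = len(self.blocks) - 1
            self.refs.append(len(self.blocks) - 1)
        return len(self.refs) - 1

    def read(self, logical):
        return self.blocks[self.refs[logical]]
```

In a real filesystem the `by_digest` map would be a persistent on-disk tree and the verify step is exactly the expensive read the thread worries about on every write.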

Re: Data De-duplication

2008-12-11 Thread Oliver Mattos
> Neat. Thanks much. It'd be cool to output the results of each of your > hashes to a database so you can get a feel for how many duplicate > blocks there are cross-files as well. > > I'd like to run this in a similar setup on all my VMware VMDK files and > get an idea of how much space savings…

Re: Data De-duplication

2008-12-10 Thread Ray Van Dolson
On Thu, Dec 11, 2008 at 03:42:58AM +, Oliver Mattos wrote: > Here is a script to locate duplicate data WITHIN files: > > On some test file sets of binary data with no duplicated files, about 3% > of the data blocks were duplicated, and about 0.1% of the data blocks > were nulls. The data was…

Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
Here is a script to locate duplicate data WITHIN files: On some test file sets of binary data with no duplicated files, about 3% of the data blocks were duplicated, and about 0.1% of the data blocks were nulls. The data was mainly elf and win32 binaries plus some random game data, office document…
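The script producing these 3% / 0.1% figures is not preserved in the archive, but the measurement it describes — the fraction of a file's blocks that duplicate an earlier block, and the fraction that are all zeroes — can be sketched in a few lines of Python (function name and 4 KiB block size assumed, not from the original):

```python
import hashlib
from collections import Counter

def block_report(path, block_size=4096):
    """Return (duplicated_fraction, null_fraction) for the fixed-size
    blocks of a single file: the share of blocks repeating an earlier
    block, and the share that are entirely zero bytes."""
    counts = Counter()
    total = nulls = 0
    zero = b"\x00" * block_size
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            total += 1
            if chunk == zero[:len(chunk)]:  # short final block handled
                nulls += 1
            counts[hashlib.sha256(chunk).hexdigest()] += 1
    if total == 0:
        return 0.0, 0.0
    return (total - len(counts)) / total, nulls / total
```

Summing these per-file numbers over a test set gives figures directly comparable to the percentages quoted above.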

Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
> It would be interesting to see how many duplicate *blocks* there are > across the filesystem, agnostic to files... > > Is this something your script does Oliver? My script doesn't yet exist, although when created it would, yes. I was thinking of just making a BASH script and using dd to extrac…

Re: Data De-duplication

2008-12-10 Thread Ray Van Dolson
On Wed, Dec 10, 2008 at 01:57:54PM -0800, Tracy Reed wrote: > On Wed, Dec 10, 2008 at 09:42:16PM +, Oliver Mattos spake thusly: > > I'm considering writing that script to test on my ext3 disk just to see > > how much duplicate wasted data I really have. > > Check out the fdupes command. In Fed…

Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
On Wed, 2008-12-10 at 13:57 -0800, Tracy Reed wrote: > On Wed, Dec 10, 2008 at 09:42:16PM +, Oliver Mattos spake thusly: > > I'm considering writing that script to test on my ext3 disk just to see > > how much duplicate wasted data I really have. > > Check out the fdupes command. In Fedora 8 i…

Re: Data De-duplication

2008-12-10 Thread Tracy Reed
On Wed, Dec 10, 2008 at 09:42:16PM +, Oliver Mattos spake thusly: > I'm considering writing that script to test on my ext3 disk just to see > how much duplicate wasted data I really have. Check out the fdupes command. In Fedora 8 it is in the yum repo as fdupes-1.40-10.fc8 -- Tracy Reed http…
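fdupes finds whole-file duplicates (it buckets by size, then by hash, then byte-compares), which is the file-level counterpart of the block-level scan discussed above. A hedged Python sketch of that strategy, for readers without fdupes to hand (all names here are illustrative, and the byte-compare step is folded into the full-content hash):

```python
import hashlib
import os

def find_duplicate_files(root):
    """Group files under `root` by content, fdupes-style: bucket by
    size first (cheap), then hash full contents within each bucket."""
    by_size = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size.setdefault(os.path.getsize(path), []).append(path)
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = {}
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            by_hash.setdefault(h.hexdigest(), []).append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Summing `size * (len(group) - 1)` over the returned groups estimates the "duplicate wasted data" being asked about in this subthread.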

Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
I see quite a few uses for this, and while it looks like the kernel mode automatic de-dup-on-write code might be performance costly, require disk format changes, and be controversial, it sounds like the user mode utility could be implemented today. It looks like a simple script could do the job…

Re: Data De-duplication

2008-12-10 Thread Ray Van Dolson
I lost the original post so I'm jumping in at the wrong thread-point :) Someone mentioned that the primary usage of de-dup is in the backup realm. True perhaps currently, but de-dup IMO is *the* killer app in the world of virtualization and is a huge reason why we're picking NetApp at work to back…

Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
On Wed, 2008-12-10 at 13:07 -0700, Anthony Roberts wrote: > > When a direct read > > comparison is required before sharing blocks, it is probably best done > > by a stand-alone utility, since we don't want wait for a read of a full > > extent every time we want to write on. > > Can a stand-alo…

Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
> > 2) Keep a tree of checksums for data blocks, so that a bit of data can > > be located by its checksum. Whenever a data block is about to be > > written check if the block matches any known block, and if it does then > > don't bother duplicating the data on disk. I suspect this option may…

Re: Data De-duplication

2008-12-10 Thread seth huang
De-duplication is useful in data backup systems because of the high level of data redundancy, but I'm not sure whether it is necessary for a general-purpose fs. If you really want to do so, I would suggest the latter option. File-level de-dup can be implemented in a user-level application. As I go…

Re: Data De-duplication

2008-12-10 Thread Chris Mason
On Tue, 2008-12-09 at 22:48 +, Oliver Mattos wrote: > Hi, > > Say I download a large file from the net to /mnt/a.iso. I then download > the same file again to /mnt/b.iso. These files now have the same > content, but are stored twice since the copies weren't made with the bcp > utility…

Re: Data De-duplication

2008-12-10 Thread Miguel Figueiredo Mascarenhas Sousa Filipe
Hi all, On Tue, Dec 9, 2008 at 10:48 PM, Oliver Mattos <[EMAIL PROTECTED]> wrote: > Hi, > > Say I download a large file from the net to /mnt/a.iso. I then download > the same file again to /mnt/b.iso. These files now have the same > content, but are stored twice since the copies weren't made wit…

Data De-duplication

2008-12-09 Thread Oliver Mattos
Hi, Say I download a large file from the net to /mnt/a.iso. I then download the same file again to /mnt/b.iso. These files now have the same content, but are stored twice since the copies weren't made with the bcp utility. The same occurs if a directory tree with duplicate files (created with b…