Re: Data Deduplication with the help of an online filesystem check

2009-06-04 Thread Thomas Glanzmann
Chris, > It is a counter and a back reference. With Yan Zheng's new format > work, the limit is not 2^64. That means that there is one back reference for every use of the block? Where is this back reference stored? (I'm asking because if one back reference for every copy is stored, it can obviou

Re: Data Deduplication with the help of an online filesystem check

2009-06-04 Thread Thomas Glanzmann
Hello Chris, > > My question is now, how often can a block in btrfs be referenced? > The exact answer depends on if we are referencing it from a single > file or from multiple files. But either way it is roughly 2^32. could you please explain to me what underlying data structure is used to mon

Re: Data Deduplication with the help of an online filesystem check

2009-05-24 Thread Thomas Glanzmann
Hello Heinz, > Hi, during the last half year I thought a little bit about doing dedup > for my backup program: not only with fixed blocks (which is > implemented), but with moving blocks (with all offsets in a file: 1 > byte, 2 byte, ...). That means, I have to have *lots* of comparisons > (size

Re: Data Deduplication with the help of an online filesystem check

2009-05-05 Thread Thomas Glanzmann
Hello Jan, * Jan-Frode Myklebust [090504 20:20]: > "thin or shallow clones" sounds more like sparse images. I believe > "linked clones" is the word for running multiple virtual machines off > a single gold image. Ref, the "VMware View Composer" section of: not exactly. VMware has one golden imag

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Ric, > I would not categorize it as offline, but just not as inband (i.e., you can > run a low priority background process to handle dedup). > Offline windows are extremely rare in production sites these days and > it could take a very long time to do dedup at the block level over a > large file

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Hello Andrey, > As far as I understand, VMware already ships this "gold image" feature > (as they call it) for Windows environments and claims it to be very > efficient. they call it "thin or shallow clones" and ship it with desktop virtualization (one vm per thin-client user) and for VMware lab

Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Hello Ric, > (1) Block level or file level dedup? what is the difference between the two? > (2) Inband dedup (during a write) or background dedup? I think inband dedup is way too intensive on resources (memory) and also would kill every performance benchmark. So I think the offline dedup is the

Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris, > Your database should know, and the ioctl could check to see if the > source and destination already point to the same thing before doing > anything expensive. I see. > > So, if I only have file, offset, len and not the block number, is there > > a way from userland to tell if two

Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris, > But, in your ioctls you want to deal with [file, offset, len], not > directly with block numbers. COW means that blocks can move around > without you knowing, and some of the btrfs internals will COW files in > order to relocate storage. > So, what you want is a dedup file (or fil
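The point Chris makes above -- track [file, offset, len] tuples rather than raw block numbers, since COW relocates blocks without userland noticing -- suggests a userland dedup index shaped roughly like the sketch below. All names here are illustrative, not btrfs API; the checksum is only a hint and any merge would still need a bytewise verify.

```python
# Minimal sketch of a userland dedup index keyed by checksum, storing
# (path, offset, length) extents instead of unstable block numbers.
# Illustrative only -- not an actual btrfs interface.
from collections import defaultdict

class DedupIndex:
    def __init__(self):
        # checksum -> list of (path, offset, length) extents
        self._extents = defaultdict(list)

    def add(self, checksum, path, offset, length):
        self._extents[checksum].append((path, offset, length))

    def candidates(self):
        """Yield groups of extents sharing a checksum: dedup candidates
        that a clone-style ioctl could later collapse (after a bytewise
        verify, since the checksum is only a hint)."""
        for group in self._extents.values():
            if len(group) > 1:
                yield group
```

A dedup pass would fill the index during a scan, then walk `candidates()` and hand each confirmed group to the clone ioctl discussed elsewhere in the thread.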

Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris, > You can start with the code documentation section on > http://btrfs.wiki.kernel.org I read through this and at the moment one question comes to mind: http://btrfs.wiki.kernel.org/images-btrfs/7/72/Chunks-overview.png Looking at this picture, when I'm going to implement the ded

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris, > They are, but only the crc32c are stored today. maybe crc32c is good enough to identify duplicated blocks, I mean we only need a hint, the dedup ioctl does the double checking. Tomorrow I will write a perl script and compare the results to the one that uses md5 and report back. >

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, > > - Implement a system call that reports all checksums and unique > > block identifiers for all stored blocks. > This would require storing the larger checksums in the filesystem. It > is much better done in the dedup program. I think I misunderstood something here. I

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Heinz, > I wrote a backup tool which uses dedup, so I know a little bit about > the problem and the performance impact if the checksums are not in > memory (optionally in that tool). > http://savannah.gnu.org/projects/storebackup > Dedup really helps a lot - I think more than I could imagin

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, * Thomas Glanzmann [090428 22:10]: > exactly. And if there is a way to retrieve the already calculated > checksums from kernel land, then it would be possible to implement a > "system call" that gives the kernel a hint of a possible duplicated > block (like provid

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, > Not today. The sage developers sent a patch to make an ioctl for this, > but since it was hard coded to crc32c I haven't taken it yet. could you send me the patch, I would love to make it work for arbitrary checksums and resubmit. Thomas -- To unsubscribe from this list: send

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Heinz, > It's not only cpu time, it's also memory. You need 32 bytes for each 4k > block. It needs to be in RAM for performance reasons. exactly and that is not going to scale. Thomas
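The scaling concern is easy to quantify with the figures stated above (32 bytes of checksum per 4 KiB block, all resident in RAM):

```python
# Back-of-the-envelope RAM cost of keeping one 32-byte checksum per
# 4 KiB block in memory, as discussed above.
def checksum_ram_bytes(fs_bytes, block_size=4096, checksum_size=32):
    return (fs_bytes // block_size) * checksum_size

TIB = 1024 ** 4
# One tebibyte of data needs 8 GiB of checksum RAM at this ratio:
print(checksum_ram_bytes(TIB) / 1024 ** 3)   # 8.0
```

At 1/128 of the data size, the checksum table for even a modest filesystem quickly outgrows RAM, which is the point being made.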

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris, > Yes, but for the purposes of dedup, it's not exactly what you want. > You want an index by checksum, and the current btrfs code indexes by > logical byte number in the disk. that would be good for online dedup, but in practice that is not going to work or I don't see how. > So you

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Michael, > I'd start with a crc32 and/or MD5 to find candidate blocks, then do a > bytewise comparison before actually merging them. Even the risk of an > accidental collision is too high, and considering there are plenty of > birthday-style MD5 attacks it would not be extraordinarily dif
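The safety rule Michael describes -- a checksum match only nominates candidate blocks, and a bytewise comparison must confirm them before any merge -- can be sketched as follows. The "merge" here is just returning the confirmed pairs; in a real tool it would be the clone/dedup ioctl discussed in this thread.

```python
# Checksum match nominates candidates; bytewise comparison confirms them
# before they are treated as duplicates. Sketch only.
import hashlib
from collections import defaultdict

def confirmed_duplicates(blocks):
    """blocks: list of bytes objects. Returns (i, j) index pairs whose
    contents are identical, verified byte-for-byte, never on the hash
    alone."""
    candidates = defaultdict(list)
    for i, block in enumerate(blocks):
        candidates[hashlib.md5(block).digest()].append(i)
    pairs = []
    for indices in candidates.values():
        first = indices[0]
        for j in indices[1:]:
            if blocks[first] == blocks[j]:   # bytewise confirmation
                pairs.append((first, j))
    return pairs
```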

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris, > Right now the blocksize can only be the same as the page size. For > this external dedup program you have in mind, you could use any > multiple of the page size. perfect. Exactly what I need. > Three days is probably not quite enough ;) I'd honestly prefer the > dedup happen ent

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, > It is possible, there's room in the metadata for about 4k of > checksum for each 4k of data. The initial btrfs code used sha256, but > the real limiting factor is the CPU time used. I see. There are very efficient md5 implementations out there, especially if the code is writ

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris, > > Is there a checksum for every block in btrfs? > Yes, but they are only crc32c. I see, is it easily possible to exchange that with sha-1 or md5? > > Is it possible to retrieve these checksums from userland? > Not today. The sage developers sent a patch to make an ioctl for > t

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, > I wouldn't rely on crc32: it is not a strong hash, > Such deduplication can lead to various problems, > including security ones. sure thing, did you think of replacing crc32 with sha1 or md5, is this even possible (is there enough space reserved so that the change can be done without cha

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Tomasz, > Did you just compare checksums, or did you also compare the data "bit > after bit" if the checksums matched? no, I just used the md5 checksum. And even if I have a hash collision, which is highly unlikely, it still gives a good ballpark figure. Thomas

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, I have a few more questions on this:
- Is there a checksum for every block in btrfs?
- Is it possible to retrieve these checksums from userland?
- Is it possible to use a blocksize of 4 or 8 kbyte with btrfs?
To get a bit more specific: If it is relatively easy to

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Chris, what blocksizes can I choose with btrfs? Do you think that it is possible for an outsider like me to submit patches to btrfs which enable dedup in three fulltime days? Thomas

Re: Data Deduplication with the help of an online filesystem check

2009-04-27 Thread Thomas Glanzmann
Hello Chris, > There is a btrfs ioctl to clone individual files, and this could be used > to implement an online dedup. But, since it is happening from userland, > you can't lock out all of the other users of a given file. > So, the dedup application would be responsible for making sure a given

Data Deduplication with the help of an online filesystem check

2009-04-26 Thread Thomas Glanzmann
Hello, I would like to know if it would be possible to implement the following feature in btrfs: Have an online filesystem check which accounts for possible duplicated data blocks (maybe with the help of already implemented checksums: Are these checksums for the whole file or block based?) and de