On Mon, Dec 17, 2012 at 2:33 PM, Alexander Block <abloc...@gmail.com> wrote:
> I did some research on deduplication in the past and there are some
> problems that you will face. I'll try to list some of them (certainly
> not all).
>
> On Mon, Dec 17, 2012 at 1:05 PM, Martin Křížek <martin.kri...@gmail.com> wrote:
>> Hello everyone,
>>
>> my name is Martin Krizek. I am a student at the Faculty of Information
>> Technology, Brno University of Technology, Czech Republic. As my
>> master's thesis I chose to work on Online Deduplication for Btrfs.
>>
>> My goal for this semester is to study the Btrfs design and the offline
>> deduplication patch [1], and to come up with a design for the online
>> dedup. I will be implementing the feature next semester (spring, that
>> is).
> The offline dedup patch is quite old and won't apply/compile anymore.
> You should probably look into the clone ioctl, which does basically
> the same thing as the extent_same ioctl from the patch. Based on the
> clone ioctl you can at least learn how to "inject" existing extents
> into other inodes.

I am aware of this, thanks for pointing that out though.
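For anyone else reading along, this is roughly the minimal way to drive
the clone range ioctl from userspace. The struct and ioctl number are
copied from fs/btrfs/ioctl.h; error handling is stripped to the bare
minimum and the sketch is untested:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

#define BTRFS_IOCTL_MAGIC 0x94

struct btrfs_ioctl_clone_range_args {
        int64_t  src_fd;
        uint64_t src_offset;
        uint64_t src_length;
        uint64_t dest_offset;
};

#define BTRFS_IOC_CLONE_RANGE \
        _IOW(BTRFS_IOCTL_MAGIC, 13, struct btrfs_ioctl_clone_range_args)

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
                return 1;
        }

        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_WRONLY);
        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }

        /* Share the first 1 MiB of src with dst. Both ranges must be
         * block aligned and both files on the same btrfs filesystem;
         * on success the extents are referenced, not copied. */
        struct btrfs_ioctl_clone_range_args args = {
                .src_fd = src, .src_offset = 0,
                .src_length = 1024 * 1024, .dest_offset = 0,
        };
        if (ioctl(dst, BTRFS_IOC_CLONE_RANGE, &args) < 0)
                perror("BTRFS_IOC_CLONE_RANGE");
        return 0;
}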
>> I would like to shortly introduce my ideas on what I think the
>> feature should look like.
>>
>> * Use cases
>> The two main use cases for dedup are:
>> 1. virtual machine images
>> 2. backup servers
>>
>> * When to dedup
>> The choice of 'when to dedup' is not up to me as the master's thesis
>> specification states "online". :)
>>
>> * What to dedup
>> I'd say the most reasonable choice is block-level deduplication, as
>> it seems the most general approach to me.
> Here you have two options:
> 1. Based on the per-volume tree, where you'll find btrfs_file_extent
> items which point to the global extent tree.
> 2. Based on the global extent tree. By using the backref resolving
> code, you can find which volumes refer to these extents.

Agreed.

>> * Controlling dedup
>> - turn deduplication on/off - specify the subvolumes on which
>> deduplication is turned on (mount, ioctl - inherited),
>> - turn on/off byte-by-byte comparison of blocks that have the same
>> hashes (mount, ioctl),
>> - deduplication statistics (ioctl)
> You'll get into trouble when online dedup is turned on and off again.
> While it is off, extents still get written, but your hash tree won't
> be kept up to date. You'll need a way to bring it up to date when
> dedup is turned on again, without losing too much performance while
> updating.

Well, yes.

>> * Limitations
>> Not really limitations, but this is a list of situations in which
>> dedup will not be triggered:
>> - encryption,
> I've already heard somewhere else that encryption+dedup is not
> possible, but I don't understand why. Can someone explain this
> limitation?
>> - compression - basically, dedup is a kind of compression; it might
>> be worth looking into in the future though,
>> - inline/prealloc extents,
> It should be possible to dedup inline extents, but it must be
> configurable (e.g. a minimum block size). People should also be able
> to disable it completely when performance is important.

I agree that it's possible, but I don't think it's worth doing. You
won't save much storage that way.

>> - data across subvolumes
> Should be possible. See my comment on the key in the hash tree.

Again, I agree that it's possible, but this is something I probably
won't go into.

>> * How to store hashes
>> The obvious choice would be to use the checksum tree that holds the
>> block checksums of each extent. The problem with the checksum tree
>> is that the checksums are looked up by the logical address of the
>> start of the extent data. This is undesirable, since it needs to be
>> done the other way around: logical addresses need to be looked up by
>> hash.
>> To solve this, another key type would be created inside the checksum
>> tree (or, maybe better, a new tree would be introduced) that would
>> have a hash as the key's right-hand value. This way, items could be
>> looked up by hash:
>> (root, HASH_ITEM, hash)
>> The root value says which root (subvolume) the hashed block is
>> stored on. The hash value is the hash itself.
> With the root inside the key you make cross-subvolume deduplication
> impossible. Also, the offset field in the key that you plan to use
> for the hash is only 64 bits, so you can at best store a part of the
> hash in the key. You should probably split the hash into 3 parts: 64
> bits to put into the objectid field, 64 bits to put into the offset
> field, and the remainder into the item data. A lookup would then do
> the necessary splitting and, in case a match is found, also compare
> the remainder against the one found in the item's data.

Right, I thought of this but somehow forgot to mention it. Yes, the
hash would need to be split if it does not fit into the offset. Thanks
for noticing.
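Just to make the splitting concrete, here is roughly what I have in
mind. All names are made up and byte order is ignored for brevity;
this is not actual btrfs code:

#include <stdint.h>
#include <string.h>

/* Mirrors the (objectid, type, offset) layout of struct btrfs_key:
 * digest bytes 0..7 go into the objectid, bytes 8..15 into the
 * offset, and the remaining 16 bytes into the item data. */
struct dedup_key {
        uint64_t objectid;  /* sha256 bytes 0..7 */
        uint8_t  type;      /* e.g. a new HASH_ITEM key type */
        uint64_t offset;    /* sha256 bytes 8..15 */
};

static void sha256_to_key(const uint8_t digest[32],
                          struct dedup_key *key, uint8_t remainder[16])
{
        memcpy(&key->objectid, digest, 8);
        memcpy(&key->offset, digest + 8, 8);
        memcpy(remainder, digest + 16, 16);
        /* A lookup searches the tree for (objectid, type, offset) and
         * then memcmp()s the remainder against the item data of each
         * match. */
}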
>> The item data would be of the following structure:
>> struct btrfs_hash_item {
>>         __le64 disk_bytenr;
>>         __le64 disk_num_bytes;
>>         __le64 offset;
>> }
> You could omit the offset here and store the sum of disk_bytenr and
> offset in one field. You could also omit the disk_num_bytes field if
> you have a fixed dedup block length. The reason: no matter what you
> store here, you never know if the referred extent still exists and
> still contains the same data. So when you find a possible duplicate,
> a lookup into the extent tree is required to make sure it is still
> valid. Probably the extent's generation can help here.

Good points.
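If I follow that, with a fixed dedup block length the item could
shrink to something like this (again only a sketch, names made up):

#include <linux/types.h>

struct btrfs_dedup_hash_item {
        /* disk_bytenr + offset collapsed into one field, pointing at
         * the deduplicated block directly */
        __le64 bytenr;
        /* generation of the extent when it was hashed, so a lookup
         * can cheaply detect that it was rewritten since then */
        __le64 generation;
} __attribute__ ((__packed__));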
>> Fields disk_bytenr and disk_num_bytes would point to an extent item
>> of the extent allocation tree. The offset is the offset of the
>> hashed block within the extent.
>>
>> * When to trigger dedup exactly
>> It might be appropriate to delay the process until just before the
>> data is written out to disk, to give the data a chance to change. I
>> looked a bit into the compression code and it seems like dedup could
>> use a similar approach - threads that do dedup on sync or general
>> writeback.
> It's important that you avoid unnecessary writes of duplicated blocks
> in case you do delayed deduplication. See my final comment for
> details.
>>
>> * Hash algorithm
>> sha256 seems like a safe choice.
> It doesn't fit into the dedup tree's key. See my comments above.
>>
>> * Future improvements
>> There are some features that might be considered if time permits, or
>> as something that could be implemented in the future:
>> - dedup across subvolumes,
>> - a choice of hash algorithm,
>> - a configurable number of deduplication blocks.
>>
>> I would appreciate any comments you might have, thanks!
>>
>> Best regards,
>> Martin Krizek
>>
>> [1] http://lwn.net/Articles/422331/
>
> After my research I came to the conclusion that a real online
> deduplication feature is not worth the trouble. The only advantage of
> online deduplication compared to offline deduplication is that you
> can avoid unnecessary writes of duplicate data and thus, in theory,
> increase performance. In practice, I don't know if you really gain
> much, because for small writes of duplicated data you still have to
> write out metadata. In all cases, at least the inode's extent items
> need to be written, no matter whether they point to new or duplicate
> data. The performance gain will only be measurable for large
> duplicated blocks; in all other cases you'll lose performance due to
> the overhead of deduplication.
>
> Storing the hashes inside a btrfs tree is also a bad idea. The hashes
> are randomly distributed and thus have completely randomly ordered
> keys. Lookup and insertion will be very slow when there is not enough
> RAM to hold the whole hash tree in memory... which is very likely to
> happen. You should think about an alternative, e.g. storing the hash
> table inside a dedicated inode (in a format that is more appropriate
> for hash tables). But even with this solution, you'll run out of RAM
> very fast. You can use Bloom filters or some comparable probabilistic
> data structure to avoid most of the lookups, but then you still have
> a lot of insertions, because no matter whether a duplicate is written
> or not, you need to update the hash table for later lookups.
>
> Even if you find a good and fast solution for storing the hashes,
> you'll get into trouble when data gets overwritten or deleted. Then
> you have stale hash values in the hash tree/table. This means that
> for each lookup you'll also need to check whether the found extent is
> still there and unchanged. As an alternative, you could update the
> hash table on the fly when data gets overwritten or deleted... but
> this again means additional lookups and writes, and in the end a
> performance loss. Also, when dedup is enabled for the first time (or
> after it was disabled for some time), you have no hash table at all
> (or only a partial, out-of-date one)... what do you do in this case?
>
> The next problem is that the backref resolving code gets very slow
> when there are too many references to the same block. This will slow
> down everything that depends on backref resolving, e.g. scrubbing (in
> case of errors), send, and deduplication itself. So you'll need to
> limit the number of allowed references to each block, which may give
> poor deduplication results. Probably the backref resolving code can
> be optimized... I tried, but with limited results.
>
> Then you have fragmentation... which can maybe be ignored in some
> cases (e.g. for SSDs).
>
> In my opinion, the only acceptable deduplication solution would be
> either an offline or a semi-online solution which basically does
> periodic offline deduplication. I'm a big fan of doing this
> in-kernel, because then it can use a lot of information that is
> already present. For example, the checksums that are already present
> in the checksum tree can be used as a first-phase hint. Some time ago
> I had an in-kernel prototype running that used this approach:
>
> 1. Run through the checksum tree and do the following for each
> checksum:
>    a. Check if we (probably) have the checksum in bloom_filter1.
>    b. If yes, add the checksum to bloom_filter2.
>    c. In any case, add the checksum to bloom_filter1.
> 2. Run through the checksum tree a second time and do the following
> for each checksum:
>    a. Check if we (probably) have the checksum in bloom_filter2.
>    b. If yes, put the extent info (sha256, disk_bytenr, ...) into a
> list with a defined maximum size. If the list gets full, sort it by
> hash and write the collected extents out to a dedicated inode. Then
> start with a fresh, empty list.
> 3. We now have a dedicated file with multiple sorted blocks. Each
> sorted block contains a list of extents. We can now do a very fast
> merge-sort-like read and thus get a single large stream of
> hash-sorted extents. While reading in the sorted extents, we can
> check if the previous extent has the same hash as the current one. If
> yes, we have a duplicate.

I am not too familiar with Bloom filters, so I will need to get back
to this later, but I've sketched my rough understanding of the two
passes below - please correct me if I got it wrong. Thanks for
sharing.
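Here is that rough understanding, modelled in userspace C. Everything
in it is made up for illustration (the filter size, the toy hash
functions, iterating over a plain array instead of the real checksum
tree), so treat it as a sketch of the two passes, not real code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FILTER_BITS (1 << 24)

struct bloom {
        uint8_t bits[FILTER_BITS / 8];
};

/* Toy hash functions; a real Bloom filter would use k independent
 * hash functions tuned to the expected number of entries. */
static uint32_t hash1(uint32_t csum) { return (csum * 2654435761u) % FILTER_BITS; }
static uint32_t hash2(uint32_t csum) { return (csum * 40503u + 17) % FILTER_BITS; }

static void bloom_add(struct bloom *b, uint32_t csum)
{
        b->bits[hash1(csum) / 8] |= 1 << (hash1(csum) % 8);
        b->bits[hash2(csum) / 8] |= 1 << (hash2(csum) % 8);
}

static bool bloom_maybe_contains(const struct bloom *b, uint32_t csum)
{
        return (b->bits[hash1(csum) / 8] & (1 << (hash1(csum) % 8))) &&
               (b->bits[hash2(csum) / 8] & (1 << (hash2(csum) % 8)));
}

/* Pass 1: f1 remembers "seen at least once", f2 remembers "probably
 * seen at least twice". */
static void pass1(struct bloom *f1, struct bloom *f2,
                  const uint32_t *csums, size_t n)
{
        for (size_t i = 0; i < n; i++) {
                if (bloom_maybe_contains(f1, csums[i]))
                        bloom_add(f2, csums[i]);
                bloom_add(f1, csums[i]);
        }
}

/* Pass 2: only checksums that are probable duplicates survive; the
 * prototype would collect their extent info here, sort by hash and
 * write the sorted blocks out for the final merge pass. */
static size_t pass2(const struct bloom *f2, const uint32_t *csums,
                    size_t n, uint32_t *candidates)
{
        size_t out = 0;

        for (size_t i = 0; i < n; i++)
                if (bloom_maybe_contains(f2, csums[i]))
                        candidates[out++] = csums[i];
        return out;
}

int main(void)
{
        static struct bloom f1, f2; /* static: too big for the stack */
        uint32_t csums[] = { 1, 2, 3, 2, 4, 1 };
        uint32_t dup[6];

        pass1(&f1, &f2, csums, 6);
        /* dup now holds the candidates 1, 2, 2, 1 (modulo false
         * positives); 3 and 4 were filtered out as unique. */
        return pass2(&f2, csums, 6, dup) ? 0 : 1;
}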
> I stopped working on that due to missing time. There are still a lot
> of problems to solve with this solution which I won't go into in
> detail now. The main advantages of this solution are the low memory
> consumption and the fact that the time mostly depends on the number
> of duplicates instead of on the whole filesystem size, as there is no
> need to hash the whole filesystem. Also, you don't have a permanent
> large table on disk, as you can delete the dedicated inodes after
> you're finished.

Thanks for your comments!

Martin