On Wed, Jul 22, 2015 at 09:49:52AM +0800, Qu Wenruo wrote:
> Change subject to reflect the core of the conversation.
> 
> Zygo Blaxell wrote on 2015/07/21 18:14 -0400:
> >On Tue, Jul 21, 2015 at 02:52:38PM +0800, Qu Wenruo wrote:
> >>Zygo Blaxell wrote on 2015/07/21 00:55 -0400:
> >There's already a read-and-compare in the extent-same ioctl.  :-/
> Yes, but the extent-same ioctl needs 2 inodes (one already provided by
> the ioctl system call): one for the source and one for the dest, which
> makes the code easy to implement as we can just use VFS calls to read
> the needed pages.
> 
> But the problem is, for dedup, there is only a hash <-> extent mapping.
> We only know the corresponding bytenr of a hash, not an inode.
> 
> Even if using the read-compare method, it means that in the write
> routine we will reuse the read routine, creating a more complicated
> read/write path combination to test, and later changes to the read
> routine will be quite dirty to avoid impacting the write routine.
> 
> And what's worse, an extent can be referred to by several different
> inodes, so even in the best case we need to find an inode using backref
> and then use address space calls to read the needed pages.
> 
> That will lead to the hell of delayed refs.
> As at write time backrefs are delayed, we need not only to search the
> extent tree for backref items but also go through delayed refs. (Such
> things were the reason for the crazy quota numbers before, which I
> tried to fix in 4.2-rc1.)
> 
> And that's what I feared to face: complicated read/write coupling with
> a backref search nightmare.

Yes, putting a small-memory dedup into the write path on Linux does
cause some challenging (some might say insurmountable ;) ) problems.

A natural result of having a userspace daemon is that it can keep up
with writes but it is not on the critical path for them, i.e. if we
have memory pressure we just throw the blocks at the disk to free up
some RAM, but remember the block addresses so they can be rescanned
later when we're more idle.  A daemon with a persistent store and a
find-new style scanner can be paused and allowed to run overnight if
it's too heavy to run live.
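The "remember the block addresses, rescan when idle" idea above can be
sketched in a few lines of C.  All names here (note_written_block,
drain_pending, PENDING_MAX) are hypothetical illustrations, not taken
from any real daemon; the point is only that the write-observation side
never blocks, and dropped entries are harmless because a later full
scan catches them anyway:

```c
#include <stdint.h>
#include <stddef.h>

#define PENDING_MAX 1024            /* fixed RAM budget for the queue */

static uint64_t pending[PENDING_MAX];  /* block addresses to rescan */
static size_t   pending_len;

/* Called when a write is observed; must never block or allocate. */
static void note_written_block(uint64_t bytenr)
{
	if (pending_len < PENDING_MAX)
		pending[pending_len++] = bytenr;
	/* else: drop it; a later full scan will catch this block anyway */
}

/* Called when the system is idle: hash/index each pending block
 * (elided here), then reset the queue.  Returns how many were drained. */
static size_t drain_pending(void)
{
	size_t n = pending_len;
	/* ... read, hash, and index pending[0..n-1] here ... */
	pending_len = 0;
	return n;
}
```

Overflow policy is the interesting design choice: because the daemon is
off the critical path, losing a queue entry costs only a missed dedup
opportunity until the next scan, never a stalled write.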
I could see such a facility living inside the kernel and being started
and stopped by an admin...but it might be a better idea to just adjust
the kernel's interface to communicate better with a dedup daemon.
e.g. I have a wishlist item for something like fanotify which gives me
a file descriptor when a file is modified--but one that also includes
dirty block numbers.  I can connect those modified blocks with their
duplicate on-disk counterparts almost immediately, before delalloc even
considers writing them.

> >That's precisely the tradeoff I made: up to 0.1% more I/O for unique
> >blocks, and duplicate writes would become reads instead (assuming I
> >moved my dedup daemon into the kernel).  There is some extra seeking
> >on writes to confirm duplicate pages--but when they are confirmed,
> >the writes are no longer necessary and their I/O can be dropped.
> >
> >The payoff is that it requires 100 times less RAM, so it can coexist
> >with other things on a medium-sized machine.
> 
> Quite a convincing trade-off now.
> 
> But the above extent reading problem is still here.
> Although I may also be wrong, as I'm not quite familiar with VFS and
> filemap.
> 
> If there is any good kernel API to read pages directly by bytenr
> without using the filemap/address space infrastructure, I'll be quite
> happy to try it.

I had assumed (but not checked) that the kernel can do it more easily
than userspace.  The ugliest pieces of code for userspace dedup are the
pieces that translate a block number into a path into a file
descriptor, then have to verify that they ended up with the same
physical block that was requested.

So another wishlist item would be an ioctl for "give me a file
descriptor containing a ref to this blocknr without having to go
through all this root-inode-path resolution crap."  ;)

> >My target environment is a file server device where up to 75% of the
> >system RAM might be devoted to dedup.  It is not something that is
> >intended to run unnoticed on a smartphone.
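For concreteness, the blocknr-to-fd wishlist ioctl above might take a
shape like the following.  To be clear: no such ioctl exists in btrfs;
the struct name, fields, and sizes are all invented here purely to
illustrate what the interface being wished for would look like,
following the usual btrfs ioctl-args convention of reserved padding:

```c
#include <stdint.h>

/* Hypothetical args for a "resolve bytenr to open fd" ioctl, skipping
 * the backref -> inode -> path -> open() dance entirely. */
struct btrfs_ioctl_bytenr_to_fd_args {
	uint64_t bytenr;       /* in:  logical address to resolve */
	uint64_t offset;       /* out: offset of bytenr within the file */
	int32_t  fd;           /* out: read-only fd for one holder of
	                        *      the extent, or -1 on failure */
	uint32_t reserved[9];  /* future expansion, zeroed by caller */
};
```

The kernel side already has the backref machinery to find a holder; the
win for a daemon would be skipping the userspace path resolution and the
re-verification that the opened file still maps the requested block.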
> The original thought on a persistent hash map was for databases, hard
> to imagine, right?
> 
> From what I read in an Oracle datasheet, they claim their database can
> take full advantage of ZFS deduplication, so I consider such a
> persistent hash map idea as a supplement to the uncertain LRU drop.
> But that's just a future idea yet...

I looked at a number of databases, hash table structures, bloom
filters, cuckoo tables, exotic modern C++ hashtable libraries from
Google, and ancient articles on hash tables by Knuth.  In the end I
gave up and rolled my own.

I have a lot of duplicate data--almost half--so I can't use algorithms
designed for a 10% hit rate (like Bloom and cuckoo).  I don't want to
scan the entire filesystem from scratch every time, so Bloom was out
(no way to delete items, and the variations on Bloom that allow
deletion require almost as much space as the hash table itself).
Indexing overhead for b-trees and most database engines would be larger
than the hash table.

With a 50% duplicate hit rate, _any_ algorithm that spilled out to disk
would be hitting the disk with a random I/O for every 8K of data.  This
meant that any data that wasn't going to fit into RAM might as well be
on the moon.  If I tried to keep it, I'd be swapping all the time.
I found a way to work in constant space, because RAM was all the
storage I'd ever be able to use.  Persistence is trivial: just dump the
RAM out to disk every few hours.

> BTW, is your userspace dedup daemon open-source?
> It would be quite nice if we could refer to it to improve it further.

It is open-source in as much as it is source at all, but it's not in a
usable state yet.  Right now it is a pile of command-line utilities,
several experimental test programs which perform different parts of the
task, some ad-hoc Perl scripts to glue them together, and test-driven
rework that I am slowly converting into a coherent working program that
other human beings can read and use.  The "persistent" on-disk format
changes every three weeks.
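A minimal sketch of the constant-space idea above, assuming a fixed
bucket count and evict-on-collision (the struct and function names are
hypothetical, not from the daemon's actual code, and a real table would
size TABLE_SIZE to the RAM budget and verify candidates by reading the
blocks, since a hash match is only a hint):

```c
#include <stdint.h>

#define TABLE_SIZE 4096          /* fixed: the RAM budget, never grows */

struct dedup_entry {
	uint64_t hash;           /* content hash of the block */
	uint64_t bytenr;         /* on-disk address of the block */
	uint8_t  in_use;
};

static struct dedup_entry table[TABLE_SIZE];

/* Insert a hash->bytenr mapping; on collision, evict the old entry
 * (constant space means old entries are simply forgotten).  Returns
 * the previous bytenr if this hash was already present, else 0. */
static uint64_t table_insert(uint64_t hash, uint64_t bytenr)
{
	struct dedup_entry *e = &table[hash % TABLE_SIZE];
	uint64_t prev = (e->in_use && e->hash == hash) ? e->bytenr : 0;

	e->hash = hash;
	e->bytenr = bytenr;
	e->in_use = 1;
	return prev;             /* nonzero => candidate dup, verify by read */
}
```

Persistence in this scheme is exactly as described above: the table is
a flat array, so "dump the RAM out to disk" is a single write() of
`table`, and recovery is a single read().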
I'm on my fourth rewrite of the extent-matching loop because the first
three ran into logical landmines in my test data.  It was originally
supposed to be a 1000-line replacement for faster-dupemerge to be
finished in a week.  This is now month 14, it has crossed the 2000-line
mark, and I'm hoping this month will be the first month that I rewrite
_less_ than 30% of the code.  :-P

> Thanks,
> Qu
> 
> >
> >>In fact, I can reduce the space usage by just (not so easy to call
> >>it "just" anyway) changing the dedup size to 512K.  So just one
> >>btrfs_dedup_hash for 512K instead of the original 128 dedup_hashes
> >>for 512K.
> >>That's also why Liu Bo uses a tunable dedup_size.
> >
> >Anyone can increase the block size to make their dedup faster.  We
> >lose 100% of all duplicate extents that are smaller than (or just not
> >aligned to) the block size.
> >
> >>Anyway, my dedup_hash struct is really huge...
> >>It's 120 bytes with all overhead for one btrfs_dedup_hash (and one
> >>page yet).
> >
> >I'm told ZFS is closer to 320.  ;)
> >
> >>>At about 157 bytes per 512K
> >>>(the amount of RAM I have to run dedup) the dedup still covers 92%
> >>>of duplicate extents with length 64K, with less coverage for
> >>>smaller extents and more coverage for larger extents.  The size
> >>>distribution of typical filesystems means I get 99% of all
> >>>duplicate blocks.
> >>>
> >>Pretty impressive, but still a little worried about the code
> >>complexity, especially when it needs to go into the kernel.
> >>
> >>At last, thanks for your impressive dedup ideas and test data.
> >>
> >>Thanks,
> >>Qu