On Fri, Dec 02, 2011 at 01:59:37AM +0100, Ragnar Sundblad wrote:
>
> I am sorry if these are dumb questions. If there are explanations
> available somewhere for those questions that I just haven't found,
> please let me know! :-)
I'll give you a brief summary.

> 1. It has been said that when the DDT entries, some 376 bytes or so,
> are rolled out on L2ARC, there still is some 170 bytes in the ARC to
> reference them (or rather the ZAP objects I believe). In some places
> it sounds like those 170 bytes refer to ZAP objects that contain
> several DDT entries. In other cases it sounds like for each DDT entry
> in the L2ARC there must be one 170 byte reference in the ARC. What is
> the story here really?

Currently, every object (not just DDT entries) stored in L2ARC is
tracked in memory. This metadata identifies the object and records
where on the L2ARC it is stored. The on-disk L2ARC contains no such
metadata and is not self-describing. This is one reason why the L2ARC
starts out empty/cold after every reboot, and why the usable size of
the L2ARC is limited by memory. (A toy sketch of this in-memory
tracking follows below, after the answer to 2.)

DDT entries in core are used directly. If the relevant DDT node is not
in core, it must be fetched from the pool, which may in turn be
assisted by an L2ARC. It's my understanding that, yes, several DDT
entries are stored in each on-disk "block", though I'm not certain of
the number. The on-disk size of a DDT entry is different from its
in-core size, too.

> 2. Deletion with dedup enabled is a lot heavier for some reason that
> I don't understand. It is said that the DDT entries have to be
> updated for each deleted reference to that block. Since zfs already
> has a mechanism for sharing blocks (for example with snapshots), I
> don't understand why the DDT has to contain any more block references
> at all, or why deletion should be much harder just because there are
> checksums (DDT entries) tied to those blocks, and even if they have
> to, why it would be much harder than the other block reference
> mechanism. If anyone could explain this (or give me a pointer to an
> explanation), I'd be very happy!

DDT entries are reference-counted. Unlike other things that look like
multiple references, these are truly block-level independent;
everything else is either tree-structured or highly aggregated
(metaslab free-space tracking, for instance).

Snapshots, for example, are references to a certain internal node (the
root of a filesystem tree at a certain txg), and that counts as a
reference to the entire subtree underneath. Note that any later changes
to this subtree (via writes into the live filesystem) diverge
completely via CoW; an update produces a new CoW block tree all the way
back to the root, above the snapshot node.

When a snapshot is created, it starts out owning (almost) nothing. As
data is overwritten, ownership of the data that might otherwise be
freed is transferred to the snapshot. When the oldest snapshot is
destroyed, any data blocks it owns can be freed. When an intermediate
snapshot is destroyed, data blocks it owns are either transferred to
the previous, older snapshot because they were shared with it (their
birth txg is earlier than that snapshot's), or they are unique to this
snapshot and can be freed.

Either way, these decisions are tree-based and can potentially free
large swathes of space with a single decision, whereas the DDT needs a
refcount update for each block individually (and in random order, as
per the answer to 3 below); the second sketch below illustrates that
per-block bookkeeping. (This is not the same as the ZPL directory tree
used for naming, however; don't get those confused. The block tree is
flatter than that.)
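To make the answer to 1 more concrete, here is a toy, hypothetical
sketch in plain C; it is emphatically not the real ZFS structures or
field sizes, just the shape of the problem: a small header has to stay
in RAM for every buffer that lives on the L2ARC, so the RAM cost scales
with cache size divided by buffer size.

/*
 * Hypothetical sketch - NOT the real ZFS structures or sizes - of the
 * per-buffer header that must stay in RAM for every buffer held on the
 * L2ARC: enough to say which block it is and where on the cache device
 * it lives.  Lose RAM (reboot) and the cache contents are unreachable,
 * hence the L2ARC starting out cold.
 */
#include <stdint.h>
#include <stdio.h>

struct l2_hdr {                 /* lives in the ARC (RAM) */
    uint64_t dva[2];            /* identity of the cached block */
    uint64_t birth_txg;
    uint64_t l2_daddr;          /* byte offset on the cache device */
    uint32_t l2_asize;          /* allocated size on the device */
    uint8_t  checksum_alg;
    uint8_t  compressed;
};

int main(void)
{
    /* Rough RAM cost of indexing a 256 GB L2ARC full of 8 KB buffers. */
    uint64_t l2_bytes   = 256ULL << 30;
    uint64_t buf_size   = 8ULL << 10;
    uint64_t n_buffers  = l2_bytes / buf_size;
    uint64_t ram_needed = n_buffers * sizeof (struct l2_hdr);

    printf("%llu headers, ~%llu MB of RAM just to index the cache\n",
        (unsigned long long)n_buffers,
        (unsigned long long)(ram_needed >> 20));
    return (0);
}

Halve the average buffer size and the header cost doubles; that's the
sense in which RAM bounds the useful L2ARC size.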
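And for the answer to 2, a similarly hypothetical sketch of the
per-block bookkeeping dedup forces on delete. The names and the toy
linear-scan table are illustrative only, not the actual DDT code:
every deduped block has one reference-counted entry keyed by its
checksum, and freeing the block means finding that entry and
decrementing it, one block at a time.

/*
 * Hypothetical sketch - illustrative names only, not the actual DDT
 * code - of why deletes get expensive with dedup.  Compare snapshots,
 * where freeing a subtree can release a large range of blocks with a
 * single tree-level decision.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct ddt_entry {
    uint64_t cksum[4];          /* 256-bit content checksum (the key) */
    uint64_t refcnt;            /* block pointers sharing this data */
    bool     used;
};

static struct ddt_entry table[64];      /* toy stand-in for the DDT */

/* Find the entry for this checksum, creating it on first use. */
static struct ddt_entry *
ddt_lookup(const uint64_t cksum[4])
{
    for (size_t i = 0; i < 64; i++) {
        if (!table[i].used) {
            memcpy(table[i].cksum, cksum, sizeof (table[i].cksum));
            table[i].used = true;
            return (&table[i]);
        }
        if (memcmp(table[i].cksum, cksum, sizeof (table[i].cksum)) == 0)
            return (&table[i]);
    }
    return (NULL);              /* toy table full */
}

int main(void)
{
    uint64_t cksum[4] = { 0xdeadbeefULL, 1, 2, 3 };

    /* Two writes of identical data: one entry, refcount 2. */
    ddt_lookup(cksum)->refcnt++;
    ddt_lookup(cksum)->refcnt++;

    /*
     * Deleting each copy is a separate lookup (an IO, if the entry
     * isn't cached) plus a decrement; only when the count hits zero is
     * the block really freed.
     */
    for (int del = 0; del < 2; del++) {
        struct ddt_entry *de = ddt_lookup(cksum);
        if (--de->refcnt == 0)
            printf("last reference gone, space reclaimed\n");
        else
            printf("still %llu reference(s) left\n",
                (unsigned long long)de->refcnt);
    }
    return (0);
}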
> 3. I, as many others, would of course like to be able to have very
> large datasets deduped without having to have enormous amounts of
> RAM. Since the DDT is an AVL tree, couldn't just that entire tree be
> cached on, for example, an SSD and be searched there without
> necessarily having to store anything of it in RAM? That would
> probably require some changes to the DDT lookup code, and some
> mechanism to gather the tree to be able to lift it over to the SSD
> cache, and some other stuff, but still that sounds - with my very
> basic (non-)understanding of zfs - like a not too overwhelming
> change.

Think of this the other way round. One could do this, and could require
a dedicated device (SSD) in order to use dedup at all. Now every DDT
lookup requires IO to bring the DDT entry into memory. That would be
slow, so we could add an in-memory cache for the DDT... and we're back
to square one.

The major issue with the DDT is that, being indexed by content hash, it
is accessed randomly, even for sequentially-accessed data. There's no
getting around that; it's in its job description. (The first sketch
below shows why.)

> 4. Now and then people mention that the problem with bp_rewrite has
> been explained, on this very mailing list I believe, but I haven't
> found that explanation. Could someone please give me a pointer to
> that description (or perhaps explain it again :-) )?

This relates to the answer for 2: all the pointers in the trees
discussed there are block pointers to device virtual addresses. If
you're going to move data, you're going to change its address, which
necessitates updating every tree that references it, and with it the
checksums held by every ancestor block above the changed pointers (the
second sketch below walks through that ripple). Several things make
this tricky:

 - you're trying to follow references the wrong way, so there's a lot
   of tree-searching to be done even just to find the dependencies;
   resolving those dependencies may be harder still, with lots of
   combinatorial complexity and reverse searching for information.

 - you want to retain CoW semantics for safety while making the
   changes, yet the rest of the filesystem depends on the semantics of
   these blocks not changing.

 - as a result of the combination of the above, you may wind up with
   races/contention against live filesystem updates, scrubs and other
   error recoveries, and the need to add a lot of complex locking or
   other mechanism that isn't currently needed.

It's not impossible, but you wind up touching lots of code and making
all the tests much more complex.
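To illustrate the random-access point in 3, a small stand-alone C toy;
nothing here is ZFS-specific, and a made-up 64-bit mixer stands in for
SHA-256. Consecutive blocks of a file map to scattered keys in a
checksum-indexed table, so even purely sequential writes or frees touch
the DDT all over the place.

/*
 * Toy illustration: a decent hash scatters its inputs, so a table
 * keyed by content checksum is hit in effectively random order even
 * for sequential data.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
toy_hash(uint64_t x)
{
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return (x);
}

int main(void)
{
    /* Ten consecutive blocks of a file land on wildly scattered keys. */
    for (uint64_t blkid = 0; blkid < 10; blkid++)
        printf("block %2llu -> key %016llx\n",
            (unsigned long long)blkid,
            (unsigned long long)toy_hash(blkid));
    return (0);
}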
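And for 4, a very rough, hypothetical sketch of why moving a block
ripples upward; the structures and names below are simplified
inventions, not the real ZFS block pointer. A parent block holds, for
each child, the child's on-disk address plus a checksum of the child's
contents, so rewriting one address dirties the parent, whose new
checksum dirties the grandparent, and so on up to the root.

/*
 * Hypothetical sketch of the bp_rewrite ripple.  Move a child and the
 * parent's contents change, so the checksum held by the grandparent
 * changes, and so on - and every tree holding a pointer to the moved
 * block must be found and rewritten the same way.
 */
#include <stdint.h>
#include <stdio.h>

struct blkptr {
    uint64_t dva_offset;        /* where the child lives on disk */
    uint64_t birth_txg;
    uint64_t checksum;          /* checksum of the child's contents */
};

struct indirect_block {
    struct blkptr bp[4];        /* pointers to children */
};

/* Toy checksum over an indirect block's contents. */
static uint64_t
toy_cksum(const struct indirect_block *ib)
{
    uint64_t c = 14695981039346656037ULL;
    for (int i = 0; i < 4; i++)
        c = c * 1099511628211ULL +
            ib->bp[i].dva_offset + ib->bp[i].checksum;
    return (c);
}

int main(void)
{
    struct indirect_block parent = { .bp = {
        { .dva_offset = 0x1000, .birth_txg = 7, .checksum = 0xaaaa },
        { .dva_offset = 0x2000, .birth_txg = 9, .checksum = 0xbbbb },
    } };
    /* The grandparent's pointer to 'parent' holds parent's checksum. */
    struct blkptr in_grandparent = {
        .dva_offset = 0x9000,
        .checksum   = toy_cksum(&parent),
    };

    /* "Move" child 0: its data is unchanged, but its address isn't... */
    parent.bp[0].dva_offset = 0x5000;

    /*
     * ...so the parent must be rewritten, which invalidates the
     * checksum the grandparent holds, which dirties the grandparent,
     * and so on up to the root.
     */
    printf("grandparent held %016llx, must now hold %016llx\n",
        (unsigned long long)in_grandparent.checksum,
        (unsigned long long)toy_cksum(&parent));
    return (0);
}

bp_rewrite's job would be to drive that ripple backwards, starting from
the data being moved and finding everything above and beside it that
has to change, which is where the reverse searching in the list above
comes from.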
--
Dan.