On Fri, Dec 02, 2011 at 01:59:37AM +0100, Ragnar Sundblad wrote:
> 
> I am sorry if these are dumb questions. If there are explanations
> available somewhere for those questions that I just haven't found, please
> let me know! :-)

I'll give you a brief summary.

> 1. It has been said that when the DDT entries, some 376 bytes or so, are
> rolled out to the L2ARC, there are still some 170 bytes in the ARC to
> reference them (or rather the ZAP objects, I believe). In some places it
> sounds like those 170 bytes refer to ZAP objects that contain several DDT
> entries. In other cases it sounds like for each DDT entry in the L2ARC
> there must be one 170-byte reference in the ARC. What is the story here
> really?

Currently, every object (not just DDT entries) stored in the L2ARC is
tracked in memory: this metadata identifies the object and records where
on the L2ARC it is stored. The on-disk L2ARC does not carry that metadata
and is not self-describing. This is one reason why the L2ARC starts out
empty/cold after every reboot, and why the usable size of the L2ARC is
limited by available memory.

DDT entries in core are used directly.  If the relevant DDT node is
not in core, it must be fetched from the pool, which may in turn be
assisted by an L2ARC.  It's my understanding that, yes, several DDT
entries are stored in each on-disk "block", though I'm not certain of
the number.  The on-disk size of a DDT entry differs from the in-core
size, too.
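
To make the sizes concrete, here is a rough back-of-envelope sketch in
Python, using the figures quoted in the question (~376 bytes per DDT
entry as stored on L2ARC, ~170 bytes of in-RAM header per L2ARC-resident
buffer). The block count and the entries-per-buffer packing are
illustrative assumptions, and the real per-entry/per-header sizes vary
between releases, so treat the output as an order-of-magnitude guide only.

    # Back-of-envelope DDT sizing sketch (illustrative figures only).
    L2ARC_ENTRY = 376        # bytes per DDT entry on L2ARC (figure from the question)
    ARC_HEADER  = 170        # bytes of in-RAM metadata per L2ARC-resident buffer

    unique_blocks   = 16 * 1024**2   # assume 16M unique (deduped) blocks in the pool
    entries_per_buf = 1              # worst case: one tracked buffer per entry;
                                     # if several entries share an on-disk block,
                                     # divide the header cost accordingly

    l2arc_bytes = unique_blocks * L2ARC_ENTRY
    ram_bytes   = (unique_blocks // entries_per_buf) * ARC_HEADER

    print("L2ARC space for the DDT: %.1f GiB" % (l2arc_bytes / 1024**3))
    print("RAM for L2ARC headers:   %.1f GiB" % (ram_bytes / 1024**3))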

> 2. Deletion with dedup enabled is a lot heavier for some reason that I don't
> understand. It is said that the DDT entries have to be updated for each
> deleted reference to that block. Since ZFS already has a mechanism for
> sharing blocks (for example with snapshots), I don't understand why the DDT
> has to contain any more block references at all, or why deletion should be
> much harder just because there are checksums (DDT entries) tied to those
> blocks, and even if they have to, why it would be much harder than the other
> block reference mechanism. If anyone could explain this (or give me a
> pointer to an explanation), I'd be very happy!

DDT entries are reference-counted.  Unlike other things that merely look
like multiple references, these are genuinely independent, per-block
references.

Everything else is either tree-structured or highly aggregated (metaslab
free-space tracking).

Snapshots, for example, are references to a certain internal node (the
root of a filesystem tree at a certain txg), and that counts as a
reference to the entire subtree underneath.  Note that any changes to
this subtree later (via writes into the live filesystem) diverge
completely via CoW; an update produces a new CoW block tree all the way
back to the root, above the snapshot node. 

When a snapshot is created, it starts out owning (almost) nothing. As
data is overwritten, the ownership of the data that might otherwise be
freed is transferred to the snapshot.

When the oldest snapshot is freed, any data blocks it owns can be
freed.  When an intermediate snapshot is freed, each data block it owns
is either transferred to the next older snapshot, because it was shared
with that snapshot (its birth txg precedes that snapshot's), or it is
unique to this snapshot and can be freed.

Either way, these decisions are tree-based and can potentially free
large swathes of space with a single decision, whereas the DDT needs a
refcount update for each individual block (in random order, as discussed
under question 3 below).

(This is not the same as the ZPL directory tree used for naming;
don't get the two confused.  The tree discussed here is flatter than
that.)
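
A minimal sketch of the contrast between the two free paths, in Python
rather than actual ZFS code (all names and fields are made up for
illustration), might look like this:

    # Freeing a block overwritten in the live filesystem, without dedup:
    # one birth-txg comparison decides its fate, and whole subtrees can
    # be handed over or freed in bulk this way.
    def free_without_dedup(blk, newest_snap_txg, snapshot_deadlist, freed):
        if blk["birth_txg"] <= newest_snap_txg:
            snapshot_deadlist.append(blk)   # still owned by the snapshot
        else:
            freed.append(blk)               # born after the snapshot: free it now

    # With dedup, every single free also requires a DDT update, keyed by
    # the block's checksum -- a random-access lookup that may have to go
    # to L2ARC or the pool if the entry isn't in RAM.
    def free_with_dedup(blk, ddt, freed):
        entry = ddt[blk["checksum"]]
        entry["refcount"] -= 1
        if entry["refcount"] == 0:
            del ddt[blk["checksum"]]
            freed.append(blk)               # last reference gone: reclaim the space

    # Tiny usage example with made-up values.
    ddt = {"abc": {"refcount": 2}}
    freed, deadlist = [], []
    free_with_dedup({"checksum": "abc", "birth_txg": 120}, ddt, freed)
    free_without_dedup({"checksum": "abc", "birth_txg": 120}, 100, deadlist, freed)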

> 3. I, as many others, would of course like to be able to have very large
> datasets deduped without having to have enormous amounts of RAM.
> Since the DDT is an AVL tree, couldn't that entire tree be cached on, for
> example, an SSD and be searched there without necessarily having to store
> any of it in RAM? That would probably require some changes to the DDT
> lookup code, and some mechanism to gather the tree to be able to lift it
> over to the SSD cache, and some other stuff, but still that sounds - with
> my very basic (non-)understanding of ZFS - like a not too overwhelming
> change.

Think of this the other way round.  One could do this, and require a
dedicated device (SSD) in order to use dedup at all.  Now every DDT
lookup requires I/O to bring the entry into memory.  That would be slow,
so we could add an in-memory cache for the DDT... and we're back to
square one.

The major issue with the DDT is that, being indexed by content hash, it
is random-access even for sequential-access data.  There's no getting
around that; it's in its job description.
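
A small Python illustration of that point: even perfectly sequential
block contents land in effectively random positions in a checksum-keyed
table (SHA-256 here is just for demonstration).

    # Sequential blocks, random DDT key order: the table is keyed by the
    # block checksum, and a good hash scatters consecutive blocks uniformly.
    import hashlib

    keys = [hashlib.sha256(b"block %d" % i).hexdigest() for i in range(8)]
    lookup_order = sorted(range(8), key=lambda i: keys[i])
    print("logical order:", list(range(8)))
    print("DDT key order:", lookup_order)   # effectively a random permutation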

> 4. Now and then people mention that the problem with bp_rewrite has been
> explained, on this very mailing list I believe, but I haven't found that
> explanation. Could someone please give me a pointer to that description
> (or perhaps explain it again :-) )?

This relates to the answer to 2: all the pointers in the trees discussed
there are block pointers holding device virtual addresses.  If you're
going to move data, you're going to change its address, which means
updating every block pointer that references it; and since block pointers
live inside parent blocks that are themselves checksummed, those updates
ripple all the way up the trees.  Several things make this tricky:

 - you're trying to follow references the wrong way (from a block back
   to everything that points at it), so there's a lot of tree-searching
   to be done just to find the dependencies.  Resolving those
   dependencies may be harder still, with lots of combinatorial
   complexity and reverse searching for information (the sketch below
   illustrates this).
 - you want to retain CoW semantics so that making the changes is safe,
   yet the rest of the filesystem depends on the semantics of these
   blocks not changing.
 - as a result of the combination of the above, you may wind up with
   races/contention against live filesystem updates, scrubs and other
   error handling/recovery, and the need to add a lot of complex locking
   or other mechanisms that aren't currently needed.

It's not impossible, but you will wind up touching lots of code and
making all the tests much more complex. 
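
As a toy illustration of the first point, in plain Python with a made-up
three-level tree (not ZFS code): relocating a single leaf dirties every
ancestor that stores its address, and just finding those ancestors means
searching the tree backwards, since blocks don't know their parents.

    # Each node stores its children's addresses. Relocating "leaf_a"
    # means its parent must be rewritten (new child address), which means
    # the grandparent must be rewritten, and so on up to the root.
    tree = {
        "root":     {"children": ["indirect"]},
        "indirect": {"children": ["leaf_a", "leaf_b"]},
        "leaf_a":   {"children": []},
        "leaf_b":   {"children": []},
    }

    def blocks_to_rewrite(tree, moved):
        dirty = {moved}
        changed = True
        while changed:
            changed = False
            for name, node in tree.items():
                if name not in dirty and dirty & set(node["children"]):
                    dirty.add(name)
                    changed = True
        return dirty

    print(blocks_to_rewrite(tree, "leaf_a"))
    # -> {'leaf_a', 'indirect', 'root'}: one relocation, three rewrites;
    #    with dedup and clones the same leaf may be reachable from several
    #    trees, multiplying the work.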

--
Dan.
