> 1.  zio.c has functions like "zio_rewrite" and
> "zio_rewrite_gang_members".  ZFS is copy-on-write, so it should never
> be rewriting anything, right?   Also, zio_write_compress makes a
> cryptic reference to spa_sync.

Right.  The only time we rewrite an existing block is when the block
was allocated in the same transaction group we're currently syncing.
This can happen during "sync to convergence", which I'll describe
in a moment.

> 2. Gang Blocks: While not explicitly spelled out anywhere (except
> maybe the source code), it seems to me that the behavior is this:
> system needs to write a 128KB block, but can't allocate a contiguous
> 128KB (in which case, you've got issues), so it allocates two 64KB
> blocks and a 'gang block' to point to them.  When somebody tries to
> read back the original 128KB block, the ZIO subsystem reads the two
> 64KB halves and pieces them back together -- and the upper layers of
> code are none the wiser.   Is this correct?

Exactly right.

> 3. Gang Blocks II: Can a gang block point to other gang blocks?

Yes.  It's not likely to come up outside of testing, but it works.

> 4. Gang Blocks III: If a gang block contains up to 3 pointers
> (according to the 'on-disk format' doc) and it *cannot* point to other
> gang blocks, does that mean that ZIO can split a block into at most 3
> pieces?

No -- worst case, they can be nested as mentioned above.
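To make the reassembly concrete, here's a toy sketch in C.  This is not the real on-disk layout (the actual gang header is zio_gbh_phys_t, which holds a small fixed array of block pointers plus padding and a tail); it just shows why nesting makes reads naturally recursive: a pointer in a gang header may itself reference another gang header.

```c
#include <string.h>

/* Toy model of gang-block reassembly -- illustrative, not the real
 * ZFS structures.  A gang header holds a few block pointers; each
 * pointer may reference either leaf data or another gang header. */

#define GANG_PTRS 3

typedef struct blk {
    int is_gang;                  /* 1 if this block is a gang header */
    const char *data;             /* leaf data (when is_gang == 0) */
    struct blk *child[GANG_PTRS]; /* constituents (when is_gang == 1) */
} blk_t;

/* Reassemble the logical block into buf; returns bytes written.
 * The caller -- like the upper layers of ZFS -- just sees one block. */
static size_t read_block(const blk_t *bp, char *buf)
{
    if (!bp->is_gang) {
        size_t n = strlen(bp->data);
        memcpy(buf, bp->data, n);
        return n;
    }
    size_t off = 0;
    for (int i = 0; i < GANG_PTRS; i++)
        if (bp->child[i] != NULL)
            off += read_block(bp->child[i], buf + off);
    return off;
}
```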

> 5. spa_sync has a loop with the comment "Iterate to convergence".  I
> was under the impression that the sync operation just made sure all
> outstanding writes were committed to disk.  How is committing that
> data to disk going to change that data?

With the lone exception of the uberblock, *everything* in ZFS is stored
in transactional datasets.  This includes not just user data, but also
metadata, including pool-wide metadata like space maps.

On the first pass of spa_sync() we write to disk every block that
was modified in that transaction group -- this is dsl_pool_sync().
As a side effect, we allocate and free a bunch of blocks, which we
record in our in-core space map structures.

The next thing we do is propagate these in-core space map changes
to their on-disk counterparts by writing to the space map objects
(vdev_sync() -> metaslab_sync() -> space_map_sync() -> dmu_write()).
The act of doing this marks the space map objects as modified.
This is fundamentally no different than modifying any other object.

However, we have a chicken-and-egg problem: we now have a modified
dataset (the pool's MOS, or meta-object set) that has to be synced.
So, we now enter the second pass of spa_sync().  Here we do the exact
same thing as we did on the first pass, but of course there's a lot
less data this time.  We keep doing this until the pool stops wiggling.
The thing is, this iterative process would never converge if we
allocated new blocks on every pass.  So on each pass, when writing
to a particular block, we first ask whether that block was born in the
same transaction group that we're currently syncing.  If so, then
since it's not part of any prior transaction group, and the current
transaction group is not yet committed, we can safely overwrite the
existing block rather than freeing it and allocating a new one --
the important implication being that no space maps are modified
in the process.  That's what allows spa_sync() to converge.

There are a few twists worth noting.

First, compression adds a wrinkle because if the data compresses
to a different size, we have to allocate a new block of that size.
In theory, this could go on forever.  To guard against it, we stop
compressing after the first few passes.
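The guard is just a pass-number check.  Sketched below with an illustrative threshold constant (the real tunable lives with zio_sync_pass; the name and value here are made up):

```c
/* Illustrative threshold -- not the real tunable's name or value. */
#define SYNC_PASS_DONT_COMPRESS 5

/* After the first few sync passes, force compression off so the
 * block's physical size stops changing from pass to pass, which
 * lets us rewrite in place and lets spa_sync() converge. */
static int effective_compress(int sync_pass, int requested_compress)
{
    if (sync_pass >= SYNC_PASS_DONT_COMPRESS)
        return 0;                  /* compression off */
    return requested_compress;     /* honor the dataset's setting */
}
```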

Second, it's generally faster to write to newly-allocated space, which
tends to be physically contiguous, than to be forced to rewrite at a
specific location on disk.  That's one of the benefits of a copy-on-write
approach in general.  So we have the option of continuing to allocate new
blocks for the first few passes, then switch to rewrites when there are
only a few blocks left.  In practice, however, we haven't found this to be
a net win, because there just aren't many blocks after the first pass.
The benefits of copy-on-write locality seem to be neutralized by the cost
of additional sync passes.

Third, there are many space maps in a large pool -- a few hundred per
device times as many devices as you've got.  The more space maps you
touch, the longer it takes for spa_sync() to converge.  We can make
block allocation as localized as we want -- typically touching just
one space map -- but we have no control over the locality of frees.
Therefore, after the first few passes we start recording frees not in
their space maps, but in a single deferred-free list (zio_free() ->
bplist_enqueue_deferred()).  We then process the deferred frees at the
beginning of the next transaction group (spa_sync_deferred_frees()).
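The routing decision is another simple pass-number check.  A sketch, with made-up names and an illustrative threshold (the real path is zio_free() -> bplist_enqueue_deferred(), processed later by spa_sync_deferred_frees()):

```c
/* Illustrative threshold -- not the real tunable's name or value. */
#define SYNC_PASS_DEFERRED_FREE 2

typedef struct {
    int spacemaps_touched;  /* space-map writes caused by frees */
    int deferred;           /* frees parked for the next txg */
} free_stats_t;

/* Early passes free into space maps directly; later passes park the
 * free on a single deferred list so no additional space maps get
 * dirtied and the sync loop can converge. */
static void free_block(free_stats_t *st, int sync_pass)
{
    if (sync_pass >= SYNC_PASS_DEFERRED_FREE)
        st->deferred++;
    else
        st->spacemaps_touched++;  /* worst case: one map per free */
}
```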

For testing purposes, there are tunables that govern the thresholds
for each of these three behaviors -- see zio_sync_pass.

Jeff