Re: [Qemu-devel] Some questions about qcow2 code

Zhi Yong Wu Mon, 14 May 2012 08:03:34 -0700

On Fri, May 11, 2012 at 9:27 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Fri, May 4, 2012 at 3:44 AM, Zhi Yong Wu <zwu.ker...@gmail.com> wrote:
>> On Sun, Apr 29, 2012 at 1:35 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
>>> On Sat, Apr 28, 2012 at 5:25 PM, Zhi Yong Wu <zwu.ker...@gmail.com> wrote:
>>>
>>> Here are explanations for the quick ones that I know without looking at the
>>> code.  I think they will help you understand some of the other ones
>>> too.
>>>
>>> Please send questions like this to qemu-devel@nongnu.org so the thread
>>> is archived on the mailing lists and others can learn from it.
>> I feel more comfortable and free if it is kept private. :)
>
> No one else in the community can learn from the discussion or help
> out.  We need to be comfortable doing things in public.
OK, I am making this mail discussion public, and I hope it can help others
interested in qcow2.
>
>>>
>>>> 1.
>>>> static int get_refcount(BlockDriverState *bs, int64_t cluster_index)
>>>> {
>>>> ...
>>>>    refcount_table_index = cluster_index >> (s->cluster_bits - REFCOUNT_SHIFT);
>>>> ...
>>>>
>>>>    block_index = cluster_index &
>>>>        ((1 << (s->cluster_bits - REFCOUNT_SHIFT)) - 1);
>>>> How should I understand these two expressions?
>>>> ...
>>>> }
>>>
>>> See "Host cluster management" in the qcow2 spec.  Refcounts are stored
>>> in a 2-level table.  refcount_table_index is the index into the first
>>> level (the refcount table), where we store the offsets of refcount
>>> blocks.  block_index is the index of the refcount block element that
>>> contains the actual reference count.
>> I know what refcount_table_index and block_index mean. What I actually
>> want to ask is what "cluster_index >> (s->cluster_bits -
>> REFCOUNT_SHIFT)" and "cluster_index & ((1 << (s->cluster_bits -
>> REFCOUNT_SHIFT)) - 1)" mean. Why do they give refcount_table_index and
>> block_index? Why is REFCOUNT_SHIFT needed? Is it because a refcount
>> block entry is 16 bits?
>
> The cluster_index is the image file cluster number:
>
> | 0 | 1 | 2 | 3 | ...
> image file
>
> So these calculations are just dividing and finding the remainder for
> the 2-level refcount data structure.  For example, if the refcount
> block holds 2 entries, then we have:
>
> | 0 | 1 | 2 | 3 | ...
> image file
>
> | A     | B     | ...
> image file -> refcount blocks
>
> A: | x | x | B: | x | x |
> refcount blocks
>
> So cluster_index = 1 means we need to look at refcount block A at index 1.
>
> In other words:
> refcount_table_index = cluster_index / 2
> block_index = cluster_index % 2
>
> The shifts and bitwise ands are just another way of expressing the
> division/modulus operation.  And instead of using a constant like "2",
> qcow2 supports variable cluster sizes (cluster_bits) and has a
> REFCOUNT_SHIFT constant.
Great, I can now understand the two expressions and other similar ones, thanks.
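
To make the equivalence concrete, here is a minimal sketch that computes
both indices in the shift/mask form and in the division/modulus form. It
assumes REFCOUNT_SHIFT is 1 (2-byte refcount entries, as guessed above);
the struct and function names are made up for illustration and are not
taken from the qcow2 code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: refcount entries are 2 bytes (16 bits) wide, so one cluster
 * holds (cluster size >> 1) of them. */
#define REFCOUNT_SHIFT 1

/* Hypothetical helper type, for illustration only. */
struct refcount_pos {
    int64_t table_index;  /* which refcount block (index into the refcount table) */
    int64_t block_index;  /* which entry inside that refcount block */
};

/* Shift/mask form, mirroring the expressions quoted from get_refcount(). */
static struct refcount_pos pos_shift_mask(int64_t cluster_index, int cluster_bits)
{
    int bits = cluster_bits - REFCOUNT_SHIFT;
    struct refcount_pos p;

    assert(cluster_index >= 0 && bits > 0 && bits < 62);
    p.table_index = cluster_index >> bits;
    p.block_index = cluster_index & (((int64_t)1 << bits) - 1);
    return p;
}

/* Equivalent division/modulus form. */
static struct refcount_pos pos_div_mod(int64_t cluster_index, int cluster_bits)
{
    int64_t entries_per_block = (int64_t)1 << (cluster_bits - REFCOUNT_SHIFT);
    struct refcount_pos p;

    p.table_index = cluster_index / entries_per_block;
    p.block_index = cluster_index % entries_per_block;
    return p;
}
```

With cluster_bits = 16 (64 KB clusters) a refcount block holds 32768
entries under this assumption, so cluster 32769 lands in refcount block 1
at entry 1. Passing the toy value cluster_bits = 2 reproduces the
2-entry example from the diagram.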
>
>>>
>>>> 2.
>>>>
>>>> static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>    int64_t offset, int64_t length, int addend)
>>>> {
>>>> ....
>>>>    if (addend < 0) {
>>>>        qcow2_cache_set_dependency(bs, s->refcount_block_cache,
>>>>            s->l2_table_cache);
>>>>    }
>>>> When addend = -1, why do we need to invoke qcow2_cache_set_dependency?
>>>
>>> The cache dependency is used to ensure that cached metadata is flushed
>>> in the correct order.  qcow2 must be careful to flush data to disk so
>>> that writes are ordered - otherwise a power failure could corrupt the
>>> image file when unordered writes are partially applied to the image
>>> file.
>> great, thanks
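
As a rough illustration of what a cache dependency buys us, here is a toy
model of two caches where flushing one first flushes the cache it depends
on. This is only a sketch of the idea as I understand it; toy_cache,
flush_log, toy_set_dependency and toy_flush are hypothetical names and
not the qcow2 cache implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOG 8

/* Hypothetical toy cache: a name, a dirty flag and at most one dependency. */
struct toy_cache {
    const char *name;
    int dirty;
    struct toy_cache *depends;  /* must reach disk before this cache does */
};

/* Records the order in which caches were written out. */
struct flush_log {
    const char *order[MAX_LOG];
    int count;
};

/* Declare that dep has to be flushed before src. */
static void toy_set_dependency(struct toy_cache *src, struct toy_cache *dep)
{
    src->depends = dep;
}

/* Flush a cache, flushing its dependency first if it has one. */
static void toy_flush(struct toy_cache *c, struct flush_log *log)
{
    if (c->depends != NULL) {
        struct toy_cache *dep = c->depends;

        c->depends = NULL;      /* the dependency is satisfied once flushed */
        toy_flush(dep, log);
    }
    if (c->dirty) {
        if (log->count < MAX_LOG) {
            log->order[log->count++] = c->name;
        }
        c->dirty = 0;
    }
}
```

If the toy refcount block cache is made to depend on the toy L2 table
cache, as in the addend < 0 case quoted above, flushing the refcount
block cache writes the L2 tables out first.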
>>>
>>> For example, we want to allocate an image file cluster ("host
>>> cluster") *before* we reference it from a table.  Otherwise a power
>> But in this example, an allocated cluster will not be referenced
>> by any table entry if a power failure takes place.
>
> It's not possible to keep the image in a "clean" state at all times,
> but we need to keep it in a "consistent" state at all times.
> "Consistent" means:
> 1. All L1/L2 table entries should point to allocated clusters.
> 2. Data should be written to allocated clusters before they are
> referenced by L1/L2 entries.
>
> "Consistent" is weaker than "clean".  A "consistent" image can have
> leaked clusters that are allocated (refcount > 0) but not referenced
> by any L1/L2 table.
>
> For multi-step metadata updates you will find there is no order that
> is "clean" at every step - it's simply not possible, no matter which
> order you choose.  But there is an order that is "consistent" at every
> step and that's good enough (it means we will not lose data or corrupt
> the image file).
Great, thanks.
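
The difference between "clean", "consistent" and corrupt can be written
down as a small check over a toy model. This is only a sketch: the arrays
below are a made-up stand-in for the refcount and L2 structures, not the
qcow2 on-disk layout, and rule 2 (data written before it is referenced)
is not modelled.

```c
#include <assert.h>
#include <stddef.h>

enum image_state { IMAGE_CLEAN, IMAGE_CONSISTENT_LEAKED, IMAGE_CORRUPT };

/*
 * Toy model (assumption, not the real format):
 *   refcount[i]  reference count of host cluster i
 *   l2[j]        host cluster referenced by guest cluster j, or -1 if unmapped
 */
static enum image_state classify(const int *refcount, size_t n_clusters,
                                 const int *l2, size_t n_entries)
{
    int leaked = 0;
    size_t i, j;

    /* Rule 1: every table entry must point to an allocated cluster. */
    for (j = 0; j < n_entries; j++) {
        if (l2[j] < 0) {
            continue;
        }
        if ((size_t)l2[j] >= n_clusters || refcount[l2[j]] == 0) {
            return IMAGE_CORRUPT;
        }
    }

    /* Allocated but unreferenced clusters are leaks, not corruption. */
    for (i = 0; i < n_clusters; i++) {
        int referenced = 0;

        for (j = 0; j < n_entries; j++) {
            if (l2[j] >= 0 && (size_t)l2[j] == i) {
                referenced = 1;
            }
        }
        if (refcount[i] > 0 && !referenced) {
            leaked = 1;
        }
    }
    return leaked ? IMAGE_CONSISTENT_LEAKED : IMAGE_CLEAN;
}
```

A leaked cluster only wastes space, which is why this state is
acceptable after a crash while a dangling table entry is not.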
>
> BTW this is why file systems use journals.  Journals solve the
> consistent update problem in a different way - after a crash the
> journal is replayed to finish all in-flight updates, resulting in a
> consistent file system.  It's a different technique that none of the
> popular image formats use.
>
>>> failure could result in the table pointing to an unallocated cluster
>>> (the refcount update did not make it to disk but the L2 table update
>> Why do you say that the refcount update did not make it? The refcount
>> block entry counts the references to a cluster.
>
> Because disk writes may sit in a volatile disk write cache or host OS
> page cache.  The order in which cached data is really written to disk
> is not defined.  That means we don't know if the refcount update makes
> it to disk before the power fails.
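
At the level of plain POSIX file I/O, the usual way to force such an
order is to put a flush between the two writes. The sketch below assumes
a POSIX system; write_in_order is a hypothetical helper and has nothing
to do with how the qcow2 driver issues its I/O. fsync() asks the OS to
push the file's cached data to the storage device before the second
write is issued.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `first`, flush it, then write `second` and flush again.
 * Returns 0 on success and -1 on error.  Short writes are treated as
 * errors to keep the sketch small.
 */
static int write_in_order(int fd,
                          const void *first, size_t first_len, off_t first_off,
                          const void *second, size_t second_len, off_t second_off)
{
    if (pwrite(fd, first, first_len, first_off) != (ssize_t)first_len) {
        perror("pwrite (first)");
        return -1;
    }
    if (fsync(fd) != 0) {       /* barrier: first must be out before second */
        perror("fsync");
        return -1;
    }
    if (pwrite(fd, second, second_len, second_off) != (ssize_t)second_len) {
        perror("pwrite (second)");
        return -1;
    }
    return fsync(fd);
}
```

Without the fsync() in the middle, both writes could sit in the page
cache and reach the disk in either order.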
>
>>>> 12.
>>>>
>>>> static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
>>>>
>>>> I thought that this function should only flush the data to the OS cache,
>>>> not to disk, right? But I checked it and found that it flushes the data
>>>> to the disk.
>>>
>>> Metadata updates must be handled very carefully - they need to be
>>> ordered so that a power failure never leaves the image file in an
>>> inconsistent (corrupt) state.  Therefore we *do* need to flush to disk
>> Then should we adjust this function name? Otherwise it will confuse us.
>
> Maybe.  The BlockDriver callback name makes sense.  It just happens
> that the qcow2 implementation really does flush so that metadata
> updates are ordered.
Kevin has explained this in another mail thread.
>
>>> when applying several different types of metadata updates in order
>>> (e.g. L2 table, refcount table).
>> Should we usually update the L2 table *before* the refcount table, or
>> the refcount table *before* the L2 table?
>
> When allocating clusters we first need to update the refcount table,
> write the data into the cluster, and then perform the L2 update.
When freeing clusters, the order is reversed.
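
The two orders can be checked with a toy simulation that applies the
steps one at a time and asks whether a crash after any of them leaves the
image consistent. This is a sketch under simplifying assumptions: one
host cluster, one L2 entry, and every step reaching the disk before the
next one. All names are hypothetical.

```c
#include <assert.h>

/* Toy image: one host cluster and one L2 entry (not the qcow2 layout). */
struct toy_image {
    int refcount;   /* refcount of the host cluster */
    int has_data;   /* guest data has been written into the cluster */
    int l2_mapped;  /* the L2 entry points at the cluster */
};

enum step {
    STEP_REFCOUNT_INC, STEP_WRITE_DATA, STEP_L2_SET,
    STEP_L2_CLEAR, STEP_REFCOUNT_DEC
};

static void apply_step(struct toy_image *img, enum step s)
{
    switch (s) {
    case STEP_REFCOUNT_INC: img->refcount++;    break;
    case STEP_WRITE_DATA:   img->has_data = 1;  break;
    case STEP_L2_SET:       img->l2_mapped = 1; break;
    case STEP_L2_CLEAR:     img->l2_mapped = 0; break;
    case STEP_REFCOUNT_DEC: img->refcount--;    break;
    }
}

/* A mapped L2 entry must point at an allocated cluster that holds data. */
static int is_consistent(const struct toy_image *img)
{
    return !img->l2_mapped || (img->refcount > 0 && img->has_data);
}

/* Returns 1 if the image is consistent after every prefix of the steps. */
static int survives_crash_anywhere(struct toy_image img,
                                   const enum step *steps, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        apply_step(&img, steps[i]);
        if (!is_consistent(&img)) {
            return 0;
        }
    }
    return 1;
}
```

In this model, allocating as refcount, data, L2 and freeing as L2,
refcount both pass, while setting the L2 entry first or dropping the
refcount first fails at the first step. The intermediate states that
pass are exactly the "leaked cluster" states discussed earlier.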
>
>>>> 14.
>>>>
>>>> qcow2_snapshot_create()  {
>>>> ....
>>>>
>>>>    ret = bdrv_flush(bs);
>>>> Why does it need to flush the data here?
>>>
>>> To understand the meaning of a flush, look at the operation that was
>>> performed before it.
>>>
>>> Here we just incremented the refcounts.  We need to make sure these
>>> metadata updates are on disk before we can add the snapshot entry into
>>> the qcow2 image file - otherwise a power failure could result in a
>>> snapshot entry without allocated clusters.
>> But it may also result in clusters that are allocated but have
>> no corresponding snapshot entry.
>
> Yes.  The image will still be "consistent", just not "clean".  We
> leaked clusters but did not lose data or corrupt the image file.
Great, thanks.
>
> Stefan




-- 
Regards,

Zhi Yong Wu
