On Fri, May 11, 2012 at 9:27 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Fri, May 4, 2012 at 3:44 AM, Zhi Yong Wu <zwu.ker...@gmail.com> wrote:
>> On Sun, Apr 29, 2012 at 1:35 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
>>> On Sat, Apr 28, 2012 at 5:25 PM, Zhi Yong Wu <zwu.ker...@gmail.com> wrote:
>>>
>>> Here are explanations for the quick ones that I know without looking at
>>> the code.  I think they will help you understand some of the other ones
>>> too.
>>>
>>> Please send questions like this to qemu-devel@nongnu.org so the thread
>>> is archived on the mailing lists and others can learn from it.
>> I feel better and freer if it is done privately. :)
>
> No one else in the community can learn from the discussion or help
> out.  We need to be comfortable doing things in public.
OK, I am making this mail discussion public, and I hope it can help others
interested in qcow2.
>
>>>
>>>> 1.
>>>> static int get_refcount(BlockDriverState *bs, int64_t cluster_index)
>>>> {
>>>>     ...
>>>>     refcount_table_index = cluster_index >> (s->cluster_bits -
>>>>         REFCOUNT_SHIFT);
>>>>     ...
>>>>
>>>>     block_index = cluster_index &
>>>>         ((1 << (s->cluster_bits - REFCOUNT_SHIFT)) - 1);
>>>> How to understand the two expressions?
>>>>     ...
>>>> }
>>>
>>> See "Host cluster management" in the qcow2 spec.  Refcounts are stored
>>> in a 2-level table, refcount_table_index is the index into the L1 table
>>> where we store the offset of refcount blocks.  block_index is the index
>>> of the refcount block element that contains the actual reference count.
>> I knew what refcount_table_index and block_index mean.  Actually what I
>> want to ask is what "cluster_index >> (s->cluster_bits -
>> REFCOUNT_SHIFT)" and "cluster_index & ((1 << (s->cluster_bits -
>> REFCOUNT_SHIFT)) - 1)" mean.  Why are they refcount_table_index and
>> block_index?  Why do we need REFCOUNT_SHIFT?  Because a refcount block
>> entry is 16 bits?
>
> The cluster_index is the image file cluster number:
>
> | 0 | 1 | 2 | 3 | ...
> image file
>
> So these calculations are just dividing and finding the remainder for
> the 2-level refcount data structure.  For example, if the refcount
> block holds 2 entries, then we have:
>
> | 0 | 1 | 2 | 3 | ...
> image file
>
> | A | B | ...
> image file -> refcount tables
>
> A: | x | x |  B: | x | x |
> refcount tables
>
> So cluster_index = 1 means we need to look at refcount table A at index 1.
>
> In other words:
> refcount_table_index = cluster_index / 2
> block_index = cluster_index % 2
>
> The shifts and bitwise ands are just another way of expressing the
> division/modulus operation.  And instead of using a constant like "2",
> qcow2 supports variable cluster sizes (cluster_bits) and has a
> REFCOUNT_SHIFT constant.
Great, I can now understand the two expressions and other similar ones, thanks.
>
>>>
>>>> 2.
>>>>
>>>> static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>     int64_t offset, int64_t length, int addend)
>>>> {
>>>>     ....
>>>>     if (addend < 0) {
>>>>         qcow2_cache_set_dependency(bs, s->refcount_block_cache,
>>>>             s->l2_table_cache);
>>>>     }
>>>> When addend = -1, why do we need to invoke qcow2_cache_set_dependency?
>>>
>>> The cache dependency is used to ensure that cached metadata is flushed
>>> in the correct order.  qcow2 must be careful to flush data to disk so
>>> that writes are ordered - otherwise a power failure could corrupt the
>>> image file when unordered writes are partially applied to the image
>>> file.
>> Great, thanks.
>>>
>>> For example, we want to allocate an image file cluster ("host
>>> cluster") *before* we reference it from a table.  Otherwise a power
>> But in this example an allocated cluster will not be referenced by any
>> table entry if a power failure takes place.
>
> It's not possible to keep the image in a "clean" state at all times,
> but we need to keep it in a "consistent" state at all times.
> "Consistent" means:
> 1. All L1/L2 table entries should point to allocated clusters.
> 2. Data should be written to allocated clusters before they are
>    referenced by L1/L2 entries.
>
> "Consistent" is weaker than "clean".  A "consistent" image can have
> leaked clusters that are allocated (refcount > 0) but not referenced
> by any L1/L2 table.
>
> For multi-step metadata updates you will find there is no order that
> is "clean" at every step - it's simply not possible, no matter which
> order you choose.  But there is an order that is "consistent" at every
> step and that's good enough (it means we will not lose data or corrupt
> the image file).
Great, thanks.
>
> BTW this is why file systems use journals.  Journals solve the
> consistent update problem in a different way - after a crash the
> journal is replayed to finish all in-flight updates, resulting in a
> consistent file system.  It's a different technique that none of the
> popular image formats use.
>
>>> failure could result in the table pointing to an unallocated cluster
>>> (the refcount update did not make it to disk but the L2 table update
>> Why do you say that the refcount update did not?  A refcount block entry
>> counts the references to one cluster.
>
> Because disk writes may sit in a volatile disk write cache or host OS
> page cache.  The order in which cached data is really written to disk
> is not defined.  That means we don't know if the refcount update makes
> it to disk before the power fails.
>
>>>> 12.
>>>>
>>>> static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
>>>>
>>>> I thought that this function should only flush the data to the OS
>>>> cache, not to disk, right?  But I checked it and found that it flushes
>>>> the data to the disk.
>>>
>>> Metadata updates must be handled very carefully - they need to be
>>> ordered so that a power failure never leaves the image file in an
>>> inconsistent (corrupt) state.  Therefore we *do* need to flush to disk
>> Then should we adjust this function name?  Otherwise it will confuse us.
>
> Maybe.  The BlockDriver callback name makes sense.  It just happens
> that the qcow2 implementation really does flush so that metadata
> updates are ordered.
Kevin has explained this in another mail thread.
>
>>> when applying several different types of metadata updates in order
>>> (e.g. L2 table, refcount table).
>> Should we usually update the L2 table *before* the refcount table, or
>> update the refcount table *before* the L2 table?
>
> When allocating clusters we first need to update the refcount table,
> write the data into the cluster, and then perform the L2 update.
When freeing clusters, the order is reversed.
>
>>>> 14.
>>>>
>>>> qcow2_snapshot_create() {
>>>>     ....
>>>>
>>>>     ret = bdrv_flush(bs);
>>>> Why does it need to flush the data here?
>>>
>>> To understand the meaning of a flush, look at the operation that was
>>> performed before it.
>>>
>>> Here we just incremented the refcounts.  We need to make sure these
>>> metadata updates are on disk before we can add the snapshot entry into
>>> the qcow2 image file - otherwise a power failure could result in a
>>> snapshot entry without allocated clusters.
>> But it may also result in clusters that are allocated with no
>> corresponding snapshot entry.
>
> Yes.  The image will still be "consistent", just not "clean".  We
> leaked clusters but did not lose data or corrupt the image file.
Great, thanks.
>
> Stefan
--
Regards,

Zhi Yong Wu
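
For anyone reading the archive later, here is a minimal C sketch of the divide/remainder reading of the two get_refcount() expressions from question 1. It is only an illustration, not the QEMU code: REFCOUNT_SHIFT is assumed to be 1 (2-byte, 16-bit refcount entries), and the three helper names are hypothetical, made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: each refcount entry is 2 bytes (16 bits), hence a shift of 1. */
#define REFCOUNT_SHIFT 1

/* Hypothetical helper: how many refcount entries fit in one refcount block,
 * i.e. cluster_size / entry_size = 2^(cluster_bits - REFCOUNT_SHIFT). */
static int64_t refcount_entries_per_block(int cluster_bits)
{
    assert(cluster_bits > REFCOUNT_SHIFT && cluster_bits < 32);
    return (int64_t)1 << (cluster_bits - REFCOUNT_SHIFT);
}

/* Hypothetical helper: the shift is a division by entries-per-block.
 * cluster_index is expected to be non-negative. */
static int64_t refcount_table_index(int64_t cluster_index, int cluster_bits)
{
    return cluster_index >> (cluster_bits - REFCOUNT_SHIFT);
}

/* Hypothetical helper: the mask is the remainder of that same division. */
static int64_t refcount_block_index(int64_t cluster_index, int cluster_bits)
{
    return cluster_index & (refcount_entries_per_block(cluster_bits) - 1);
}
```

With 64 KiB clusters (cluster_bits = 16) a refcount block holds 32768 entries, so cluster 70000 maps to refcount table index 2 and block index 4464, the same as 70000 / 32768 and 70000 % 32768. The toy case in the thread (2 entries per block) corresponds to cluster_bits - REFCOUNT_SHIFT = 1.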