On 22.07.2011 11:13, Frediano Ziglio wrote:
> 2011/7/22 Kevin Wolf <kw...@redhat.com>:
>> On 21.07.2011 18:17, Frediano Ziglio wrote:
>>> Hi,
>>> after a snapshot is taken, many write operations are currently quite
>>> slow due to:
>>> - refcount updates (decrement old, increment new)
>>> - cluster allocation and file expansion
>>> - read-modify-write on partial clusters
>>>
>>> I found two ways to improve refcount performance.
>>>
>>> Method 1 - Lazy count
>>> The idea is not to account for the current snapshot, i.e. the current
>>> snapshot counts as 0. This would require adding a current_snapshot
>>> field to the header and updating refcounts when the current snapshot
>>> changes. The operations become:
>>> - creating a snapshot: performance is the same, you just increment
>>> for the old snapshot instead of the new one
>>> - normal write operations: as the current snapshot counts as 0, there
>>> is nothing to do here, so no refcount data is written
>>> - changing the current snapshot: this is the worst case, as you have
>>> to increment for the old current snapshot and decrement for the new
>>> one, so it takes twice as long
>>> - deleting a snapshot: if it is the current one, just set
>>> current_snapshot to a dummy non-existing value; if it is not, just
>>> decrement the counters; no performance changes
>>
>> How would you do cluster allocation if you don't have refcounts any
>> more that can tell you whether a cluster is used or not?
>>
>
> You still have refcounts, it's only that the current snapshot counts
> as 0. An example may help. Start with a snapshot "A"; snapshot A
> counts as zero, so all refcounts are 0. Now we create a snapshot "B"
> and make it current, so refcounts are 1:
>
> A --- B
>
> If you change a cluster in snapshot "B", counts are still 1. If you go
> back to "A", counters are incremented (because you leave B) and then
> decremented (because you enter A).
>
> Perhaps the problem is how to distinguish 0 from "allocated in
> current" and "not allocated".
> Yes, I suppose that is a problem, but we can easily use -1 for "not
> allocated". If the cluster belongs to the current snapshot and its
> refcount is 0, mark it as -1; if it is not current, we would have to
> increment the counters of the current snapshot, mark the current one
> as -1 and then decrement for the deletion; yes, in this case it takes
> twice the time.
Yes, this is the problem that I meant. If you use -1 for "not
allocated", you're back to our current situation, just with refcount - 1
for each cluster. In particular, you now need to update refcounts on
writes again (in order to change from -1 to 0).

>>> Method 2 - Read-only parent
>>> Here parents are read-only. Instead of storing a refcount, store a
>>> numeric id of the owner. If the owner is not the current snapshot,
>>> copy the cluster and change the copy. Consider this situation:
>>>
>>> A --- B --- C
>>>
>>> B cannot be changed, so in order to "change" B you have to create a
>>> new snapshot
>>>
>>> A --- B --- C
>>>        \--- D
>>>
>>> and change D. This can take more space, because in this case you
>>> have an additional snapshot.
>>>
>>> Operations:
>>> - creating a snapshot: really fast, as you don't have to change any
>>> ownership
>>> - normal write operations: if the owner is not the current snapshot,
>>> allocate a new cluster and just store the new owner for the new
>>> cluster. Ownership of past-the-end clusters could also all be set to
>>> the current owner in order to coalesce allocations
>>> - changing the current snapshot: no changes required for owners
>>> - deleting a snapshot: only possible if it has no child or a single
>>> child. This will require scanning all L2 tables and merging and
>>> updating owners.
>>
>> I think this has similar characteristics to what we have with
>> external snapshots (i.e. backing files). The advantage of applying it
>> to internal snapshots is that when deleting a snapshot you don't have
>> to copy around all the data.
>>
>> Probably this change could even be done transparently for the user,
>> so that B still appears to be writeable, but in fact refers to D now.
>>
>>
>> Anyway, have you checked how bad the refcount work really is? I think
>> that writing the VM state takes a lot longer, so optimising the
>> refcount update may be the wrong approach, especially if it requires
>> a format change.
>> My results with qemu-img snapshot suggest that it's not worth it:
>>
>> kwolf@dhcp-5-188:~/images$ ~/source/qemu/qemu-img info scratch.qcow2
>> image: scratch.qcow2
>> file format: qcow2
>> virtual size: 8.0G (8589934592 bytes)
>> disk size: 4.0G
>> cluster_size: 65536
>> kwolf@dhcp-5-188:~/images$ time ~/source/qemu/qemu-img snapshot -c test scratch.qcow2
>>
>> real    0m0.116s
>> user    0m0.009s
>> sys     0m0.040s
>> kwolf@dhcp-5-188:~/images$ time ~/source/qemu/qemu-img snapshot -d test scratch.qcow2
>>
>> real    0m0.084s
>> user    0m0.011s
>> sys     0m0.044s
>>
>> Kevin
>
> I'm not worried about the time it takes to create a snapshot, but
> about normal use after a snapshot has been taken. As you stated, while
> taking a snapshot you can disable writethrough caching, making it very
> fast, but during normal operation you can't.

Well, the obvious solution is not to use writethrough in this case. You
only need it for some broken guest OSes.

The other solution is adding a dirty flag which says that the refcounts
on disk may not be accurate and must be rebuilt after a crash. In this
case you can drive the metadata cache in writeback mode even with
cache=writethrough. This dirty flag is included in my proposal for
qcow2v3.

> Personally, I'm pondering a log too, to allow collapsing metadata
> updates. Possibly even an external full log (in another file,
> including data) to try to reduce the overhead caused by
> read-modify-write during partial cluster updates and to reduce file
> fragmentation. But as you can see from my patches, I'm still
> exercising myself with the QEMU code.

A journal is something to consider, yes. It's something that requires
some development effort, but long term I think it could provide some
nice advantages. I'm not sure if using it for the full data will help,
but for metadata it would certainly make sense.

Kevin