On 04/18/2017 02:22 PM, Kevin Wolf wrote:
> Am 14.04.2017 um 06:17 hat Denis V. Lunev geschrieben:
>> [skipped...]
>>
>>> Hi Denis,
>>>
>>> I've read this entire thread now and I really like Berto's summary which
>>> I think is one of the best recaps of existing qcow2 problems and this
>>> discussion so far.
>>>
>>> I understand your opinion that we should focus on compatible changes
>>> before incompatible ones, and I also understand that you are very
>>> concerned about physical fragmentation for reducing long-term IO.
>>>
>>> What I don't understand is why you think that subclusters will increase
>>> fragmentation. If we admit that fragmentation is a problem now, surely
>>> increasing cluster sizes to 1 or 2 MB will only help to *reduce*
>>> physical fragmentation, right?
>>>
>>> Subclusters as far as I am understanding them will not actually allow
>>> subclusters to be located at virtually disparate locations, we will
>>> continue to allocate clusters as normal -- we'll just keep track of
>>> which portions of the cluster we've written to to help us optimize COW*.
>>>
>>> So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
>>> we write just the first subcluster, we'll have a map like:
>>>
>>> X---------------
>>>
>>> Whatever actually happens to exist in this space, whether it be a hole
>>> we punched via fallocate or literal zeroes, this space is known to the
>>> filesystem to be contiguous.
>>>
>>> If we write to the last subcluster, we'll get:
>>>
>>> X--------------X
>>>
>>> And again, maybe the dashes are a fallocate hole, maybe they're zeroes,
>>> but the last subcluster is located virtually exactly 15 subclusters
>>> behind the first, they're not physically contiguous. We've saved the
>>> space between them. Future out-of-order writes won't contribute to any
>>> fragmentation, at least at this level.
>>>
>>> You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
>>> the subclusters right, we'll have *zero*, won't we?
>>>
>>> As far as I can tell, this lets us do a lot of good things all at once:
>>>
>>> (1) We get some COW optimizations (for reasons Berto and Kevin have
>>> detailed)
>>
>> Yes. We are fine with COW. Let us assume that we have issued a read of
>> the entire cluster after the COW, in the situation
>>
>> X--------------X
>>
>> with a backing store. This is possible even with a 1-2 MB cluster size.
>> I have seen 4-5 MB requests from the guest in real life. In this case
>> we will have 3 IOPs:
>> read the left X area, read the backing, read the right X.
>> This is the real drawback of the approach if the sub-cluster size is
>> really small, which should be the case for optimal COW. Thus we will
>> have random IO in the host instead of sequential IO in the guest.
>> Thus we have optimized COW at the cost of long-term reads. This is
>> what I am worrying about, as we can have a lot of such reads before
>> any further data change.
>
> So just to avoid misunderstandings about what you're comparing here:
> You get these 3 iops for 2 MB clusters with 64k subclusters, whereas you
> would get only a single operation for 2 MB clusters without subclusters.
> Today's 64k clusters without subclusters behave no better than the
> 2M/64k version, but that's not what you're comparing.
>
> Yes, you are correct about this observation. But it is a tradeoff that
> you're intentionally making when using backing files.
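To make the read pattern above concrete, here is a small sketch in plain
Python (illustrative only, not QEMU code; the 1 MB / 64k geometry and the
allocation bitmap are assumptions taken from the X--------------X example)
of how a full-cluster guest read splits into host requests:

CLUSTER_SIZE = 1024 * 1024                 # 1 MB cluster, as in the example
SUBCLUSTER_SIZE = 64 * 1024                # 64k subclusters
N_SUB = CLUSTER_SIZE // SUBCLUSTER_SIZE    # 16 subclusters

# Hypothetical allocation bitmap: only the first and the last subcluster
# have been written, i.e. the "X--------------X" case.
allocated = [False] * N_SUB
allocated[0] = allocated[-1] = True

def read_ops(bitmap):
    """Merge adjacent subclusters in the same state into one host request."""
    ops, start = [], 0
    for i in range(1, len(bitmap) + 1):
        if i == len(bitmap) or bitmap[i] != bitmap[start]:
            source = "this image" if bitmap[start] else "backing file"
            ops.append((source,
                        start * SUBCLUSTER_SIZE,
                        (i - start) * SUBCLUSTER_SIZE))
            start = i
    return ops

for source, offset, length in read_ops(allocated):
    print("read %7d bytes at offset %7d from %s" % (length, offset, source))
# Prints three requests: local read, backing-file read, local read -- the
# three I/O operations described above, versus one contiguous read when
# the whole cluster is allocated.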
> In the extreme, there is an alternative that performs so much better:
> Instead of using a backing file, use 'qemu-img convert' to copy (and
> defragment) the whole image upfront. No COW whatsoever, no
> fragmentation, fast reads. The downside is that it takes a while to
> copy the whole image upfront, and it also costs quite a bit of disk
> space.

In general, for production environments, this is a total pain. We have a
lot of customers with TB-sized images. Free space is also a real problem
for them.
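For reference, the upfront copy described above is a single 'qemu-img
convert' invocation; something along these lines (the file names and the
1M cluster size are only an example) collapses the backing chain into a
standalone, defragmented image:

qemu-img convert -p -O qcow2 -o cluster_size=1M overlay.qcow2 flat.qcow2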
> So once we acknowledge that we're dealing with a tradeoff here, it
> becomes obvious that neither the extreme setup for performance (copy the
> whole image upfront) nor the extreme setup for sparseness (COW on a
> sector level) are the right default for the average case, nor is
> optimising one-sidedly a good idea. It is good if we can provide
> solutions for extreme cases, but by default we need to cater for the
> average case, which cares both about reasonable performance and disk
> usage.

Yes, I agree. But a 64 KB cluster size by default for big images (not for
backup!) is another extreme ;) Who will care about a few MBs with a 1 TB
or 10 TB image? Please note that 1 MB is better as the default block size,
as at that size a sequential write is about equal to a random write on
non-SSD disks.

Den
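As a footnote to the 1 MB point: a rough back-of-the-envelope model in
Python (all figures below are assumptions for a typical spinning disk,
not measurements) of why large blocks amortise the seek cost:

# Assumed spinning-disk parameters (illustrative only).
SEEK_TIME = 0.008          # ~8 ms average seek + rotational latency
SEQ_THROUGHPUT = 125e6     # ~125 MB/s sequential transfer rate

def random_write_throughput(block_size):
    """Effective throughput when every block needs a seek first."""
    return block_size / (SEEK_TIME + block_size / SEQ_THROUGHPUT)

for bs in (64 * 1024, 1024 * 1024, 2 * 1024 * 1024):
    tp = random_write_throughput(bs)
    print("%4d KB blocks: %5.1f MB/s (%2d%% of sequential)"
          % (bs // 1024, tp / 1e6, 100 * tp / SEQ_THROUGHPUT))
# With these assumptions, 64 KB random writes reach only a few percent of
# the sequential rate, while 1-2 MB blocks get within the same order of
# magnitude -- the effect the cluster-size argument above relies on.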