On 2018-08-20 18:33, Vladimir Sementsov-Ogievskiy wrote:
> 17.08.2018 22:34, Max Reitz wrote:
>> On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
>>> 16.08.2018 03:51, Max Reitz wrote:
>>>> On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> Here is an asynchronous scheme for handling fragmented qcow2
>>>>> reads and writes.  Both the qcow2 read and the write function loop
>>>>> through sequential portions of data.  The series aims to parallelize
>>>>> these loop iterations.
>>>>>
>>>>> It improves performance for fragmented qcow2 images; I've tested it
>>>>> as follows:
>>>>>
>>>>> I have four 4G qcow2 images (with the default 64k cluster size) on
>>>>> my ssd disk:
>>>>>   t-seq.qcow2       - sequentially written qcow2 image
>>>>>   t-reverse.qcow2   - filled by writing 64k portions from the end
>>>>>                       to the start
>>>>>   t-rand.qcow2      - filled by writing 64k portions (aligned) in
>>>>>                       random order
>>>>>   t-part-rand.qcow2 - filled by shuffling the order of the 64k
>>>>>                       writes within each 1m region
>>>>> (see the source code of the image generation at the end for details)
>>>>>
>>>>> and the test (sequential I/O in 1m chunks):
>>>>>
>>>>> test write:
>>>>>     for t in /ssd/t-*; \
>>>>>       do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
>>>>>       ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
>>>>>     done
>>>>>
>>>>> test read (same, just drop the -w parameter):
>>>>>     for t in /ssd/t-*; \
>>>>>       do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
>>>>>       ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>>>>>     done
>>>>>
>>>>> short info about the parameters:
>>>>>   -w      - do writes (otherwise do reads)
>>>>>   -c      - count of blocks
>>>>>   -s      - block size
>>>>>   -t none - disable cache
>>>>>   -n      - native aio
>>>>>   -d 1    - don't use the parallel requests provided by qemu-img
>>>>>             bench itself
>>>> Hm, actually, why not?  And how does a guest behave?
>>>>
>>>> If parallel requests on an SSD perform better, wouldn't a guest issue
>>>> parallel requests to the virtual device and thus to qcow2 anyway?
>>> The guest knows nothing about qcow2 fragmentation, so this kind of
>>> "asynchronization" can only be done at the qcow2 level.
>> Hm, yes.  I'm sorry, but without having looked closer at the series
>> (which is why I'm sorry in advance), I would suspect that the
>> performance improvement comes from us being able to send parallel
>> requests to an SSD.
>>
>> So if you send large requests to an SSD, you may either send them in
>> parallel or sequentially, it doesn't matter.  But for small requests,
>> it's better to send them in parallel so the SSD always has requests in
>> its queue.
>>
>> I would think this is where the performance improvement comes from.
>> But I would also think that a guest OS knows this and it would also
>> send many requests in parallel so the virtual block device never runs
>> out of requests.
>>
>>> However, if the guest does async I/O and sends a lot of parallel
>>> requests, it behaves like qemu-img without the -d 1 option, and in
>>> that case parallel loop iterations in qcow2 don't make as much sense.
>>> Still, I think that async parallel requests are better in general
>>> than sequential ones, because if the device has some unused
>>> opportunity for parallelization, it will be utilized.
>> I agree that it probably doesn't make things worse performance-wise,
>> but it's always added complexity (see the diffstat), which is why I'm
>> just routinely asking how useful it is in practice. :-)
>>
>> Anyway, I suspect there are indeed cases where a guest doesn't send
>> many requests in parallel but it makes sense for the qcow2 driver to
>> parallelize them.  That would be mainly when the guest reads seemingly
>> sequential data that is then fragmented in the qcow2 file.  So
>> basically what your benchmark is testing. :-)
>>
>> Then, the guest could assume that there is no sense in parallelizing
>> it because the latency from the device is large enough, whereas in
>> qemu itself we always run dry and wait for different parts of the
>> single large request to finish.  So, yes, in that case, parallelization
>> that's internal to qcow2 would make sense.
>>
>> Now another question is, does this negatively impact devices where
>> seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so
>> I don't have access to an HDD to test myself...
>
> hdd (times in seconds):
>
> +-----------+-----------+----------+-----------+----------+
> | file      | wr before | wr after | rd before | rd after |
> +-----------+-----------+----------+-----------+----------+
> | seq       |    39.821 |   40.513 |    38.600 |   38.916 |
> | reverse   |    60.320 |   57.902 |    98.223 |  111.717 |
> | rand      |   614.826 |  580.452 |   672.600 |  465.120 |
> | part-rand |    52.311 |   52.450 |    37.663 |   37.989 |
> +-----------+-----------+----------+-----------+----------+
>
> Hmm, about 10% degradation in the "reverse" read case, strange magic...
> However, reverse is nearly impossible in practice.
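
To make "parallelize these loop iterations" concrete: a single fragmented
request, which qcow2 currently services extent by extent, would instead have
all of its extents in flight at once.  A minimal, QEMU-independent sketch of
the difference (Python; the Extent type and read_extent helper are invented
for illustration, with a sleep standing in for the per-fragment read latency
of the backing storage):

import asyncio
from dataclasses import dataclass

@dataclass
class Extent:
    host_offset: int   # where this piece of the guest request lives in the file
    length: int

async def read_extent(ext):
    await asyncio.sleep(0.001)          # model per-request device latency
    return b'\0' * ext.length

async def read_sequential(extents):
    # What a per-extent loop effectively does: wait for each fragment
    # before issuing the next one, so the device queue runs dry in between.
    return b''.join([await read_extent(e) for e in extents])

async def read_parallel(extents):
    # The parallelized variant: keep all fragments of the request in
    # flight at once and join the results in order.
    parts = await asyncio.gather(*(read_extent(e) for e in extents))
    return b''.join(parts)

# A 1m guest read that is split into 16 x 64k fragments:
extents = [Extent(host_offset=i * 65536, length=65536) for i in range(16)]
asyncio.run(read_parallel(extents))

With many small fragments the parallel variant keeps the device queue full,
which is consistent with the gains in the rand rows above.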
I tend to agree.  It's faster for random, and that's what matters more.
(Distinguishing between the cases in qcow2 seems like not so good of an
idea, and making it user-configurable is probably pointless because no
one will change the default.)

Max
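
For reference, the four test images could be generated along the following
lines.  This is only a hypothetical equivalent of the generation script
referenced at the end of the cover letter (the real one may differ); it
assumes qemu-img and qemu-io are in $PATH and that /ssd has enough space:

#!/usr/bin/env python3
import random
import subprocess

SIZE  = 4 * 1024 * 1024 * 1024   # 4G images
CHUNK = 64 * 1024                # 64k writes (one default cluster each)
GROUP = 1024 * 1024              # 1m regions for the part-rand pattern

def create(path):
    subprocess.run(['qemu-img', 'create', '-f', 'qcow2', path, str(SIZE)],
                   check=True)

def write_chunks(path, offsets):
    # One qemu-io invocation per chunk keeps the sketch short; batching
    # the commands would be much faster.
    for off in offsets:
        subprocess.run(['qemu-io', '-c', 'write %d %d' % (off, CHUNK), path],
                       check=True, stdout=subprocess.DEVNULL)

offsets = list(range(0, SIZE, CHUNK))

create('/ssd/t-seq.qcow2')                  # sequentially written
write_chunks('/ssd/t-seq.qcow2', offsets)

create('/ssd/t-reverse.qcow2')              # written from the end to the start
write_chunks('/ssd/t-reverse.qcow2', reversed(offsets))

create('/ssd/t-rand.qcow2')                 # fully random order
shuffled = offsets[:]
random.shuffle(shuffled)
write_chunks('/ssd/t-rand.qcow2', shuffled)

create('/ssd/t-part-rand.qcow2')            # shuffled within each 1m region
part = []
for base in range(0, SIZE, GROUP):
    region = list(range(base, base + GROUP, CHUNK))
    random.shuffle(region)
    part.extend(region)
write_chunks('/ssd/t-part-rand.qcow2', part)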