On 08/17/2018 10:34 PM, Max Reitz wrote:
> On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
>> 16.08.2018 03:51, Max Reitz wrote:
>>> On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> Here is an asynchronous scheme for handling fragmented qcow2
>>>> reads and writes. Both the qcow2 read and write functions loop
>>>> through sequential portions of data; the series aims to
>>>> parallelize these loop iterations.
>>>>
>>>> It improves performance for fragmented qcow2 images. I've tested
>>>> it as follows:
>>>>
>>>> I have four 4G qcow2 images (with the default 64k cluster size) on
>>>> my SSD:
>>>> t-seq.qcow2       - sequentially written qcow2 image
>>>> t-reverse.qcow2   - filled by writing 64k portions from the end to the start
>>>> t-rand.qcow2      - filled by writing 64k portions (aligned) in random order
>>>> t-part-rand.qcow2 - filled by shuffling the order of 64k writes within 1m clusters
>>>> (see the source code of the image generation at the end for details)
>>>>
>>>> and the test (sequential I/O in 1m chunks):
>>>>
>>>> test write:
>>>> for t in /ssd/t-*; \
>>>>   do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
>>>>   ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
>>>> done
>>>>
>>>> test read (the same, just drop the -w parameter):
>>>> for t in /ssd/t-*; \
>>>>   do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
>>>>   ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
>>>> done
>>>>
>>>> short info about the parameters:
>>>> -w      - do writes (otherwise do reads)
>>>> -c      - count of blocks
>>>> -s      - block size
>>>> -t none - disable cache
>>>> -n      - native AIO
>>>> -d 1    - don't use the parallel requests provided by qemu-img bench itself
>>> Hm, actually, why not? And how does a guest behave?
>>>
>>> If parallel requests on an SSD perform better, wouldn't a guest issue
>>> parallel requests to the virtual device and thus to qcow2 anyway?
>> The guest knows nothing about qcow2 fragmentation, so this kind of
>> "asynchronization" can only be done at the qcow2 level.
> Hm, yes. I'm sorry, but without having looked closer at the series
> (which is why I'm sorry in advance), I would suspect that the
> performance improvement comes from us being able to send parallel
> requests to an SSD.
>
> So if you send large requests to an SSD, you may either send them in
> parallel or sequentially, it doesn't matter. But for small requests,
> it's better to send them in parallel so the SSD always has requests in
> its queue.
>
> I would think this is where the performance improvement comes from. But
> I would also think that a guest OS knows this and it would also send
> many requests in parallel so the virtual block device never runs out of
> requests.
>
>> However, if the guest does async I/O and sends a lot of parallel
>> requests, it behaves like qemu-img without the -d 1 option, and in
>> that case parallel loop iterations in qcow2 don't make as much sense.
>> Still, I think that async parallel requests are better in general than
>> sequential ones, because if the device has some unused capacity for
>> parallelism, it will be utilized.
> I agree that it probably doesn't make things worse performance-wise, but
> it's always added complexity (see the diffstat), which is why I'm just
> routinely asking how useful it is in practice. :-)
>
> Anyway, I suspect there are indeed cases where a guest doesn't send many
> requests in parallel but it makes sense for the qcow2 driver to
> parallelize it.
> That would be mainly when the guest reads seemingly sequential data
> that is then fragmented in the qcow2 file. So basically what your
> benchmark is testing. :-)
>
> Then, the guest could assume that there is no sense in parallelizing it
> because the latency from the device is large enough, whereas in qemu
> itself we always run dry and wait for different parts of the single
> large request to finish. So, yes, in that case, parallelization that's
> internal to qcow2 would make sense.
>
> Now another question is, does this negatively impact devices where
> seeking is slow, i.e. HDDs? Unfortunately I'm not home right now, so I
> don't have access to an HDD to test myself...

There are different situations and different load patterns. For example,
there are workloads where the guest executes a sequential read in a single
thread. This looks obvious and simplistic, but it certainly happens in
real life. There is also the observation that Windows guests prefer long
requests; 4 MB requests in the pipeline are not unusual.
Thus, for such a load on a scattered file the performance difference should
be very big, even on an SSD: without this, the SSD will starve for requests.
Here we are speaking in terms of latency, which will definitely be higher in
the sequential case. (A rough sketch for reproducing this is at the bottom
of this mail.)

Den

>> We've already used this approach in mirror and qemu-img convert.
> Indeed, but here you could always argue that this is just what guests
> do, so we should, too.
>
>> In Virtuozzo we also have backup improved by parallelizing the request
>> loop. I think it would be good to have some general code for such
>> things in the future.
> Well, those are different things, I'd think. Parallelization in
> mirror/backup/convert is useful not just because of qcow2 issues, but
> also because you have a volume to read from and a volume to write to, so
> that's where parallelization gives you some pipelining. And it gives
> you buffers for latency spikes, I guess.
>
> Max
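
P.S. For anyone who wants to reproduce the effect locally, here is a rough
sketch of how an image like t-reverse.qcow2 can be generated and then
benchmarked with the command from the cover letter. This is only an
illustration of the idea (64k writes issued from the end of the image
towards the start); Vladimir's actual generation script is at the end of
the cover letter, and the write pattern, the explicit cluster_size option
and the use of qemu-io here are my own choices.

# illustrative sketch only -- not the actual generation script
# create a 4G qcow2 image with the default 64k cluster size
./qemu-img create -f qcow2 -o cluster_size=64k /ssd/t-reverse.qcow2 4G

# write 64k portions from the end to the start, so data that looks
# sequential to the guest ends up fragmented in the qcow2 file
for ((off = 4 * 1024 * 1024 * 1024 - 65536; off >= 0; off -= 65536)); do
    echo "write -P 0xaa $off 64k"
done | ./qemu-io -f qcow2 /ssd/t-reverse.qcow2 > /dev/null

# sequential 1m reads with a queue depth of 1, as in the test above
sync; echo 1 > /proc/sys/vm/drop_caches
./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none /ssd/t-reverse.qcow2

Re-running the last command without -d 1 approximates a guest that already
issues many parallel requests, which is the case Max describes above.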