23.01.2019 19:33, Kevin Wolf wrote:
> Am 23.01.2019 um 12:53 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 22.01.2019 21:57, Kevin Wolf wrote:
>>> Am 11.01.2019 um 12:40 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 11.01.2019 13:41, Kevin Wolf wrote:
>>>>> Am 10.01.2019 um 14:20 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>> Since 5daa74a6ebc, drv_co_block_status digs into bs->file for an
>>>>>> additional, more accurate search for holes inside regions reported
>>>>>> as DATA by bs.
>>>>>>
>>>>>> This accuracy is not free: assume we have a qcow2 disk. qcow2 already
>>>>>> knows where the holes are and where the data is, but every block_status
>>>>>> request calls lseek additionally. Assume a big disk, full of data: in
>>>>>> any iterative copying block job (or qemu-img convert) we'll call
>>>>>> lseek(SEEK_HOLE) on every iteration, and each of these lseeks will have
>>>>>> to iterate through all metadata up to the end of the file. This is
>>>>>> obviously inefficient behavior, and for many scenarios we don't need
>>>>>> this lseek at all.
>>>>>>
>>>>>> So, let's disable the behavior introduced by 5daa74a6ebc by default,
>>>>>> leaving an option to restore the previous behavior, which is needed
>>>>>> for scenarios with preallocated images.
>>>>>>
>>>>>> Add an iotest illustrating the new option's semantics.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>>>>>
>>>>> I still think that an option isn't a good solution and we should try to
>>>>> use some heuristics instead.
>>>>
>>>> Do you think that heuristics would be better than a fair cache for lseek
>>>> results?
>>>
>>> I just played a bit with this (qemu-img convert only), and how much
>>> caching lseek() results helps depends completely on the image. As it
>>> happened, my test image was the worst case where caching didn't buy us
>>> much. Obviously, I can just as easily construct an image where it makes
>>> a huge difference. I think that most real-world images should be able to
>>> take good advantage of it, though, and it doesn't hurt, so maybe that's
>>> a first thing that we can do in any case. It might not be the complete
>>> solution, though.
>>>
>>> Let me explain my test images: The case where all of this actually
>>> matters for qemu-img convert is fragmented qcow2 images. If your image
>>> isn't fragmented, we don't do lseek() a lot anyway because a single
>>> bdrv_block_status() call already gives you the information for the whole
>>> image. So I constructed a fragmented image, by writing to it backwards:
>>>
>>> ./qemu-img create -f qcow2 /tmp/test.qcow2 1G
>>> for i in $(seq 16384 -1 0); do
>>>     echo "write $((i * 65536)) 64k"
>>> done | ./qemu-io /tmp/test.qcow2
>>>
>>> It's not really surprising that caching the lseek() results doesn't help
>>> much there as we're moving backwards and lseek() only returns results
>>> about the things after the current position, not before the current
>>> position. So this is probably the worst case.
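For reference, the lseek() probing being discussed boils down to SEEK_DATA/SEEK_HOLE
queries against the image file. A minimal standalone sketch of such a probe (not the
actual file-posix code; the path is only an example):

    /* Probe which part of a file is data and which is a hole, roughly what
     * the file-posix driver does for a block_status request.  Error
     * handling is reduced to the bare minimum. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/tmp/test.qcow2";
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        off_t pos = 0;
        /* The result only describes offsets at or after 'pos'; nothing is
         * learned about the part of the file before it, which is why a
         * cache of these results cannot help a job that walks the image
         * backwards. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        off_t hole = data < 0 ? -1 : lseek(fd, data, SEEK_HOLE);
        printf("next data at %lld, next hole at %lld\n",
               (long long)data, (long long)hole);

        close(fd);
        return 0;
    }

Each such probe has to consult the filesystem's extent/metadata information, which is
where the cost comes from on filesystems that handle SEEK_DATA/SEEK_HOLE badly.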
>>>
>>> So I constructed a second image, which is fragmented, too, but starts at
>>> the beginning of the image file:
>>>
>>> ./qemu-img create -f qcow2 /tmp/test_forward.qcow2 1G
>>> for i in $(seq 0 2 16384); do
>>>     echo "write $((i * 65536)) 64k"
>>> done | ./qemu-io /tmp/test_forward.qcow2
>>> for i in $(seq 1 2 16384); do
>>>     echo "write $((i * 65536)) 64k"
>>> done | ./qemu-io /tmp/test_forward.qcow2
>>>
>>> Here caching makes a huge difference:
>>>
>>> time ./qemu-img convert -p -n $IMG null-co://
>>>
>>>                        uncached    cached
>>> test.qcow2             ~145s       ~70s
>>> test_forward.qcow2     ~110s       ~0.2s
>>
>> I'm unsure about your results; at the least, 0.2s means that we benefit
>> from cached reads, not from lseek.
>
> Yes, all reads are from the kernel page cache, this is on tmpfs.
>
> I chose tmpfs for two reasons: I wanted to get expensive I/O out of the
> way so that the lseek() performance is even visible; and tmpfs was
> reported to perform especially badly for SEEK_DATA/HOLE (which my results
> confirm). So yes, this setup really makes the lseek() calls stand out
> much more than in the common case (which makes sense when you want to
> fix the overhead introduced by them).
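To make the caching of lseek() results concrete: the idea, sketched here from the
discussion rather than taken from the actual patch, is to remember the data/hole range
returned by the last probe and answer later queries that fall into it without calling
lseek() again. A rough standalone sketch, with the image path and the 64k step only as
examples:

    /* Sketch of caching lseek(SEEK_DATA/SEEK_HOLE) results: remember the
     * range covered by the last probe and reuse it for later queries. */
    #define _GNU_SOURCE
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    typedef struct SeekCache {
        bool valid;
        off_t start;   /* offset of the last probe */
        off_t data;    /* first data byte at or after 'start' */
        off_t hole;    /* first hole byte after 'data' */
    } SeekCache;

    /* Returns true if 'offset' is data, false if it is (treated as) a hole.
     * Only calls lseek() when the cached range cannot answer the query. */
    static bool cached_is_data(SeekCache *c, int fd, off_t offset)
    {
        if (!c->valid || offset < c->start || offset >= c->hole) {
            off_t data = lseek(fd, offset, SEEK_DATA);
            if (data < 0) {
                c->valid = false;   /* no data at or after 'offset' (ENXIO) */
                return false;
            }
            off_t hole = lseek(fd, data, SEEK_HOLE);
            if (hole < 0) {
                c->valid = false;
                return data == offset;
            }
            c->valid = true;
            c->start = offset;
            c->data = data;
            c->hole = hole;
        }
        return offset >= c->data && offset < c->hole;
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/tmp/test_forward.qcow2";
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        SeekCache cache = { .valid = false };
        off_t size = lseek(fd, 0, SEEK_END);
        long data_chunks = 0;

        /* Walk the file forward in 64k steps, like an iterative copy would;
         * a forward walk gets the most out of the cache, a backward walk
         * (the test.qcow2 case) almost nothing. */
        for (off_t off = 0; off < size; off += 65536) {
            if (cached_is_data(&cache, fd, off)) {
                data_chunks++;
            }
        }
        printf("%ld of %lld 64k chunks contain data\n",
               data_chunks, (long long)((size + 65535) / 65536));

        close(fd);
        return 0;
    }

How much this helps depends entirely on access order, which matches the test.qcow2
vs. test_forward.qcow2 numbers above.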
Ok, I missed this. On the other hand, tmpfs is not a real production case.

>
>> = my results =
>>
>> in short:
>>
>> uncached read:
>> +--------------+--------+------+------+--------+
>> |              | master | you  | me   | master |
>> +--------------+--------+------+------+--------+
>> | test         | 30.4   | 32.6 | 33.9 | 32.4   |
>> | test_forward | 28.3   | 33.5 | 32.9 | 32.8   |
>> +--------------+--------+------+------+--------+
>>
>> ('you' is your patch, 'me' is my simple patch, see below)
>>
>> (I retested master, as the first test run seemed noticeably faster than
>> the patched ones.)
>> So, no significant difference can be seen (if we ignore the first run),
>> or I need a lot more test runs to calculate an average.
>> However, I don't expect any difference here; I'm afraid lseek() time only
>> becomes noticeable in comparison with read() for much larger disks.
>> Also, the problems with lseek are mostly related to lseek bugs; it seems
>> my kernel is not buggy.
>> What kernel and fs do you use to get such a significant difference between
>> cached/uncached lseek?
>
> $ uname -r
> 4.20.3-200.fc29.x86_64

I just didn't guess, or missed, that you run the tests on tmpfs.

>
>> On the other hand, the results with cached reads are more interesting:
>>
>> +--------------+--------+------+------+--------+
>> |              | master | you  | me   | master |
>> +--------------+--------+------+------+--------+
>> | test         | 0.24   | 0.20 | 0.16 | 0.24   |
>> | test_forward | 0.24   | 0.16 | 0.16 | 0.24   |
>> +--------------+--------+------+------+--------+
>>
>> And they show that my patch wins. So no lseeks = no problems.
>
> This is not surprising. You included the worst case for my patch, but
> not the worst case for your patch. If I include that, too, I get:
>
> +-----------------+------------+------------+--------------+
> |                 | master     | no lseek   | cached lseek |
> +-----------------+------------+------------+--------------+
> | on tmpfs:       |            |            |              |
> | test            | 115        | 0.16       | 58           |
> | test_forward    | 115        | 0.16       | 0.16         |
> | test_prealloc   | 0.07       | 0.65       | 0.07         |
> +-----------------+------------+------------+--------------+
> | on xfs          |            |            |              |
> | test            | 0.20       | 0.16       | 0.18         |
> | test_forward    | 0.20       | 0.16       | 0.16         |
> | test_prealloc   | 0.07       | 0.73       | 0.07         |
> +-----------------+------------+------------+--------------+
> | on xfs, -T none |            |            |              |
> | test            | 1.33       | 1.31       | 1.33         |
> | test_forward    | 1.00       | 1.00       | 1.00         |
> | test_prealloc   | 0.08       | 0.65       | 0.08         |
> +-----------------+------------+------------+--------------+
>
> test_prealloc is an empty image with metadata preallocation:
>
> $ ./qemu-img create -f qcow2 -o preallocation=metadata ~/tmp/test_prealloc.qcow2 1G
>
> So your patch is a clear winner for test on tmpfs, but also a clear
> loser for test_prealloc on all backends. Otherwise, I don't see much of
> a significant difference.

Yes, of course.

>
>> Moreover, keeping in mind that we in Virtuozzo don't have scenarios where
>> we'd benefit from lseeks, and after two slow-lseek bugs, it's definitely
>> safer for me to just keep one out-of-tree patch than to rely on the lseek
>> cache + lseek, neither of which is needed in our case.
>
> With which filesystems did you have those slow lseek bugs?

ext4. For example, the first bug was fixed by this:
https://www.spinics.net/lists/linux-ext4/msg51346.html

>
> I mean, I can understand your point that you're not using preallocation
> anyway, but for upstream, that's obviously not going to work. My
> priority is getting something that improves the situation for everyone.
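Coming back to the test_prealloc line in the table above: with preallocation=metadata,
the qcow2 layer reports the clusters as allocated while the underlying image file is
still almost entirely sparse, and it is the bs->file lseek probe that lets qemu-img
convert skip reading that unwritten space instead of reading it and zero-detecting it
(0.07s vs. 0.65s above). A quick standalone way to see how sparse such a file is (the
path is just the example from above):

    /* Show how sparse a file is: compare its apparent size with the space
     * actually allocated by the filesystem (st_blocks is in 512-byte units). */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "test_prealloc.qcow2";
        struct stat st;
        if (stat(path, &st) < 0) {
            perror("stat");
            return 1;
        }
        printf("apparent size: %lld bytes, allocated on disk: %lld bytes\n",
               (long long)st.st_size, (long long)st.st_blocks * 512);
        return 0;
    }

SEEK_DATA/SEEK_HOLE exposes the same information per range, which is what the probe
under discussion relies on.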
> If we then still need a user-visible option to squeeze out the last
> millisecond, we can talk about it. But the option can't be the primary
> solution for everyone.
>
>> Of course, the lseek cache is a good thing, and your patch seems
>> reasonable, but I'll go with my patch anyway, and I proposed an option
>> to bring such behavior to upstream, if someone wants it.
>
> Okay, if you think it makes sense, I can post it as a proper patch.
>
> Kevin

-- 
Best regards,
Vladimir