On 28.07.20 01:09, Bruce Rogers wrote: > On Tue, 2020-07-21 at 10:22 +0200, Claudio Fontana wrote: >> On 7/20/20 8:24 PM, Claudio Fontana wrote: >>> I have now been able to reproduce this on X86 as well. >>> >>> It happens much more rarely, about once every 10 times. >>> >>> I will sort out the data and try to make it even more reproducible, >>> then post my findings in detail. >>> >>> Overall I proceeded as follows: >>> >>> 1) hooked the savevm code to skip all fields with the exception of >>> "s390-skeys". So only s390-skeys are actually saved. >>> >>> 2) reimplemented "s390-skeys" in a common implementation in cpus.c, >>> used on both x86 and s390, modeling the behaviour of save/load from >>> hw/s390 >>> >>> 3) ran ./check -qcow2 267 on both x86 and s390. >>> >>> In the case of s390, failure seems to be reproducible 100% of the >>> times. >>> On X86, it is as mentioned failing about 10% of the times. >>> >>> Ciao, >>> >>> Claudio >> >> And here is a small series of two patches that can be used to >> reproduce the problem. >> >> Clearly, this is not directly related to s390 or to skeys or to >> icount in particular, it is just an issue that happened to be more >> visible there. >> >> If you could help with this, please apply the attached patches. >> >> Patch 1 just adds a new "300" iotest. It is way easier to extract the >> relevant part out of test 267, which does a bit too much in the same >> file. >> Also this allows easier use of valgrind, since it does not "require" >> anything. >> >> Patch 2 hooks the savevm code to skip all fields during the snapshot >> with the exception of "s390-skeys", a new artificial field >> implemented to >> model what the real s390-skeys is doing. >> >> After applying patch 1 and patch 2, you can test (also on X86), with: >> >> ./check -qcow2 300 >> >> On X86 many runs will be successful, but a certain % of them will >> instead fail like this: >> >> >> claudio@linux-ch70:~/git/qemu-pristine/qemu-build/tests/qemu-iotests> >> ./check -qcow2 300 >> QEMU -- "/home/claudio/git/qemu-pristine/qemu- >> build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64" >> -nodefaults -display none -accel qtest >> QEMU_IMG -- "/home/claudio/git/qemu-pristine/qemu- >> build/tests/qemu-iotests/../../qemu-img" >> QEMU_IO -- "/home/claudio/git/qemu-pristine/qemu- >> build/tests/qemu-iotests/../../qemu-io" --cache writeback --aio >> threads -f qcow2 >> QEMU_NBD -- "/home/claudio/git/qemu-pristine/qemu- >> build/tests/qemu-iotests/../../qemu-nbd" >> IMGFMT -- qcow2 (compat=1.1) >> IMGPROTO -- file >> PLATFORM -- Linux/x86_64 linux-ch70 4.12.14-lp151.28.36-default >> TEST_DIR -- /home/claudio/git/qemu-pristine/qemu- >> build/tests/qemu-iotests/scratch >> SOCK_DIR -- /tmp/tmp.gdcUu3l0SM >> SOCKET_SCM_HELPER -- /home/claudio/git/qemu-pristine/qemu- >> build/tests/qemu-iotests/socket_scm_helper >> >> 300 fail [10:14:05] [10:14:06] (last: 0s) output >> mismatch (see 300.out.bad) >> --- /home/claudio/git/qemu-pristine/qemu/tests/qemu- >> iotests/300.out 2020-07-21 10:03:54.468104764 +0200 >> +++ /home/claudio/git/qemu-pristine/qemu-build/tests/qemu- >> iotests/300.out.bad 2020-07-21 10:14:06.098090543 +0200 >> @@ -12,6 +12,9 @@ >> ID TAG VM SIZE DATE VM >> CLOCK >> -- snap0 SIZE yyyy-mm-dd >> hh:mm:ss 00:00:00.000 >> (qemu) loadvm snap0 >> +Unexpected storage key data: 0 >> +error while loading state for instance 0x0 of device 's390-skeys' >> +Error: Error -22 while loading VM state >> (qemu) quit >> >> *** done >> Failures: 300 >> Failed 1 of 1 iotests >> >> >> At this point somebody more knowledgeable about QCOW2, coroutines and >> backing files could chime in? >> > <trim> > > I used the reproducer you provide here to do a git bisect as I assume > whatever is now broken wasn't always broken, and it pointed to the > following commit: > > commit df893d25ceea3c0dcbe6d6b425309317fab6b22e (refs/bisect/bad) > Author: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com> > Date: Tue Jun 4 19:15:13 2019 +0300 > > block/qcow2: implement .bdrv_co_preadv_part > > Indeed, I am currently able to reliable reproduce the issue with this > commit applied, and not reproduce it without it. > > That said, I've not been able to identify exactly what is going wrong. > I'm fairly confident the savevm data is correctly written out, but on > the loadvm side, somehow the last part of the s390 data is not > correctly read in the data (it's in the second pass through the while > loop in qcow2_co_preadv_part() where that happens.) > > If anyone familiar with this code can have a look or provide some > pointers, it would be much appreciated.
Thanks for both your investigation. Does the attached patch help? Max
>From 3df2df667d561dfe423e4ee5784e83943205974a Mon Sep 17 00:00:00 2001 From: Max Reitz <mre...@redhat.com> Date: Tue, 28 Jul 2020 13:02:31 +0200 Subject: [PATCH] block: Fix bdrv_aligned_preadv() with qiov_offset --- block/io.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/block/io.c b/block/io.c index b6564e34c5..bd45d739c7 100644 --- a/block/io.c +++ b/block/io.c @@ -1524,12 +1524,13 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child, assert(num); ret = bdrv_driver_preadv(bs, offset + bytes - bytes_remaining, - num, qiov, bytes - bytes_remaining, 0); + num, qiov, + qiov_offset + bytes - bytes_remaining, 0); max_bytes -= num; } else { num = bytes_remaining; - ret = qemu_iovec_memset(qiov, bytes - bytes_remaining, 0, - bytes_remaining); + ret = qemu_iovec_memset(qiov, qiov_offset + bytes - bytes_remaining, + 0, bytes_remaining); } if (ret < 0) { goto out; -- 2.26.2
signature.asc
Description: OpenPGP digital signature