On Tue, 2020-07-28 at 11:43 +0300, Vladimir Sementsov-Ogievskiy wrote:
> 28.07.2020 02:09, Bruce Rogers wrote:
> > On Tue, 2020-07-21 at 10:22 +0200, Claudio Fontana wrote:
> > > On 7/20/20 8:24 PM, Claudio Fontana wrote:
> > > > I have now been able to reproduce this on X86 as well.
> > > >
> > > > It happens much more rarely, about once every 10 times.
> > > >
> > > > I will sort out the data and try to make it even more
> > > > reproducible, then post my findings in detail.
> > > >
> > > > Overall I proceeded as follows:
> > > >
> > > > 1) hooked the savevm code to skip all fields with the exception
> > > >    of "s390-skeys", so only s390-skeys are actually saved.
> > > >
> > > > 2) reimplemented "s390-skeys" as a common implementation in
> > > >    cpus.c, used on both x86 and s390, modeling the save/load
> > > >    behaviour of hw/s390.
> > > >
> > > > 3) ran ./check -qcow2 267 on both x86 and s390.
> > > >
> > > > In the case of s390, the failure seems to be reproducible 100%
> > > > of the time. On X86, it is, as mentioned, failing about 10% of
> > > > the time.
> > > >
> > > > Ciao,
> > > >
> > > > Claudio
> > >
> > > And here is a small series of two patches that can be used to
> > > reproduce the problem.
> > >
> > > Clearly, this is not directly related to s390, to skeys, or to
> > > icount in particular; it is just an issue that happened to be
> > > more visible there.
> > >
> > > If you could help with this, please apply the attached patches.
> > >
> > > Patch 1 just adds a new "300" iotest. It is much easier to
> > > extract the relevant part out of test 267, which does a bit too
> > > much in the same file. This also allows easier use of valgrind,
> > > since the new test does not "require" anything.
> > >
> > > Patch 2 hooks the savevm code to skip all fields during the
> > > snapshot with the exception of "s390-skeys", a new artificial
> > > field implemented to model what the real s390-skeys is doing.
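For illustration, here is a minimal sketch of what such a savevm hook
could look like. It is hypothetical, not Claudio's actual patch; it
assumes the SaveStateEntry handler list that QEMU keeps in
migration/savevm.c:

    /* Sketch only: while writing the snapshot, skip every vmstate
     * section except the artificial "s390-skeys" one, so the snapshot
     * contains nothing but that section. */
    SaveStateEntry *se;

    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (strcmp(se->idstr, "s390-skeys") != 0) {
            continue;   /* skip all other device/CPU state */
        }
        /* ... write the s390-skeys section as usual ... */
    }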
> > >
> > > After applying patch 1 and patch 2, you can test (also on X86)
> > > with:
> > >
> > > ./check -qcow2 300
> > >
> > > On X86 many runs will be successful, but a certain percentage of
> > > them will instead fail like this:
> > >
> > > claudio@linux-ch70:~/git/qemu-pristine/qemu-build/tests/qemu-iotests> ./check -qcow2 300
> > > QEMU              -- "/home/claudio/git/qemu-pristine/qemu-build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64" -nodefaults -display none -accel qtest
> > > QEMU_IMG          -- "/home/claudio/git/qemu-pristine/qemu-build/tests/qemu-iotests/../../qemu-img"
> > > QEMU_IO           -- "/home/claudio/git/qemu-pristine/qemu-build/tests/qemu-iotests/../../qemu-io" --cache writeback --aio threads -f qcow2
> > > QEMU_NBD          -- "/home/claudio/git/qemu-pristine/qemu-build/tests/qemu-iotests/../../qemu-nbd"
> > > IMGFMT            -- qcow2 (compat=1.1)
> > > IMGPROTO          -- file
> > > PLATFORM          -- Linux/x86_64 linux-ch70 4.12.14-lp151.28.36-default
> > > TEST_DIR          -- /home/claudio/git/qemu-pristine/qemu-build/tests/qemu-iotests/scratch
> > > SOCK_DIR          -- /tmp/tmp.gdcUu3l0SM
> > > SOCKET_SCM_HELPER -- /home/claudio/git/qemu-pristine/qemu-build/tests/qemu-iotests/socket_scm_helper
> > >
> > > 300      fail       [10:14:05] [10:14:06]   (last: 0s)   output mismatch (see 300.out.bad)
> > > --- /home/claudio/git/qemu-pristine/qemu/tests/qemu-iotests/300.out    2020-07-21 10:03:54.468104764 +0200
> > > +++ /home/claudio/git/qemu-pristine/qemu-build/tests/qemu-iotests/300.out.bad    2020-07-21 10:14:06.098090543 +0200
> > > @@ -12,6 +12,9 @@
> > >  ID        TAG                 VM SIZE                DATE       VM CLOCK
> > >  --        snap0                  SIZE yyyy-mm-dd hh:mm:ss   00:00:00.000
> > >  (qemu) loadvm snap0
> > > +Unexpected storage key data: 0
> > > +error while loading state for instance 0x0 of device 's390-skeys'
> > > +Error: Error -22 while loading VM state
> > >  (qemu) quit
> > >
> > > *** done
> > > Failures: 300
> > > Failed 1 of 1 iotests
> > >
> > > At this point, could somebody more knowledgeable about QCOW2,
> > > coroutines and backing files chime in?
> >
> > <trim>
> >
> > I used the reproducer you provided here to do a git bisect, as I
> > assume whatever is now broken wasn't always broken, and it pointed
> > to the following commit:
> >
> > commit df893d25ceea3c0dcbe6d6b425309317fab6b22e (refs/bisect/bad)
> > Author: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
> > Date:   Tue Jun 4 19:15:13 2019 +0300
> >
> >     block/qcow2: implement .bdrv_co_preadv_part
> >
> > Indeed, I am currently able to reliably reproduce the issue with
> > this commit applied, and not able to reproduce it without it.
>
> Do you have a reproducer on top of df893d25ceea? I tried to apply the
> patches on it, but there are a lot of conflicts, and after resolving
> them the test does not work for me.
>
> What do you mean by "not reproduce without df893d25ceea3c0d"? I also
> tried "git rebase df893d25ceea3c0d --onto df893d25ceea3c0d^" after
> applying the reproducing patches on master, but again, there are
> conflicts..
>
> How did you apply
>
> commit 88f468e54696cd9ffc28ecf96f18f04125402541
> Author: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
> Date:   Mon Sep 16 20:53:22 2019 +0300
>
>     block/qcow2: refactor qcow2_co_preadv_part
>
> and, more importantly,
>
> commit d710cf575ad5fb3ab329204620de45bfe50caa53
> Author: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
> Date:   Mon Sep 16 20:53:24 2019 +0300
>
>     block/qcow2: introduce parallel subrequest handling in read and
>     write
>
> without df893d25ceea3c0d?
>
> I can reproduce the bug on master with Claudio's two patches, but how
> can I reproduce its relation to df893d25ceea3c0d?
>
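For reference, the "Error -22" in the failing output above is -EINVAL
as returned by the section's load handler. Below is a minimal sketch of
the kind of check the artificial "s390-skeys" section could perform on
load; it is hypothetical, modeled only on the behaviour described in
this thread, and SKEYS_LEN and EXPECTED_PATTERN are made-up names:

    #include "qemu/osdep.h"
    #include "qemu/error-report.h"
    #include "migration/qemu-file.h"

    #define SKEYS_LEN         4096   /* made-up section length */
    #define EXPECTED_PATTERN  0xab   /* made-up byte written on save */

    /* Sketch only: the save side writes EXPECTED_PATTERN SKEYS_LEN
     * times; the load side verifies it.  If the snapshot read returns
     * zeroes instead of the pattern, the handler reports the bad byte
     * and returns -EINVAL, which QEMU surfaces as "Error -22 while
     * loading VM state". */
    static int skeys_load(QEMUFile *f, void *opaque, int version_id)
    {
        for (size_t i = 0; i < SKEYS_LEN; i++) {
            int key = qemu_get_byte(f);

            if (key != EXPECTED_PATTERN) {
                error_report("Unexpected storage key data: %x", key);
                return -EINVAL;
            }
        }
        return 0;
    }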
Yeah, I didn't provide full reproducer info for the point where the
patch I mention got applied. No matter; it looks like a fix has been
identified in the meantime. Thanks for the quick response, though.

Bruce