On Mon, Mar 29, 2021 at 05:43:28PM +0200, Max Reitz wrote: > On 29.03.21 17:05, Stefan Hajnoczi wrote: > > Hi Kevin, > > Peter hit an s390 qemu-iotests failure. I was able to reproduce it > > easily. I haven't checked if it reproduce on other platforms too, but it > > seems likely. Here is what I found. > > > > qsd-jobs has race conditions: > > > > # Just make sure that this doesn't crash > > $QSD --chardev stdio,id=stdio --monitor chardev=stdio \ > > --blockdev node-name=file0,driver=file,filename="$TEST_IMG" \ > > --blockdev node-name=fmt0,driver=qcow2,file=file0 <<EOF | _filter_qmp > > {"execute":"qmp_capabilities"} > > {"execute": "block-commit", "arguments": {"device": "fmt0", "job-id": > > "job0"}} > > {"execute": "quit"} > > EOF > > > > The intent is for "quit" to run while job0 still exists. This proves > > that the program does not abort with an assertion failure when "quit" is > > called while there are block jobs. bdrv_close_all() expects there to be > > no jobs. > > > > Here is the failure that reproduces 1 out of 3 times: > > > > # build/tests/qemu-iotests/check -qcow2 qsd-jobs > > QEMU -- > > "/root/qemu/build/tests/qemu-iotests/../../qemu-system-x86_64" -nodefaults > > -display none -accel qtest > > QEMU_IMG -- "/root/qemu/build/tests/qemu-iotests/../../qemu-img" > > QEMU_IO -- "/root/qemu/build/tests/qemu-iotests/../../qemu-io" > > --cache writeback --aio threads -f qcow2 > > QEMU_NBD -- "/root/qemu/build/tests/qemu-iotests/../../qemu-nbd" > > IMGFMT -- qcow2 > > IMGPROTO -- file > > PLATFORM -- Linux/s390x > > TEST_DIR -- /root/qemu/scratch > > SOCK_DIR -- /tmp/tmpapj5dydm > > SOCKET_SCM_HELPER -- > > /root/qemu/build/tests/qemu-iotests/socket_scm_helper > > > > qsd-jobs fail [10:42:23] [10:42:23] 0.1s (last: 0.1s) > > output mismatch (see qsd-jobs.out.bad) > > --- /root/qemu/tests/qemu-iotests/tests/qsd-jobs.out > > +++ qsd-jobs.out.bad > > @@ -9,11 +9,11 @@ > > {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "job0"}} > > {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "job0"}} > > {"return": {}} > > +{"return": {}} > > +{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "job0"}} > > +{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "job0"}} > > {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "job0"}} > > {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "BLOCK_JOB_READY", "data": {"device": "job0", "len": 0, "offset": > > 0, "speed": 0, "type": "commit"}} > > -{"return": {}} > > -{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "job0"}} > > -{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "job0"}} > > {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "job0"}} > > {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "job0"}} > > {"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, > > "event": "BLOCK_JOB_COMPLETED", "data": {"device": "job0", "len": 0, > > "offset": 0, "speed": 0, "type": "commit"}} > > Failures: qsd-jobs > > Failed 1 of 1 iotests > > > > The timing of the QMP events seems to be non-deterministic relative to > > the QMP command completions. > > > > It's more complicated than simply lines moving up/down in the expected > > output: the job state transitions are slightly different (running -> > > ready vs running -> paused). > > > > Should the test case filter out all JOB_* and BLOCK_JOB_* events? > > > > Is there a way to make this test deterministic? > > Would writing data to the overlay and setting a very restrictive speed help, > so it doesn’t immediately go into the 'ready' state? (See attachment.) > > Otherwise, I think we could take my “Allow using the QSD” patch > (https://lists.nongnu.org/archive/html/qemu-block/2021-03/msg00654.html) to > rewrite the patch to use common.qemu to test QSD. With that we could issue > the block-commit command, wait for BLOCK_JOB_READY, and only then issue > 'quit'.
Ideally we would have the block job completely controlled by the test. Either using blkdebug to suspend an I/O request or by having a test job that is designed for entering specific states and waiting for the test. I think both your ideas are an improvement on the current test and am happy with either. Fully controlling job execution isn't simple so I think it's okay to merge a smaller solution even if it relies on internals or is still a race condition (but a really long one!). Stefan
signature.asc
Description: PGP signature