Am 24.01.2019 um 11:32 hat Vladimir Sementsov-Ogievskiy geschrieben: > 24.01.2019 13:15, Kevin Wolf wrote: > > Am 24.01.2019 um 10:29 hat Vladimir Sementsov-Ogievskiy geschrieben: > >> 23.01.2019 18:48, Max Reitz wrote: > >>> Hi, > >>> > >>> When running 169 in parallel (e.g. like so: > >>> > >>> $ while TEST_DIR=/tmp/t0 ./check -T -qcow2 169; do; done > >>> $ while TEST_DIR=/tmp/t1 ./check -T -qcow2 169; do; done > >>> $ while TEST_DIR=/tmp/t2 ./check -T -qcow2 169; do; done > >>> $ while TEST_DIR=/tmp/t3 ./check -T -qcow2 169; do; done > >>> > >>> in four different shells), I get aborts: > >>> > >>> (Often I get segfaults, but that's because of > >>> http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg05579.html -- > >>> feel free to apply the attached patch to make them go away) > >>> > >>> > >>> WARNING:qemu:qemu received signal 6: > >>> build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64 > >>> -chardev socket,id=mon,path=/tmp/t0/tmpbX30XU/qemua-25745-monitor.sock > >>> -mon chardev=mon,mode=control -display none -vga none -qtest > >>> unix:path=/tmp/t0/qemua-25745-qtest.sock -machine accel=qtest > >>> -nodefaults -machine accel=qtest -drive > >>> if=virtio,id=drive0,file=/tmp/t0/disk_a,format=qcow2,cache=writeback > >>> .................E.. > >>> ====================================================================== > >>> ERROR: > >>> test_do_test_migration_resume_source_not_persistent__not_migbitmap > >>> (__main__.TestDirtyBitmapMigration) > >>> ---------------------------------------------------------------------- > >>> Traceback (most recent call last): > >>> File "169", line 206, in <lambda> > >>> setattr(klass, 'test_' + method + name, lambda self: mc(self)) > >>> File "169", line 113, in do_test_migration_resume_source > >>> self.check_bitmap(self.vm_a, sha256) > >>> File "169", line 72, in check_bitmap > >>> node='drive0', name='bitmap0') > >>> File "tests/qemu-iotests/../../scripts/qemu.py", line 369, in qmp > >>> return self._qmp.cmd(cmd, args=qmp_args) > >>> File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 191, in cmd > >>> return self.cmd_obj(qmp_cmd) > >>> File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 174, in > >>> cmd_obj > >>> resp = self.__json_read() > >>> File "tests/qemu-iotests/../../scripts/qmp/qmp.py", line 82, in > >>> __json_read > >>> data = self.__sockfile.readline() > >>> File "/usr/lib64/python2.7/socket.py", line 451, in readline > >>> data = self._sock.recv(self._rbufsize) > >>> error: [Errno 104] Connection reset by peer > >>> > >>> ---------------------------------------------------------------------- > >>> Ran 20 tests > >>> > >>> FAILED (errors=1) > >>> > >>> > >>> Or: > >>> > >>> WARNING:qemu:qemu received signal 6: > >>> build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64 > >>> -chardev socket,id=mon,path=/tmp/t3/tmp0pllWD/qemua-3445-monitor.sock > >>> -mon chardev=mon,mode=control -display none -vga none -qtest > >>> unix:path=/tmp/t3/qemua-3445-qtest.sock -machine accel=qtest -nodefaults > >>> -machine accel=qtest -drive > >>> if=virtio,id=drive0,file=/tmp/t3/disk_a,format=qcow2,cache=writeback > >>> WARNING:qemu:qemu received signal 6: > >>> build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64 > >>> -chardev socket,id=mon,path=/tmp/t3/tmp0pllWD/qemua-3445-monitor.sock > >>> -mon chardev=mon,mode=control -display none -vga none -qtest > >>> unix:path=/tmp/t3/qemua-3445-qtest.sock -machine accel=qtest -nodefaults > >>> -machine accel=qtest -drive > >>> if=virtio,id=drive0,file=/tmp/t3/disk_a,format=qcow2,cache=writeback > >>> > >>> ...................F > >>> ====================================================================== > >>> FAIL: test_do_test_migration_resume_source_persistent__not_migbitmap > >>> (__main__.TestDirtyBitmapMigration) > >>> ---------------------------------------------------------------------- > >>> Traceback (most recent call last): > >>> File "169", line 206, in <lambda> > >>> setattr(klass, 'test_' + method + name, lambda self: mc(self)) > >>> File "169", line 125, in do_test_migration_resume_source > >>> self.assertEqual(log, '') > >>> AssertionError: "qemu-system-x86_64: invalid runstate transition: > >>> 'running' -> 'postmigrate'\n" != '' > >>> > >>> ---------------------------------------------------------------------- > >>> Ran 20 tests > >>> > >>> FAILED (failures=1) > >>> > >>> > >>> The backtrace always goes like this: > >>> > >>> (gdb) bt > >>> #0 0x00007f0acf5cc53f in raise () at /lib64/libc.so.6 > >>> #1 0x00007f0acf5b6895 in abort () at /lib64/libc.so.6 > >>> #2 0x000055a46ebbb1a6 in runstate_set (new_state=RUN_STATE_POSTMIGRATE) > >>> at vl.c:742 > >>> #3 0x000055a46ebbb1a6 in runstate_set > >>> (new_state=new_state@entry=RUN_STATE_POSTMIGRATE) at vl.c:730 > >>> #4 0x000055a46ed39129 in migration_iteration_finish (s=0x55a4708be000) > >>> at migration/migration.c:2972 > >>> #5 0x000055a46ed39129 in migration_thread > >>> (opaque=opaque@entry=0x55a4708be000) at migration/migration.c:3130 > >>> #6 0x000055a46eea665a in qemu_thread_start (args=<optimized out>) at > >>> util/qemu-thread-posix.c:502 > >>> > >>> > >>> #7 0x00007f0acf76258e in start_thread () at /lib64/libpthread.so.0 > >>> #8 0x00007f0acf6916a3 in clone () at /lib64/libc.so.6 > >>> (gdb) frame 2 > >>> #2 0x000055a46ebbb1a6 in runstate_set (new_state=RUN_STATE_POSTMIGRATE) > >>> at vl.c:742 > >>> 742 abort(); > >>> (gdb) print current_run_state > >>> $1 = RUN_STATE_RUNNING > >>> > >>> > >>> Neither of migration or runstates are my strong suite, so I thought I'd > >>> report it before diving into it. > >>> > >>> Max > >>> > >> > >> [...] > >> 450556@1548321072.888979:migrate_set_state new state active > >> 450556@1548321072.889022:migration_thread_setup_complete > >> 450556@1548321072.988275:migrate_transferred transferred 985298 time_spent > >> 100 bandwidth 9852 max_size 2955894 > >> 450556@1548321072.988283:migration_bitmap_sync_start > >> 450556@1548321072.988293:migration_bitmap_sync_end dirty_pages 32898 > >> 450556@1548321072.988495:migration_thread_low_pending 2048 > >> migration_completion > >> 450556@1548321072.988541:runstate_set current_state 9 new_state 7 > >> 450556@1548321072.988780:migration_bitmap_sync_start > >> 450556@1548321072.988790:migration_bitmap_sync_end dirty_pages 32898 > >> 450556@1548321072.993112:migrate_global_state_pre_save saved state: running > >> [1] 450556@1548321072.993237:migrate_set_state new state completed > >> 450549@1548321072.993668:handle_qmp_command mon 0x55aebef99830 req: > >> {"execute": "x-debug-block-dirty-bitmap-sha256", "arguments": {"name": > >> "bitmap0", "node": "drive0"}} > >> 450556@1548321072.993697:migration_thread_after_loop > >> [2] 450549@1548321072.993917:handle_qmp_command mon 0x55aebef99830 req: > >> {"execute": "cont"} > >> 450476@1548321072.994461:runstate_set current_state 7 new_state 9 > >> qemu-system-x86_64: invalid runstate transition: 'running' -> 'postmigrate' > >> > >> > >> Aha, so, in test we wait for MIGRATION COMPLETED [1] and run qmp_cont [2] > >> and go to RUNNING. > >> then in migration thread we go to migration_iteration_finish() and fail to > >> go to POSTMIGRATE. > >> (note, that this testcase is about resuming source after migration) > >> > >> So, fix for test may be additionally wait for POSTMIGRATE, not only for > >> MIGRATION COMPLETION. > >> > >> But how to fix Qemu not to crash? May be, forbid some transitions > >> (FINISH_MIGRATE -> RUNNING), > >> or at least error-out qmp_cont if runstate is FINISH_MIGRATE? > > > > qmp_cont currently checks for RUN_STATE_INMIGRATE and gives it special > > treatment. As a quick hack, doing the same for RUN_STATE_FINISH_MIGRATE > > might fix the problem. > > A better simple fix would possibly include a > > whitelist of states where you can use 'cont' to avoid that the user > > messes with other internal states. > > > > However, like the problem that Dave reported a few days ago (migrating > > twice leads to a crash because we try to inactivate already inactive > > block devices), I think this is a symptom of a larger problem. We're > > not taking the state machine serious enough. One-off checks here and > > there are unlikely to ever fully solve the problem of running a command > > in a runstate in which it was never supposed to run. > > > > I wonder whether the QAPI schema should have a field 'run-states' for > > commands, and by default we would only include states where the VM has > > ownership of its resources (e.g. images are activated) and which are not > > temporary states that are automatically left, like finish-migrate. > > > > Then the default for commands is to be rejected in "unusual" runstates > > where we're not expecting user intervention, and we must explicitly > > allow them if they are okay, in fact. > > > > Instead of listing every obscure runstate, maybe we should really use > > categories of runstates instead: > > > > 1. Running > > 2. Paused, owns all resources (like disk images) > > 3. Paused, doesn't own some resources (source VM after migration > > completes, destination before migration completes) > > 4. Paused temporarily for internal reasons (e.g. finish-migrate, > > restore-vm, save-vm) > > > > Most commands should be okay with 1 and 2, but possibly not 3, and > > almost never 4. > > And then we need a mechanism to block state transition while some > operations which don't support new state are ongoing.
This essentially means not allowing those QMP commands that could cause a state transition (primarily cont/stop, but probably some more). State transitions that happen automatically should in theory always be correct, everything else is a bug and abort() is the right way to stop them (like we already do). Kevin