On 02.05.23 at 12:03, Fiona Ebner wrote:
> On 28.04.23 at 18:54, Juan Quintela wrote:
>> Kevin Wolf <kw...@redhat.com> wrote:
>>> On 28.04.2023 at 10:38, Juan Quintela wrote:
>>>> Kevin Wolf <kw...@redhat.com> wrote:
>>>>>> I am perhaps a bit ingenuous here, but is there a way to convince
>>>>>> qemu that snapshot_save_job_bh *HAS* to run on the main thread?
>>>>>
>>>>> I believe we're talking about a technicality here. I asked another more
>>>>> fundamental question that nobody has answered yet:
>>>>>
>>>>> Why do you think that it's ok to call bdrv_writev_vmstate() without
>>>>> holding the BQL?
>>>>
>>>> I will say this function starts with bdrv_ (i.e. block layer people) and
>>>> ends with _vmstate (i.e. migration people).
>>>>
>>>> To be honest, I don't know.  That is why I _supposed_ you have an idea.
>>>
>>> My idea is that bdrv_*() can only be called when you hold the BQL, or
>>> for BlockDriverStates in an iothread the AioContext lock.
>>>
>>> Apparently dropping the BQL in migration code was introduced in Paolo's
>>> commit 9b095037527.
>>
>> Damn.  I reviewed it, so I am as guilty as the author.
>> After 10 years without problems, I will not blame that patch.
>>
>> I guess we changed something else that broke doing it without the lock.
>>
>> But no, I still don't have suggestions/ideas.
>>
> 
> I do feel like the issue might be very difficult to trigger under normal
> circumstances. Depending on the configuration and what you do in the
> guest, aio_poll in a vCPU thread does not happen often and I imagine
> snapshot-save is also not a super frequent operation for most people. It
> still takes me a while to trigger the issue by issuing lots of pflash
> writes and running snapshot-save in a loop; I'd guess about 30-60
> snapshots. Another reason might be that generated co-wrappers were less
> common in the past?
> 
>>> I'm not sure what this was supposed to improve in
>>> the case of snapshots because the VM is stopped anyway.
> 
> Is it? Quoting Juan:
>> d- snapshots are a completely different beast that don't really stop
>>    the guest in the same way at that point, and sometimes it shows in
>>    these subtle details.
> 
>>> Would anything bad happen if we removed the BQL unlock/lock section in
>>> qemu_savevm_state() again?
>>
>> Dunno.
>>
>> For what it's worth, I can say that it survives migration-test, but don't
>> ask me why/how/...
>>
>> Fiona, can you check if it fixes your troubles?
>>
> 
> Just removing the single section in qemu_savevm_state() breaks even the
> case where snapshot_save_job_bh() is executed in the main thread,
> because ram_init_bitmaps() will call qemu_mutex_lock_iothread_impl()
> which asserts that it's not already locked.
> 
> Also removing the lock/unlock pair in ram_init_bitmaps() seems to work. 
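
For reference, the two sections I removed look roughly like this (quoting
from memory, so the exact surrounding lines may differ):

    /* migration/savevm.c, in qemu_savevm_state(); presumably the section
     * from Paolo's commit 9b095037527 that Kevin mentioned: */
    qemu_mutex_unlock_iothread();   /* <- removed */
    qemu_savevm_state_setup(f);
    qemu_mutex_lock_iothread();     /* <- removed */

    /* migration/ram.c, in ram_init_bitmaps(): */
    qemu_mutex_lock_iothread();     /* <- removed */
    qemu_mutex_lock_ramlist();
    ...
    qemu_mutex_unlock_ramlist();
    qemu_mutex_unlock_iothread();   /* <- removed */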

Well, after a few more attempts, I got a new failure (running with the
two changes sketched above), but it seems to happen later and, at first
glance, doesn't seem to be related to the lock anymore:

> qemu-system-x86_64: ../hw/net/virtio-net.c:3811: virtio_net_pre_save: 
> Assertion `!n->vhost_started' failed.

> (gdb) call qemu_mutex_iothread_locked()
> $1 = true
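
The assertion is the one in virtio_net_pre_save(), so apparently the vhost
backend hasn't been stopped by the time the device state gets saved. For
reference, the function looks roughly like this (quoting from memory, might
differ slightly from the current code):

    static int virtio_net_pre_save(void *opaque)
    {
        VirtIONet *n = opaque;

        /* At this point, the backend must be stopped, otherwise
         * it might keep writing to memory. */
        assert(!n->vhost_started);

        return 0;
    }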

Backtrace:

> Thread 21 "CPU 0/KVM" received signal SIGABRT, Aborted.
> [Switching to Thread 0x7fd291ffb700 (LWP 136620)]
> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50    ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007fd2b8b3e537 in __GI_abort () at abort.c:79
> #2  0x00007fd2b8b3e40f in __assert_fail_base (fmt=0x7fd2b8cb66a8 "%s%s%s:%u: 
> %s%sAssertion `%s' failed.\n%n", assertion=0x557e0ebb7193 
> "!n->vhost_started", file=0x557e0ebb65f0 "../hw/net/virtio-net.c", line=3811, 
> function=<optimized out>) at assert.c:92
> #3  0x00007fd2b8b4d662 in __GI___assert_fail (assertion=0x557e0ebb7193 
> "!n->vhost_started", file=0x557e0ebb65f0 "../hw/net/virtio-net.c", line=3811, 
> function=0x557e0ebb7740 <__PRETTY_FUNCTION__.3> "virtio_net_pre_save") at 
> assert.c:101
> #4  0x0000557e0e6c21c9 in virtio_net_pre_save (opaque=0x557e12c480b0) at 
> ../hw/net/virtio-net.c:3811
> #5  0x0000557e0e4f63ee in vmstate_save_state_v (f=0x7fd28442f730, 
> vmsd=0x557e0efb53c0 <vmstate_virtio_net>, opaque=0x557e12c480b0, 
> vmdesc=0x7fd2840978e0, version_id=11) at ../migration/vmstate.c:330
> #6  0x0000557e0e4f6390 in vmstate_save_state (f=0x7fd28442f730, 
> vmsd=0x557e0efb53c0 <vmstate_virtio_net>, opaque=0x557e12c480b0, 
> vmdesc_id=0x7fd2840978e0) at ../migration/vmstate.c:318
> #7  0x0000557e0e51c80a in vmstate_save (f=0x7fd28442f730, se=0x557e12c4c430, 
> vmdesc=0x7fd2840978e0) at ../migration/savevm.c:1000
> #8  0x0000557e0e51d942 in qemu_savevm_state_complete_precopy_non_iterable 
> (f=0x7fd28442f730, in_postcopy=false, inactivate_disks=false) at 
> ../migration/savevm.c:1463
> #9  0x0000557e0e51db33 in qemu_savevm_state_complete_precopy 
> (f=0x7fd28442f730, iterable_only=false, inactivate_disks=false) at 
> ../migration/savevm.c:1529
> #10 0x0000557e0e51df3d in qemu_savevm_state (f=0x7fd28442f730, 
> errp=0x7fd28425e1f8) at ../migration/savevm.c:1635
> #11 0x0000557e0e520548 in save_snapshot (name=0x7fd28425e2b0 "snap0", 
> overwrite=false, vmstate=0x7fd284479c20 "scsi0", has_devices=true, 
> devices=0x7fd284097920, errp=0x7fd28425e1f8) at ../migration/savevm.c:2952
> #12 0x0000557e0e520f9b in snapshot_save_job_bh (opaque=0x7fd28425e130) at 
> ../migration/savevm.c:3251
> #13 0x0000557e0e9f94de in aio_bh_call (bh=0x7fd2844703f0) at 
> ../util/async.c:155
> #14 0x0000557e0e9f95e9 in aio_bh_poll (ctx=0x557e110ae910) at 
> ../util/async.c:184
> #15 0x0000557e0e9dd78c in aio_poll (ctx=0x557e110ae910, blocking=true) at 
> ../util/aio-posix.c:721
> #16 0x0000557e0e836ef0 in bdrv_poll_co (s=0x7fd291ff5eb0) at 
> /home/febner/repos/qemu/block/block-gen.h:43
> #17 0x0000557e0e839c0e in blk_pwrite (blk=0x557e11320f40, offset=58880, 
> bytes=512, buf=0x7fd290c0e600, flags=0) at block/block-gen.c:1650
> #18 0x0000557e0e2b2078 in pflash_update (pfl=0x557e11303b30, offset=58880, 
> size=1) at ../hw/block/pflash_cfi01.c:394
> #19 0x0000557e0e2b2749 in pflash_write (pfl=0x557e11303b30, offset=59338, 
> value=62, width=1, be=0) at ../hw/block/pflash_cfi01.c:522
> #20 0x0000557e0e2b2cda in pflash_mem_write_with_attrs (opaque=0x557e11303b30, 
> addr=59338, value=62, len=1, attrs=...) at ../hw/block/pflash_cfi01.c:681
> #21 0x0000557e0e733354 in memory_region_write_with_attrs_accessor 
> (mr=0x557e11303ef0, addr=59338, value=0x7fd291ff60c8, size=1, shift=0, 
> mask=255, attrs=...) at ../softmmu/memory.c:514
> #22 0x0000557e0e733493 in access_with_adjusted_size (addr=59338, 
> value=0x7fd291ff60c8, size=1, access_size_min=1, access_size_max=4, 
> access_fn=0x557e0e73325a <memory_region_write_with_attrs_accessor>, 
> mr=0x557e11303ef0, attrs=...) at ../softmmu/memory.c:555
> #23 0x0000557e0e7365c8 in memory_region_dispatch_write (mr=0x557e11303ef0, 
> addr=59338, data=62, op=MO_8, attrs=...) at ../softmmu/memory.c:1522
> #24 0x0000557e0e7436de in flatview_write_continue (fv=0x7fd2848bce90, 
> addr=4290832330, attrs=..., ptr=0x7fd2ba9b7028, len=1, addr1=59338, l=1, 
> mr=0x557e11303ef0) at ../softmmu/physmem.c:2641
> #25 0x0000557e0e743841 in flatview_write (fv=0x7fd2848bce90, addr=4290832330, 
> attrs=..., buf=0x7fd2ba9b7028, len=1) at ../softmmu/physmem.c:2683
> #26 0x0000557e0e743bf1 in address_space_write (as=0x557e0f3abb20 
> <address_space_memory>, addr=4290832330, attrs=..., buf=0x7fd2ba9b7028, 
> len=1) at ../softmmu/physmem.c:2779
> #27 0x0000557e0e743c5e in address_space_rw (as=0x557e0f3abb20 
> <address_space_memory>, addr=4290832330, attrs=..., buf=0x7fd2ba9b7028, 
> len=1, is_write=true) at ../softmmu/physmem.c:2789
> #28 0x0000557e0e7d7a5c in kvm_cpu_exec (cpu=0x557e11676910) at 
> ../accel/kvm/kvm-all.c:2989
> #29 0x0000557e0e7da76e in kvm_vcpu_thread_fn (arg=0x557e11676910) at 
> ../accel/kvm/kvm-accel-ops.c:51
> #30 0x0000557e0e9e2233 in qemu_thread_start (args=0x557e11553fb0) at 
> ../util/qemu-thread-posix.c:541
> #31 0x00007fd2b9b17ea7 in start_thread (arg=<optimized out>) at 
> pthread_create.c:477
> #32 0x00007fd2b8c18a2f in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
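
So the bottom half is again dispatched in the vCPU thread ("CPU 0/KVM"),
via the aio_poll() in blk_pwrite()'s generated co-wrapper (frames #12
through #17). My understanding of what the generated wrapper does, as a
simplified sketch (the real generated code lives in block/block-gen.c; the
state struct and entry function names here are made up for illustration):

    int blk_pwrite(BlockBackend *blk, int64_t offset, int64_t bytes,
                   const void *buf, BdrvRequestFlags flags)
    {
        /* PwriteArgs and pwrite_entry are placeholder names. */
        PwriteArgs s = { .blk = blk, .offset = offset, .bytes = bytes,
                         .buf = buf, .flags = flags, .done = false };
        Coroutine *co = qemu_coroutine_create(pwrite_entry, &s);

        bdrv_coroutine_enter(blk_bs(blk), co);
        /* Poll until the coroutine has finished. Each aio_poll()
         * iteration also runs pending bottom halves, so
         * snapshot_save_job_bh() can end up executing right here, in
         * whatever thread called blk_pwrite(); in the backtrace above,
         * that is the KVM vCPU thread. */
        while (!s.done) {
            aio_poll(bdrv_get_aio_context(blk_bs(blk)), true);
        }
        return s.ret;
    }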

