On 2026/05/19 21:45, Alex Bennée wrote:
Alex Bennée <[email protected]> writes:

Akihiko Odaki <[email protected]> writes:

On 2026/05/19 4:35, Alex Bennée wrote:
Akihiko Odaki <[email protected]> writes:

This fixes a deadlock I previously observed with the test in [1].

However, I can no longer reproduce the issue reliably with that test, so
I used Codex, a coding agent, to write a more reliable local test case,
shown below. I applied to Codex for Open Source to get access. The test
case is not intended for merge: current policy prohibits that, and it is
probably not worth carrying anyway because race-condition tests are
inherently fragile.
What sort of hit rate were you getting with the race? So far they have
both been rock solid without the additional patches for me.

I hit the deadlock in 8 out of 10 trials.

It's taking a lot longer on my system (~1 in 100), but with these
patches I'm still seeing a hang; it just takes a lot longer to get
there.

tsan shows:

[INFO] mapping blob object resource
[INFO] resource_map_blob response is CtrlHeader { hdr_type: Command(4358), 
flags: 0, fence_id: 0, ctx_id: 0, _padding: 0 }
[INFO] unmapping blob object resource
==================
WARNING: ThreadSanitizer: data race (pid=3564641)
   Write of size 8 at 0x55c8ce6d4250 by thread T1 (mutexes: write M0, write M1):
     #0 qemu_ram_free <null> (qemu-system-aarch64+0x98f863) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #1 memory_region_destructor_ram <null> (qemu-system-aarch64+0x977046) 
(BuildId: 9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #2 memory_region_finalize <null> (qemu-system-aarch64+0x9830e5) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #3 object_unref <null> (qemu-system-aarch64+0xfa741c) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #4 object_finalize_child_property <null> (qemu-system-aarch64+0xfa765f) 
(BuildId: 9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #5 object_unref <null> (qemu-system-aarch64+0xfa73d6) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #6 flatview_destroy <null> (qemu-system-aarch64+0x978e7d) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #7 call_rcu_thread <null> (qemu-system-aarch64+0x122e268) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #8 qemu_thread_start <null> (qemu-system-aarch64+0x121cc8d) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
Previous atomic read of size 8 at 0x55c8ce6d4250 by thread T7:
     #0 qemu_ram_block_from_host <null> (qemu-system-aarch64+0x98fabb) 
(BuildId: 9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #1 qemu_ram_addr_from_host_nofail <null> (qemu-system-aarch64+0x98ff16) 
(BuildId: 9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #2 get_page_addr_code_hostp <null> (qemu-system-aarch64+0x4bbd0b) 
(BuildId: 9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #3 tb_htable_lookup <null> (qemu-system-aarch64+0x49f7bc) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #4 cpu_exec_loop <null> (qemu-system-aarch64+0x4a08a5) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #5 cpu_exec_setjmp <null> (qemu-system-aarch64+0x4a112b) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #6 cpu_exec <null> (qemu-system-aarch64+0x4a1b74) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #7 tcg_cpu_exec <null> (qemu-system-aarch64+0x4cb92b) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #8 mttcg_cpu_thread_fn <null> (qemu-system-aarch64+0x4cbe81) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #9 do_st2_mmu <null> (qemu-system-aarch64+0x4ba389) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #10 helper_stw_mmu <null> (qemu-system-aarch64+0x4bc571) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #11 <null> <null> (0x7f936faabdb2)
     #12 cpu_exec_loop <null> (qemu-system-aarch64+0x4a04fc) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #13 cpu_exec_setjmp <null> (qemu-system-aarch64+0x4a112b) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #14 cpu_loop_exit_noexc <null> (qemu-system-aarch64+0x4a2242) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #15 cpu_io_recompile <null> (qemu-system-aarch64+0x4b0a9b) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #16 do_ld_mmio_beN <null> (qemu-system-aarch64+0x4b47c9) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #17 do_ld2_mmu <null> (qemu-system-aarch64+0x4b93aa) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #18 helper_lduw_mmu <null> (qemu-system-aarch64+0x4bc0a7) (BuildId: 
9e57c19eb7cc79d8195b5fb05324859b4db6fbbc)
     #19 <null> <null> (0x7f936faab758)

<snip>

So I guess we are trying to free the memory while still running?

Probably not. qemu_ram_free() is named "free", but the RAMBlock itself is reclaimed only after an RCU grace period. So the vCPU may still observe the old RAMBlock while walking the RAMBlock list/MRU cache, and that is an expected part of the lifetime scheme.

I think TSan is more likely complaining about ram_list.mru_block. The read side uses an atomic RCU load:

    block = qatomic_rcu_read(&ram_list.mru_block);

but qemu_ram_free() clears it with a plain store:

    ram_list.mru_block = NULL;

If we want to fix this TSan report, the stores to ram_list.mru_block should be made atomic as well. In qemu_ram_free(), qatomic_set_mb() would also provide the barrier needed before updating ram_list.version, so the explicit smp_wmb() there could go away.
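
To make the pairing concrete, here is a stand-alone C11 model of the idea, not QEMU code. ModelBlock, ModelRamList, model_mru_read() and model_mru_clear() are hypothetical names invented for this sketch. The acquire load stands in for qatomic_rcu_read(), and the relaxed store followed by a full fence is only my approximation of what qatomic_set_mb() provides; the real change would of course use the QEMU macros.

```c
/* Stand-alone model of the ram_list.mru_block pairing. NOT QEMU code. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for RAMBlock. */
typedef struct ModelBlock {
    int id;
} ModelBlock;

/* Hypothetical stand-in for the relevant fields of ram_list. */
typedef struct ModelRamList {
    _Atomic(ModelBlock *) mru_block;
    atomic_uint version;
} ModelRamList;

/* Reader side: models qatomic_rcu_read(&ram_list.mru_block). */
static ModelBlock *model_mru_read(ModelRamList *list)
{
    return atomic_load_explicit(&list->mru_block, memory_order_acquire);
}

/*
 * Writer side: an atomic store followed by a full fence (assumed here to
 * approximate qatomic_set_mb()), then the version bump. Because the fence
 * already orders the store before the version update, no separate write
 * barrier is needed between the two.
 */
static void model_mru_clear(ModelRamList *list)
{
    atomic_store_explicit(&list->mru_block, (ModelBlock *)NULL,
                          memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    atomic_fetch_add_explicit(&list->version, 1, memory_order_relaxed);
}
```

With both sides atomic, TSan has no plain write left to pair with the atomic read. This only silences the report on the cache pointer; the RAMBlock lifetime is still governed by the RCU grace period as described above.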

This looks distinct from the remaining hang/deadlock, though. For that, could you collect the thread backtraces when QEMU is stuck? That should show which threads are actually waiting on each other, instead of an incidental TSan report from the RAMBlock MRU cache.
