Hello Tomas,

01.09.2023 16:00, Tomas Vondra wrote:
Hmmm, I'm not very good at reading the binary code, but here's what objdump produced for WaitEventSetWait. Maybe someone will see what the issue is.
At first glance, I can't see anything suspicious in the disassembly. IIUC, waiting = true is presented there as:

  805c38: b902ad18  str   w24, [x8, #684]   // pgstat_report_wait_start(): proc->wait_event_info = wait_event_info;
                                            // end of pgstat_report_wait_start(wait_event_info);
  805c3c: b0ffdb09  adrp  x9, 0x366000 <dsm_segment_address+0x24>
  805c40: b0ffdb0a  adrp  x10, 0x366000 <dsm_segment_address+0x28>
  805c44: f0000eeb  adrp  x11, 0x9e4000 <PMSignalShmemInit+0x4>
  805c48: 52800028  mov   w8, #1            // true
  805c4c: 52800319  mov   w25, #24
  805c50: 5280073a  mov   w26, #57
  805c54: fd446128  ldr   d8, [x9, #2240]
  805c58: 90000d7b  adrp  x27, 0x9b1000 <ModifyWaitEvent+0xb0>
  805c5c: fd415949  ldr   d9, [x10, #688]
  805c60: f9071d68  str   x8, [x11, #3640]  // waiting = true (x8 = w8)

So there are two simple mov's and two load operations performed in parallel, but I don't think it's similar to what we had in that case.
I thought about maybe just adding the barrier in the code, but then how would we know it's the issue and this fixed it? It happens so rarely we can't make any conclusions from a couple runs of tests.
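For reference, the general pattern that such a barrier protects could be sketched as below. This is only an illustration in C11 atomics, not the PostgreSQL code: the variables waiting and latch_set and all three functions are hypothetical stand-ins, and whether a missing barrier is actually involved here is exactly what remains unknown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are NOT the PostgreSQL variables. */
static atomic_bool waiting;
static atomic_bool latch_set;

/* Reset both flags to their initial state (illustration only). */
static void
reset_flags(void)
{
    atomic_store(&waiting, false);
    atomic_store(&latch_set, false);
}

/*
 * Waiter side: announce the intent to sleep, then recheck the latch.
 * Returns true if the latch was not set, i.e. sleeping is allowed.
 */
static bool
waiter_may_sleep(void)
{
    atomic_store_explicit(&waiting, true, memory_order_relaxed);
    /* Full fence: order the store above before the load below. */
    atomic_thread_fence(memory_order_seq_cst);
    return !atomic_load_explicit(&latch_set, memory_order_relaxed);
}

/*
 * Setter side: set the latch, then check whether anyone announced a wait.
 * Returns true if a wakeup has to be sent.
 */
static bool
setter_must_wake(void)
{
    atomic_store_explicit(&latch_set, true, memory_order_relaxed);
    /* Matching full fence on the other side of the handshake. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&waiting, memory_order_relaxed);
}
```

With both fences in place, at least one side is expected to observe the other's store, so the waiter either skips the sleep or gets a wakeup. Without them, a weakly ordered CPU may let each side miss the other's store, which would look like a rare lockup.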
Probably I could construct a reproducer for the lockup if I had access to such a machine for a day or two.

Best regards,
Alexander