On 01.10.19 23:59, Richard Henderson wrote: > On 10/1/19 12:47 PM, David Hildenbrand wrote: >> On 01.10.19 21:17, Richard Henderson wrote: >>> On 10/1/19 11:16 AM, David Hildenbrand wrote: >>>> +static inline bool should_interrupt_instruction(CPUState *cs) >>>> +{ >>>> + /* >>>> + * Something asked us to stop executing chained TBs, e.g., >>>> + * cpu_interrupt() or cpu_exit(). >>>> + */ >>>> + if ((int32_t)atomic_read(&cpu_neg(cs)->icount_decr.u32) < 0) { >>>> + return true; >>>> + } >>>> + >>>> + /* We have a deliverable interrupt pending. */ >>>> + if ((atomic_read(&cs->interrupt_request) & CPU_INTERRUPT_HARD) && >>>> + s390_cpu_has_int(S390_CPU(cs))) { >>>> + return true; >>>> + } >>>> + return false; >>>> +} >>> >>> The first condition should be true whenever the second condition is true. >> >> @@ -1018,6 +1018,7 @@ static inline bool >> should_interrupt_instruction(CPUState *cs) >> /* We have a deliverable interrupt pending. */ >> if ((atomic_read(&cs->interrupt_request) & CPU_INTERRUPT_HARD) && >> s390_cpu_has_int(S390_CPU(cs))) { >> + g_assert((int32_t)atomic_read(&cpu_neg(cs)->icount_decr.u32) < 0); >> return true; >> } >> return false; >> >> >> ... >> >> >> [ 60.109761] systemd[1]: Set hostname to <rhel8>. >> ** >> ERROR:/home/dhildenb/git/qemu/target/s390x/mem_helper.c:1021:should_interrupt_instruction: >> assertion failed: ((int32_t)atomic_read(&cpu_neg(cs)->icount_decr.u32) < 0) >> >> >> A race? Roughly 20-30% pass the first but not the second check. And >> in total, on a Fedora 30 boot, I can maybe see 30 calls of >> should_interrupt_instruction() succeeding. >> >> I thought these could be pending interrupts that were not deliverable >> when injected but are now deliverable. For these, >> icount_decr.u32.high would already have been set to 0. >> >> OTOH, I guess we always exit the TB in case we change the "deliverable" state >> of an IRQ, e.g., after LPSW or LCTL. E.g., >> >> static DisasJumpType op_lctlg(DisasContext *s, DisasOps *o) >> { >> ... >> /* Exit to main loop to reevaluate s390_cpu_exec_interrupt. */ >> return DISAS_PC_STALE_NOCHAIN; >> } >> >> Maybe really a race then - or we are not properly exiting back to the >> main loop in all scenarios. > > I think that it's a race right here in should_interrupt_instruction. > > Notice, interrupt_request gets set before icount_decr. Indeed, the barrier > happens immediately before the set of icount_decr in cpu_exit(). > > (It is briefly confusing that we have a barrier in cpu_exit and not in > tcg_handle_interrupt. But that's explained by the cpu_is_self -- no need for > a > barrier for the current cpu. I also think we could usefully use > atomic_store_release instead of a separate smp_wmb.) > > Therefore checking interrupt_request after checking icount_decr violates the > ordering rules. > > This is confirmed, ish, by noticing putting a breakpoint at that second return > (or assert) and noticing that icount_decr.u16.hi == -1. It did get set by one > of the other threads, and before gdb managed to stop the world.
I am still puzzled why I can trigger that many races. I'll do some more investigation, however, if there would be a deliverable interrupt which will not set icount_decr.u16.hi == -1, then we would have to fix something that ionstead. So the second check in this patch can indeed go, thanks for clarifying! -- Thanks, David / dhildenb