On Mon, Apr 13, 2020 at 09:05:15PM -0700, Nathan Chancellor wrote:
> On Tue, Apr 14, 2020 at 12:05:53PM +1000, David Gibson wrote:
> > On Sat, Apr 11, 2020 at 11:57:23PM +1000, Nicholas Piggin wrote:
> > > Nicholas Piggin's on April 11, 2020 7:32 pm:
> > > > Nathan Chancellor's on April 11, 2020 10:53 am:
> > > >> The tt.config values are needed to reproduce but I did not verify that
> > > >> ONLY tt.config was needed. Other than that, no, we are just building
> > > >> either pseries_defconfig or powernv_defconfig with those configs and
> > > >> letting it boot up with a simple initramfs, which prints the version
> > > >> string then shuts the machine down.
> > > >>
> > > >> Let me know if you need any more information, cheers!
> > > >
> > > > Okay I can reproduce it. Sometimes it eventually recovers after a long
> > > > pause, and some keyboard input often helps it along. So that seems like
> > > > it might be a lost interrupt.
> > > >
> > > > POWER8 vs POWER9 might just be a timing thing if P9 is still hanging
> > > > sometimes. I wasn't able to reproduce it with defconfig+tt.config, I
> > > > needed your other config with various other debug options.
> > > >
> > > > Thanks for the very good report. I'll let you know what I find.
> > >
> > > It looks like a qemu bug. Booting with '-d int' shows the decrementer
> > > simply stops firing at the point of the hang, even though MSR[EE]=1 and
> > > the DEC register is wrapping. Linux appears to be doing the right thing
> > > as far as I can tell (not losing interrupts).
> > >
> > > This qemu patch fixes the boot hang for me. I don't know that qemu
> > > really has the right idea of "context synchronizing" as defined in the
> > > powerpc architecture -- mtmsrd L=1 is not context synchronizing but that
> > > does not mean it can avoid looking at exceptions until the next such
> > > event. It looks like the decrementer exception goes high but the
> > > execution of mtmsrd L=1 is ignoring it.
> > >
> > > Prior to the Linux patch 3282a3da25b you bisected to, interrupt replay
> > > code would return with an 'rfi' instruction as part of interrupt return,
> > > which probably helped to get things moving along a bit. However it would
> > > not be foolproof, and Cedric did say he encountered some mysterious
> > > lockups under load with qemu powernv before that patch was merged, so
> > > maybe it's the same issue?
> > >
> > > Thanks,
> > > Nick
> > >
> > > The patch is a bit of a hack, but if you can run it and verify it fixes
> > > your boot hang would be good.
> >
> > So a bug in this handling wouldn't surprise me at all. However a
> > report against QEMU 3.1 isn't particularly useful.
> >
> > * Does the problem occur with current upstream master qemu?
>
> Yes, I can reproduce the hang on 5.0.0-rc2.
Ok. Nick, can you polish up your fix shortly and submit upstream in
the usual fashion?

> > * Does the problem occur with qemu-2.12 (a pretty widely deployed
> > "stable" qemu, e.g. in RHEL)?
>
> No idea but I would assume so. I might have time later this week to test
> but I assume it is kind of irrelevant if it is reproducible at ToT.
>
> > > ---
> > >
> > > diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> > > index b207fb5386..1d997f5c32 100644
> > > --- a/target/ppc/translate.c
> > > +++ b/target/ppc/translate.c
> > > @@ -4364,12 +4364,21 @@ static void gen_mtmsrd(DisasContext *ctx)
> > >      if (ctx->opcode & 0x00010000) {
> > >          /* Special form that does not need any synchronisation */
> > >          TCGv t0 = tcg_temp_new();
> > > +        TCGv t1 = tcg_temp_new();
> > >          tcg_gen_andi_tl(t0, cpu_gpr[rS(ctx->opcode)],
> > >                          (1 << MSR_RI) | (1 << MSR_EE));
> > > -        tcg_gen_andi_tl(cpu_msr, cpu_msr,
> > > +        tcg_gen_andi_tl(t1, cpu_msr,
> > >                          ~(target_ulong)((1 << MSR_RI) | (1 << MSR_EE)));
> > > -        tcg_gen_or_tl(cpu_msr, cpu_msr, t0);
> > > +        tcg_gen_or_tl(t1, t1, t0);
> > > +
> > > +        gen_update_nip(ctx, ctx->base.pc_next);
> > > +        gen_helper_store_msr(cpu_env, t1);
> > >          tcg_temp_free(t0);
> > > +        tcg_temp_free(t1);
> > > +        /* Must stop the translation as machine state (may have) changed */
> > > +        /* Note that mtmsr is not always defined as context-synchronizing */
> > > +        gen_stop_exception(ctx);
> > > +
> > >      } else {
> > >          /*
> > >           * XXX: we need to update nip before the store if we enter
> > >
> > >

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
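
For readers following the quoted patch, here is a rough standalone sketch of
the behaviour under discussion: mtmsrd with L=1 copies only MSR[EE] and
MSR[RI] from the source register, and a decrementer exception that is
already pending should be noticed as soon as EE becomes 1. This is not QEMU
code. The struct toy_cpu type and the mtmsrd_l1_value() and toy_mtmsrd_l1()
functions are hypothetical names made up for illustration, and the bit
positions for MSR_EE and MSR_RI are assumed to match the definitions the
diff relies on.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions assumed to match the MSR_EE / MSR_RI used in the quoted diff. */
#define MSR_EE 15 /* external interrupt enable */
#define MSR_RI 1  /* recoverable interrupt */

/* The only bits that mtmsrd L=1 is allowed to change. */
#define MTMSRD_L1_MASK ((UINT64_C(1) << MSR_RI) | (UINT64_C(1) << MSR_EE))

/* Hypothetical toy state for illustration; not QEMU's CPU state structure. */
struct toy_cpu {
    uint64_t msr;
    bool dec_pending; /* decrementer exception line is currently high */
};

/* New MSR value: EE and RI come from rs, every other bit is preserved. */
static uint64_t mtmsrd_l1_value(uint64_t old_msr, uint64_t rs)
{
    return (old_msr & ~MTMSRD_L1_MASK) | (rs & MTMSRD_L1_MASK);
}

/*
 * Apply mtmsrd L=1 to the toy state. Returns true when a pending
 * decrementer interrupt should be taken right after this instruction,
 * i.e. the exception is pending and EE is set in the updated MSR.
 */
static bool toy_mtmsrd_l1(struct toy_cpu *cpu, uint64_t rs)
{
    cpu->msr = mtmsrd_l1_value(cpu->msr, rs);
    return cpu->dec_pending && (cpu->msr & (UINT64_C(1) << MSR_EE)) != 0;
}
```

In this toy model the hang described above corresponds to updating the MSR
but never acting on a true return value; the quoted patch instead routes the
update through the MSR store helper and stops translation so that pending
exceptions get looked at.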