On Wed, Jul 2, 2025 at 5:15 AM Claudio Jeker <[email protected]> wrote:
>
> > On Tue, Jul 01, 2025 at 08:08:33PM +0200, Martin Pieuchot wrote:
> > On 01/07/25(Tue) 14:29, K R wrote:
> > > On Tue, Jul 1, 2025 at 9:07 AM Claudio Jeker <[email protected]>
> > > wrote:
> > > >
> > > > On Tue, Jul 01, 2025 at 04:57:14AM -0300, K R wrote:
> > > > > On Mon, Jun 30, 2025 at 2:39 AM K R <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Jun 27, 2025 at 3:34 AM Martin Pieuchot
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On 26/06/25(Thu) 11:02, K R wrote:
> > > > > > > > On Wed, Jun 25, 2025 at 1:30 PM K R <[email protected]>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > > > Hi Alexander,
> > > > > > > > > >
> > > > > > > > > > The good news: I can consistently reproduce the hang
> > > > > > > > > > problem. But the
> > > > > > > > > > bad news is that even with a WITNESS kernel and
> > > > > > > > > > kern.witness.watch=2
> > > > > > > > > > (or even 3) I don't see any message or kernel panic.
> > > > > > >
> > > > > > > Do you mind sharing your recipe to reproduce the hang?
> > > > >
> > > > > Another one. This time I've included output from 'show witness /b',
> > > > > below:
> > > > >
> > > > > pmap_tlb_shootwait: spun out
> > > > > Stopped at db_enter+0x14: popq %rbp
> > > >
> > > > I think the number one question here is why did pmap_tlb_shootwait spin
> > > > out?
> > > > This is waiting for other CPUs to finish their work.
> > > > So first of all what value has tlb_shoot_wait (x /qx tlb_shoot_wait)
> > > > Then probably also the values of tlb_shoot_first_pcid, tlb_shoot_addr1
> > > > and
> > > > tlb_shoot_addr2 are of interest. (use x /qx for the addrs and x /x for
> > > > first_pcid).
> > >
> > > Hi Claudio, please see below the output from the commands you
> > > requested. If you need something else, please let me know.
> >
> > Thanks, could you try the diff below and see if it helps?
> >
> > > > [...]
> > > > The else case of the above if block does pmap_tlb_shootwait() but then
> > > > after that a 2nd pmap_tlb_shootwait() is issued. Not sure if this is
> > > > safe
> > > > in all cases. Especially the first pmap_tlb_shootwait() is done with the
> > > > pm_mtx locked.
> >
> > I believe there's an issue with holding the pm_mtx.
> >
> > Index: arch/amd64/amd64/pmap.c
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
> > diff -u -p -r1.180 pmap.c
> > --- arch/amd64/amd64/pmap.c 19 Jun 2025 12:01:08 -0000 1.180
> > +++ arch/amd64/amd64/pmap.c 1 Jul 2025 18:06:28 -0000
> > @@ -2947,7 +2947,6 @@ enter_now:
> >      if (nocache && (opte & PG_N) == 0) /* XXX impossible? */
> >          wbinvd_on_all_cpus();
> >      pmap_tlb_shootpage(pmap, va, shootself);
> > -    pmap_tlb_shootwait();
> >      PTE_BASE[pl1_i(va)] = npte;
> >  }
>
> Just to mention it publicly, this diff is not correct since the
> pmap_tlb_shootwait() is required to ensure that all the TLBs are updated
> to RO before installing the RW mapping by PTE_BASE[pl1_i(va)] = npte;
>
> A few people looked at this and the general consensus is that it should be
> safe (the mutex held should not disrupt the delivery of IPIs).
>
> The value of tlb_shoot_wait that you reported is 1 which indicates that
> one CPU did not respond to the IPI. Why? No idea.
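
To make the counter semantics above concrete, here is a rough userland
sketch of how a "spun out" wait with one acknowledgement outstanding can
arise. It is only a model under stated assumptions: it is single-threaded,
every name in it (sim_cpu, sim_shootpage, sim_shootwait, SIM_SPIN_LIMIT,
etc.) is hypothetical, and it is not the code in pmap.c. It only assumes
what the thread states: the initiator waits for the other CPUs, and a
leftover count of 1 means one CPU never answered the IPI.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland model; none of these names come from the kernel. */
#define SIM_NCPU	4
#define SIM_SPIN_LIMIT	1000

struct sim_cpu {
	bool responsive;	/* does this CPU service the simulated IPI? */
	bool ipi_pending;	/* has an IPI been posted to this CPU? */
};

struct sim_state {
	struct sim_cpu cpu[SIM_NCPU];
	int shoot_wait;		/* acknowledgements still outstanding */
};

/* Initiator: post an IPI to every other CPU and count the targets. */
static void
sim_shootpage(struct sim_state *s, int self)
{
	s->shoot_wait = 0;
	for (int i = 0; i < SIM_NCPU; i++) {
		if (i == self)
			continue;
		s->cpu[i].ipi_pending = true;
		s->shoot_wait++;
	}
}

/* One simulated time step: responsive CPUs acknowledge a pending IPI. */
static void
sim_step(struct sim_state *s)
{
	for (int i = 0; i < SIM_NCPU; i++) {
		struct sim_cpu *c = &s->cpu[i];

		if (c->ipi_pending && c->responsive) {
			c->ipi_pending = false;
			s->shoot_wait--;
		}
	}
}

/*
 * Initiator: spin until every target has acknowledged, but give up after
 * SIM_SPIN_LIMIT steps. Returns the leftover count (0 means success).
 */
static int
sim_shootwait(struct sim_state *s)
{
	for (int spins = 0; s->shoot_wait != 0 && spins < SIM_SPIN_LIMIT;
	    spins++)
		sim_step(s);
	return s->shoot_wait;
}

/* Run one round from CPU 0 where CPU `stuck` (or -1 for none) is silent. */
static int
sim_run(int stuck)
{
	struct sim_state s;

	for (int i = 0; i < SIM_NCPU; i++) {
		s.cpu[i].responsive = (i != stuck);
		s.cpu[i].ipi_pending = false;
	}
	sim_shootpage(&s, 0);
	return sim_shootwait(&s);
}
```

In this model sim_run(-1) returns 0, while sim_run(2) returns 1 after the
spin limit is reached, which mirrors the reported tlb_shoot_wait value of 1.
The sketch says nothing about why a real CPU would fail to answer.
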
Claudio and Martin: is there anything else I can help you with? If you
need additional ddb commands, debug kernel options, diffs to try, etc.,
please let me know.

Best,
--Kor

>
> --
> :wq Claudio
