On Tue, Jul 1, 2025 at 3:08 PM Martin Pieuchot <[email protected]> wrote:
>
> On 01/07/25(Tue) 14:29, K R wrote:
> > On Tue, Jul 1, 2025 at 9:07 AM Claudio Jeker <[email protected]> wrote:
> > >
> > > On Tue, Jul 01, 2025 at 04:57:14AM -0300, K R wrote:
> > > > On Mon, Jun 30, 2025 at 2:39 AM K R <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jun 27, 2025 at 3:34 AM Martin Pieuchot <[email protected]> wrote:
> > > > > >
> > > > > > On 26/06/25(Thu) 11:02, K R wrote:
> > > > > > > On Wed, Jun 25, 2025 at 1:30 PM K R <[email protected]> wrote:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > > > Hi Alexander,
> > > > > > > > >
> > > > > > > > > The good news: I can consistently reproduce the hang problem.  But the
> > > > > > > > > bad news is that even with a WITNESS kernel and kern.witness.watch=2
> > > > > > > > > (or even 3) I don't see any message or kernel panic.
> > > > > >
> > > > > > Do you mind sharing your recipe to reproduce the hang?
> > > >
> > > > Another one.  This time I've included output from 'show witness /b', below:
> > > >
> > > > pmap_tlb_shootwait: spun outStopped at  db_enter+0x14:  popq    %rbp
> > >
> > > I think the number one question here is why pmap_tlb_shootwait spun out.
> > > It is waiting for other CPUs to finish their work.
> > > So first of all, what value does tlb_shoot_wait have (x /qx tlb_shoot_wait)?
> > > Then the values of tlb_shoot_first_pcid, tlb_shoot_addr1 and
> > > tlb_shoot_addr2 are probably also of interest (use x /qx for the addrs
> > > and x /x for first_pcid).
> >
> > Hi Claudio, please see below the output from the commands you
> > requested.  If you need something else, please let me know.
>
> Thanks, could you try the diff below and see if it helps?

Hi Martin, thanks!  I've just applied your patch, built a new kernel
and now the machine is running the stress test.  I'll keep you posted.

Thanks again,
--Kor

>
> > > [...]
> > > The else case of the above if block does pmap_tlb_shootwait() but then
> > > after that a 2nd pmap_tlb_shootwait() is issued. Not sure if this is safe
> > > in all cases. Especially the first pmap_tlb_shootwait() is done with the
> > > pm_mtx locked.
>
> I believe there's an issue with holding the pm_mtx.
>
> Index: arch/amd64/amd64/pmap.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
> diff -u -p -r1.180 pmap.c
> --- arch/amd64/amd64/pmap.c     19 Jun 2025 12:01:08 -0000      1.180
> +++ arch/amd64/amd64/pmap.c     1 Jul 2025 18:06:28 -0000
> @@ -2947,7 +2947,6 @@ enter_now:
>                 if (nocache && (opte & PG_N) == 0) /* XXX impossible? */
>                         wbinvd_on_all_cpus();
>                 pmap_tlb_shootpage(pmap, va, shootself);
> -               pmap_tlb_shootwait();
>                 PTE_BASE[pl1_i(va)] = npte;
>         }
>
>
>
>
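
The concern raised above, waiting in pmap_tlb_shootwait() while pm_mtx is
still held, can be illustrated with a small userland model.  This is only a
sketch under stated assumptions, not OpenBSD code: the struct, the function
names and the blocking rule are all hypothetical, and the thread does not
confirm that this is the actual failure mechanism.  The model assumes a
remote CPU that is spinning on the same mutex cannot acknowledge the
shootdown, so the initiator's bounded wait never sees the counter reach zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; none of these names exist in the kernel. */
struct model {
	bool mtx_held_by_initiator;	/* stands in for pm_mtx */
	bool remote_wants_mtx;		/* remote CPU spins on the same mutex */
	int  shoot_wait;		/* pending acks, in the spirit of tlb_shoot_wait */
};

/*
 * One step of the remote CPU.  ASSUMPTION: while it is blocked on the
 * mutex it cannot service the shootdown request.
 */
void
remote_step(struct model *m)
{
	if (m->remote_wants_mtx && m->mtx_held_by_initiator)
		return;			/* blocked, no acknowledgement */
	if (m->shoot_wait > 0)
		m->shoot_wait--;	/* acknowledge the shootdown */
}

/* Bounded wait: true if all acks arrived, false if the wait spun out. */
bool
shootwait_bounded(struct model *m, long max_spins)
{
	assert(max_spins > 0);
	for (long i = 0; i < max_spins; i++) {
		if (m->shoot_wait == 0)
			return true;
		remote_step(m);
	}
	return m->shoot_wait == 0;
}

/* Wait either with the mutex still held or after dropping it. */
bool
run_scenario(bool wait_with_mtx_held)
{
	struct model m = {
		.mtx_held_by_initiator = true,
		.remote_wants_mtx = true,
		.shoot_wait = 1,
	};

	if (!wait_with_mtx_held)
		m.mtx_held_by_initiator = false;	/* drop the lock first */
	return shootwait_bounded(&m, 1000);
}
```

In this model run_scenario(true) spins out while run_scenario(false)
completes, which mirrors the idea of not waiting with pm_mtx held.  Whether
the real kernel behaves this way is exactly what the stress test with the
diff applied is meant to show.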
