I'm surprised at your proposal.

If this condition gets detected, why do you think it is fine to
continue?  A kernel data structure is seriously corrupted.

Jeremie Courreges-Anglas wrote:

> On Fri, Feb 20, 2026 at 04:27:54AM -0500, Kurt Mosiejczuk wrote:
> > For a month or so I've been seeing panics when doing bulk builds
> > on sparc64. It is always of the form seen in the subject. I'm attaching
> > a representative dmesg of the LDOMs that make up the cluster along
> > with traces I've done of the crashes. I often had trouble tracing as
> > switching to another cpu would just hang. Thanks to jca for pointing
> > out that not all cpus may be in the stopped state and pointing out
> > how to avoid those. (Thus better traces as time goes on).
> > 
> > It does not seem to be a hardware issue since it has happened on
> > the LDOMs on multiple T4-1s.
> 
> > login: ctx_free: context 5422 still active
> 
> Well thanks kmos for sending this report on bugs.  It indeed doesn't
> look like a hardware issue to me.  The panic^Wdb_enter() call has been
> added by Mark in the latest pmap.c commit.  Quoting the commit
> message:
> 
>   revision 1.127
>   date: 2025/12/14 12:37:22;  author: kettenis;  state: Exp;  lines: +23 -5;  commitid: QtkG6mGBOZVl6MLw;
>   Protect the array that keeps track of which MMU contexts are in use with
>   a mutex.  Also disable the context stealing code.  It isn't mpsafe and we
>   should have more than enough MMU contexts to never need to steal one with
>   the current (hard) limites on the number of processes.
>   
>   This enables some code that checks that a context that is being freed no
>   longer has live entries in the TSB.  This code is somewhat expensive so
>   we may want to disable it again in the not too distant future.
> 
> Since this db_enter() has been plaguing kmos' latest builds to the
> point that some ports/packages were corrupted, I'd suggest that we
> disable the db_enter() now that we know that this error case can be
> hit.  I've managed to hit this code path twice, months/weeks ago, by
> building large ports on a T4-2 LDOM.  I have ideas to test but I have
> just gotten my hands back on said LDOM and right now I can't even
> reproduce.  But maybe kmos can give it a try, look for printfs and
> confirm that the system recovers when hitting such a condition.
> 
> Thoughts?  ok?
> 
> 
> Index: pmap.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
> diff -u -p -r1.127 pmap.c
> --- pmap.c    14 Dec 2025 12:37:22 -0000      1.127
> +++ pmap.c    11 Mar 2026 22:39:23 -0000
> @@ -2600,7 +2600,6 @@ ctx_free(struct pmap *pm)
>               if (TSB_TAG_CTX(tsb_dmmu[i].tag) == oldctx ||
>                   TSB_TAG_CTX(tsb_immu[i].tag) == oldctx) {
>                       printf("ctx_free: context %d still active\n", oldctx);
> -                     db_enter();
>               }
>       }
>  #endif
> 
> 
> -- 
> jca
> 
