On Tue, Nov 07, 2023 at 12:09:36PM +0100, Claudio Jeker wrote:
> On Tue, Oct 31, 2023 at 01:18:42PM +0100, Claudio Jeker wrote:
> > On Mon, Oct 23, 2023 at 11:06:53PM +0000, Kurt Miller wrote:
> > > I experimented with adding a nanosleep after pthread_create() to
> > > see if that would resolve the segfault issue - it does, but it
> > > also exposed a new failure mode on -current. Every so often
> > > the test program would not exit now. Thinking it may be related
> > > to the detached threads I reworked the test program to use attached
> > > threads and coordinated shutdown of them with pthread_join(). These
> > > changes did not affect the new issue - every so often the main thread
> > > exits after pthread_join() has been called on the created threads and
> > > the program occasionally gets stuck with a number of threads still being
> > > reported to egdb/ps etc. These threads should all be gone now since
> > > pthread_join() has been called on all of them.
> > > 
> > > Here is the updated version of the program using attached threads and
> > > a nanosleep() work-around to the original problem along with ps and egdb
> > > output showing an example of a stuck process after the main thread exited.
> > > 
> > > ======== startup.c ==================
> > > #include <pthread.h>
> > > #include <err.h>
> > > #include <time.h>
> > > 
> > > typedef struct {
> > >     long t_num;
> > >     pthread_t t_pthread_id;
> > > 
> > >     /* sync startup/shutdown */
> > >     long t_state;
> > >     pthread_mutex_t t_mutex;
> > >     pthread_cond_t  t_cond_var;
> > > } thread_t;
> > > 
> > > #define NTHREADS 40
> > > 
> > > thread_t threads[NTHREADS];
> > > 
> > > void
> > > init_threads() {
> > >     long t_num;
> > >     for (t_num=0; t_num < NTHREADS; t_num++) {
> > >         threads[t_num].t_num = t_num;
> > >         threads[t_num].t_state = 0;
> > >         if (pthread_mutex_init(&threads[t_num].t_mutex, NULL) != 0)
> > >             err(1, "pthread_mutex_init failed");
> > > 
> > >         if (pthread_cond_init(&threads[t_num].t_cond_var, NULL) != 0)
> > >             err(1, "pthread_cond_init failed");
> > >     }
> > > }
> > > 
> > > void
> > > complete_threads() {
> > >     long t_num;
> > >     for (t_num=0; t_num < NTHREADS; t_num++) {
> > >         pthread_mutex_lock(&threads[t_num].t_mutex);
> > >         threads[t_num].t_state = 2;
> > >         pthread_cond_broadcast(&threads[t_num].t_cond_var);
> > >         pthread_mutex_unlock(&threads[t_num].t_mutex);
> > >     }
> > >     for (t_num=0; t_num < NTHREADS; t_num++) {
> > >         if (pthread_join(threads[t_num].t_pthread_id, NULL) != 0)
> > >             err(1, "pthread_join failed");
> > >     }
> > > }
> > > 
> > > void *
> > > thread_start(thread_t *thread) {
> > >     pthread_mutex_lock(&thread->t_mutex);
> > >     thread->t_state = 1;
> > >     pthread_cond_broadcast(&thread->t_cond_var);
> > >  
> > >     while (thread->t_state != 2)
> > >        pthread_cond_wait(&thread->t_cond_var, &thread->t_mutex); 
> > > 
> > >     pthread_mutex_unlock(&thread->t_mutex);
> > > 
> > >     return(NULL);
> > > }
> > > 
> > > void
> > > create_thread(thread_t *thread) {
> > >     int ret = pthread_create(&thread->t_pthread_id, NULL, (void* 
> > > (*)(void*)) thread_start, thread);
> > >  
> > >     struct timespec timeout = { 0, 1000*1000L };
> > >     nanosleep(&timeout, NULL);
> > > 
> > >     /* wait for thread startup */
> > >     pthread_mutex_lock(&thread->t_mutex);
> > >     while (thread->t_state == 0) 
> > >        pthread_cond_wait(&thread->t_cond_var, &thread->t_mutex);
> > >     pthread_mutex_unlock(&thread->t_mutex);
> > > }
> > > 
> > > int
> > > main( int argc, char *argv[] )
> > > {
> > >     long t_num;
> > > 
> > >     init_threads();
> > > 
> > >     /* startup threads */
> > >     for (t_num=0; t_num < NTHREADS; t_num++) {
> > >         create_thread(&threads[t_num]);
> > >     }
> > > 
> > >     complete_threads();
> > > 
> > >     return 0;
> > > }
> > 
> > I tested your program on an m10-1 and it shows the same behaviour.
> > The problem is that the main thread exits because of some cancelation
> > point. At least that is how it looks like:
> > 
> > Here a ktrace (main proc is 405055):
> > 
> >  25127/543740  thread   CALL 
> > futex(0xb9f26aca00,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,2147483647,0,0)
> >  25127/543740  thread   RET   futex 0
> >  25127/543740  thread   CALL
> > futex(0xb9f26aca00,0x81<FUTEX_WAIT|FUTEX_PRIVATE_FLAG>,1,0,0)
> >  25127/405055  thread   RET   nanosleep 0
> >  25127/405055  thread   CALL  
> > mmap(0,0x82000,0x3<PROT_READ|PROT_WRITE>,0x5002<MAP_PRIVATE|MAP_ANON|MAP_STACK>,-1,0)
> >  25127/405055  thread   RET   mmap 795249991680/0xb92897e000
> >  25127/405055  thread   CALL  mprotect(0xb92897e000,0x2000,0<PROT_NONE>)
> >  25127/405055  thread   RET   mprotect 0
> >  25127/405055  thread   CALL  __tfork(0xffffffcd56f172c0,24)
> >  25127/405055  thread   STRU  struct __tfork { tcb=0xb93e026e00, 
> > tid=0xb93e026e28, stack=0xb9289feda0 }
> >  25127/405055  thread   RET   __tfork 119120/0x1d150
> >  25127/119120  thread   RET   __tfork 0
> >  25127/405055  thread   CALL  nanosleep(0xffffffcd56f173b0,0)
> >  25127/405055  thread   STRU  struct timespec { 0.001000000 }
> >  25127/119120  thread   CALL  
> > futex(0xb9f26facd0,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,2147483647,0,0)
> >  25127/119120  thread   RET   futex 0
> >  25127/119120  thread   CALL  
> > futex(0xb9f26facd0,0x81<FUTEX_WAIT|FUTEX_PRIVATE_FLAG>,1,0,0)
> >  25127/405055  thread   RET   nanosleep 0
> >  25127/405055  thread   CALL  
> > kbind(0xffffffcd56f16dc8,48,0x9b2ed0bc88166e83)
> >  25127/405055  thread   RET   kbind 0
> >  25127/405055  thread   CALL  futex(0xba0cf25928,0x2<FUTEX_WAKE>,1,0,0)
> >  25127/405055  thread   RET   futex 0
> >  25127/405055  thread   CALL  __threxit(0xb940a04b68)
> > 
> > In gdb the _initial_thread has indeed quit and if I read the output
> > correctly via a cancelation point:
> > (gdb) print _initial_thread 
> > $1 = {donesem = {lock = 0 '\000', waitcount = 0, value = 1, shared = 0}, 
> >   flags = 1, flags_lock = 0 '\000', tib = 0xb940a04b40, retval = 0x1, 
> >   fn = 0x0, arg = 0x0, name = "Original thread", '\000' <repeats 16 times>, 
> >   stack = 0x0, threads = {le_next = 0x0, le_prev = 0xb93e027a98}, waiting = 
> > {
> >     tqe_next = 0x0, tqe_prev = 0x0}, blocking_cond = 0x0, attr = {
> >     stack_addr = 0x0, stack_size = 524288, guard_size = 8192, 
> >     detach_state = 0, contention_scope = 2, sched_policy = 2, sched_param = 
> > {
> >       sched_priority = 0}, sched_inherit = 4}, local_storage = 0x0, 
> >   cleanup_fns = 0x0, delayed_cancel = 0}
> > (gdb) print _initial_thread.tib
> > $2 = (tib *) 0xb940a04b40
> > (gdb) print *_initial_thread.tib
> > $3 = {tib_dtv = 0x0, tib_thread = 0xba0cf25920 <_initial_thread>, 
> >   tib_locale = 0x0, tib_errno = 0, tib_canceled = 0, tib_cancel_point = 1, 
> >   tib_cantcancel = 2, tib_tid = 0, tib_thread_flags = 2, tib_atexit = 0x0}
> > 
> > Now I have no clue about cancelation points so no idea why it happens.
> > But because of this the main thread exits before calling
> > complete_threads() and therefor the pthread_join is never called.
> > 
> > Not sure how to debug this further.
> 
> Played a bit more around. Here is what I found so far.
> So what happens is that in nanosleep() the code goes off the rails in
> LEAVE_CANCEL_POINT(1);
> 
>         ENTER_CANCEL_POINT(1);
>         ret = HIDDEN(nanosleep)(timeout, remainder);
>         LEAVE_CANCEL_POINT(1);
>         return (ret);
> 
> This code is:
> 
>       {
>               struct tib *_tib = TIB_GET();
>               int _cantcancel = (_tib)->tib_cantcancel;
>               if (_cantcancel == 0) {
>                       ...
>               }
>               ret = HIDDEN(nanosleep)(timeout, remainder);
>               if (_cantcancel == 0) {
>                       (_tib)->tib_cancel_point = 0;
>                       if (1) {
>                               __asm volatile("":::"memory");
>                               if (__predict_false((_tib)->tib_canceled))
>                                       _thread_canceled();
>                       }
>               }
>       }
>       return (ret);
> 
> After the nanosleep() call the _tib register value is altered.
> The assembly uses %l0 to store _tib but on return from the syscall this
> value is sometimes pointing to the value of &thread->t_pthread_id from the
> previous pthread_create() call. Now because of this tib_canceled is not 0
> and _thread_canceled() is called. I managed to create this condition
> numerous times and the value is always &thread->t_pthread_id of the last
> thread created.
> 
> I tried to figure out where %l0 could be set to &thread->t_pthread_id.
> Now in create_thread() the value is not stored in %l0 (IIRC it just uses
> %o0 and does not touch any local register for this value).
> In pthread_create() the value is the moved to the stack and later loaded
> into %l2 (IIRC). No other code uses this value. So there is no simple way
> for %l0 to be set to this value.
> 
> Now my theory is that somehow a window was not properly spilled or filled
> and so on return from nanosleep the kernel restores %l0 with a bad value.
> Now this does not happen all the time, so most probably it requires some
> extra condition (extra traps).
> 
> A lot of this points towards a register window error in__tfork() which
> affects only the parent process.

I think the issue is that cpu_fork() copies the u_pcb from parent to child
and with that the special user register windows that got spilled because
of some TL>0 fault. Since __tfork() uses a new stack it makes no sense
to keep those windows around and possibly fault them back from the new
thread (onto the old stack).

The following diff seems to fix the above repoducer for me. Please test.
I will need to add some comments etc there (also not sure if the bzero()
calls are needed or if clearing pcb_nsaved is enough.
-- 
:wq Claudio

Index: vm_machdep.c
===================================================================
RCS file: /cvs/src/sys/arch/sparc64/sparc64/vm_machdep.c,v
diff -u -p -r1.42 vm_machdep.c
--- vm_machdep.c        25 Oct 2022 06:05:57 -0000      1.42
+++ vm_machdep.c        15 Feb 2024 15:12:39 -0000
@@ -145,6 +145,11 @@ cpu_fork(struct proc *p1, struct proc *p
                    sizeof(struct fpstate));
        } else
                p2->p_md.md_fpstate = NULL;
+       if (stack != NULL) {
+               npcb->pcb_nsaved = 0;
+               bzero(npcb->pcb_rwsp, sizeof(npcb->pcb_rwsp));
+               bzero(npcb->pcb_rw, sizeof(npcb->pcb_rw));
+       }
 
        /*
         * Setup (kernel) stack frame that will by-pass the child

Reply via email to