On Oct 25, 2023, at 4:26 AM, Claudio Jeker <clau...@openbsd.org> wrote:
>
> On Mon, Oct 23, 2023 at 11:06:53PM +0000, Kurt Miller wrote:
>> I experimented with adding a nanosleep after pthread_create() to
>> see if that would resolve the segfault issue - it does, but it
>> also exposed a new failure mode on -current. Every so often
>> the test program would not exit now. Thinking it may be related
>> to the detached threads I reworked the test program to use attached
>> threads and coordinated shutdown of them with pthread_join(). These
>> changes did not affect the new issue - every so often the main thread
>> exits after pthread_join() has been called on the created threads and
>> the program occasionally gets stuck with a number of threads still being
>> reported by egdb/ps etc. These threads should all be gone now since
>> pthread_join() has been called on all of them.
>>
>> Here is the updated version of the program using attached threads and
>> a nanosleep() work-around to the original problem along with ps and egdb
>> output showing an example of a stuck process after the main thread exited.
>
> This is very strange. So pthread_join() returned without an error for all
> threads but you still have 25 threads sitting in pthread_cond_wait() on
> line 55. How is this possible?
I don’t know. It is very strange.

> Is there some issue with futex on sparc64?
> Could you try a build of libpthread without FUTEX support? I think you need
> to adjust lib/librthread/Makefile and lib/libc/thread/Makefile.inc and add
> sparc64 to the list of archs with hppa, m88k and sh.

I tried this and confirmed with ktrace that futex was no longer being
called. The program still occasionally gets stuck in the same way. egdb of
the stuck process shows no main thread and a number of pthreads sitting in
pthread_cond_wait().

I took a look at our implementation of spin locks and The SPARC Architecture
Manual Version 9:

https://www.cs.utexas.edu/users/novak/sparcv9.pdf

Page 351 shows an example of spin locks using ldstub. Notably, the example
differs slightly from our implementation. The membar used after taking the
lock is #LoadLoad | #LoadStore, whereas we use #StoreStore | #StoreLoad.
This is outside my expertise. Could this difference be a problem?

Disassembly from libc.so:

000000000000f3c0 <_spinlock>:
    f3c0:	9d e3 bf 30 	save  %sp, -208, %sp
    f3c4:	10 68 00 04 	b  %xcc, f3d4 <_spinlock+0x14>
    f3c8:	01 00 00 00 	nop
    f3cc:	40 01 bd b5 	call  7eaa0 <_thread_sys_sched_yield>
    f3d0:	01 00 00 00 	nop
    f3d4:	40 00 15 13 	call  14820 <_atomic_lock>
    f3d8:	90 10 00 18 	mov  %i0, %o0
    f3dc:	80 a2 20 00 	cmp  %o0, 0
    f3e0:	12 4f ff fb 	bne  %icc, f3cc <_spinlock+0xc>
    f3e4:	01 00 00 00 	nop
    f3e8:	81 43 e0 0a 	membar  #StoreStore|#StoreLoad
    f3ec:	81 cf e0 08 	rett  %i7 + 8
    f3f0:	01 00 00 00 	nop
    f3f4:	30 68 00 03 	b,a   %xcc, f400 <_spinlock+0x40>
    f3f8:	01 00 00 00 	nop
    f3fc:	01 00 00 00 	nop
    f400:	81 c3 e0 08 	retl
    f404:	ae 03 c0 17 	add  %o7, %l7, %l7
    f408:	30 68 00 06 	b,a   %xcc, f420 <unsetenv>
    f40c:	01 00 00 00 	nop
    f410:	01 00 00 00 	nop
    f414:	01 00 00 00 	nop
    f418:	01 00 00 00 	nop
    f41c:	01 00 00 00 	nop

0000000000014820 <_atomic_lock>:
    14820:	c2 6a 00 00 	ldstub  [ %o0 ], %g1
    14824:	82 08 60 ff 	and  %g1, 0xff, %g1
    14828:	9c 03 bf 30 	add  %sp, -208, %sp
    1482c:	82 18 60 ff 	xor  %g1, 0xff, %g1
    14830:	9c 23 bf 30 	sub  %sp, -208, %sp
    14834:	80 a0 00 01 	cmp  %g0, %g1
    14838:	81 c3 e0 08 	retl
    1483c:	90 60 3f ff 	subc  %g0, -1, %o0

000000000000f160 <_spinunlock>:
    f160:	9c 03 bf 30 	add  %sp, -208, %sp
    f164:	81 43 e0 0c 	membar  #StoreStore|#LoadStore
    f168:	c0 2a 00 00 	clrb  [ %o0 ]
    f16c:	81 c3 e0 08 	retl
    f170:	9c 23 bf 30 	sub  %sp, -208, %sp
    f174:	30 68 00 03 	b,a   %xcc, f180 <pthread_equal>
    f178:	01 00 00 00 	nop
    f17c:	01 00 00 00 	nop