> On Aug 14, 2023, at 5:42 PM, Theo Buehler <t...@theobuehler.org> wrote:
>
> On Mon, Aug 14, 2023 at 08:47:22PM +0000, Miod Vallat wrote:
>> For what it's worth, I couldn't get your test to fail on a dual-cpu
>> sun4u. Either it's a sun4v-specific issue or it needs many more cpus to
>> trigger.
>
> I can reproduce the segfault, but seemingly not the killed process on
> 16-cpu LDOM on a T4-2:
>
> cpu0 at mainbus0: SPARC-T4 (rev 0.0) @ 2847.862 MHz
>
> Segmentation fault (core dumped)
> 93
> Segmentation fault (core dumped)
> 1616
> Segmentation fault (core dumped)
> 4185
>
> etc.
>
> I don't seem to be able to reproduce on a 4-cpu M3000
>
> cpu0 at core0: FJSV,SPARC64-VII (rev 10.1) @ 2750 MHz
> cpu0: physical 64K instruction (64 b/l), 64K data (64 b/l), 5120K external
> (256 b/l)

While chatting with deraadt@ about this, he pointed out that my statement about the stack being clobbered didn’t make much sense. Looking closer at the core file data, it appears that the registers of the main thread are not correct when the process segfaults.

In the test program each thread has its own mutex and cond_var. The main thread should be using one of the per-thread mutexes and cond_vars. The core files consistently show the crash in the main thread, with a backtrace that looks like this:

Thread 1 (process 557006):
#0  0x0000005e81739078 in _rthread_mutex_timedlock (mutexp=0x5f39af5d98, trywait=0, abs=0x0, timed=0) at rthread_mutex.c:163
#1  0x0000005e8176efdc in _rthread_cond_timedwait (cond=<optimized out>, mutexp=0x5f39af5d98, abs=0xc) at rthread_cond.c:121

However, the mutexp address is not one of the per-thread mutexes. The address is not within the threads array at all:

(gdb) p &threads
$1 = (thread_t (*)[40]) 0x5c61c02058 <threads>
(gdb) p &threads[40]
$2 = (thread_t *) 0x5c61c02698

mutexp is in the i0 register. That it does not contain a correct value suggests the registers are not always correct after transitioning back to userland. Perhaps there is some sort of coherency issue?
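
To make the address check explicit, here is a minimal sketch in C of the range test I did by eye. The three addresses are copied from the gdb output above; the helper names in_threads_array() and thread_t_size() are made up for illustration and are not part of the test program or of librthread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Addresses copied from the gdb session above. */
#define THREADS_BEGIN	UINT64_C(0x5c61c02058)	/* &threads, i.e. &threads[0] */
#define THREADS_END	UINT64_C(0x5c61c02698)	/* &threads[40], one past the end */
#define NTHREADS	40			/* from the type thread_t (*)[40] */

/*
 * Hypothetical helper: true if addr falls inside the threads array,
 * treated as the half-open range [THREADS_BEGIN, THREADS_END).
 */
static bool
in_threads_array(uint64_t addr)
{
	return addr >= THREADS_BEGIN && addr < THREADS_END;
}

/* Size of one thread_t as implied by the two gdb addresses. */
static uint64_t
thread_t_size(void)
{
	return (THREADS_END - THREADS_BEGIN) / NTHREADS;
}
```

The array spans 0x640 bytes, which works out to 40 bytes per thread_t, and in_threads_array(0x5f39af5d98) is false for the mutexp value seen in both frames. This only confirms the pointer is outside the array; it says nothing about where the bad value came from.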