Hi. I'm looking for guidance on how to troubleshoot a piece of software which is spinning after calling fork(2).
I'm working on making the s2n-tls[1] library build on OpenBSD[2]. One of the unit tests[3] does this: 1. The test framework forks the test (s2n_fork_generation_number_test) 2. The test process forks a child and calls waitpid() 3. The child creates 2 threads (pthread_create()), waits for the threads to finish (pthread_join()), forks a grandchild, and calls waitpid() 4. The grandchild is supposed to create 2 threads, wait for the threads, fork a great grandchild, and call waitpid() 5. The great grandchild is supposed to create 2 threads, wait for the threads, and exit 6. The processes exit in reverse order, and the test is finished. During #3, the great grandchild never starts to execute and the grandchild doesn't return from fork(). The test fails after a timeout. ktrace[4] shows the grandchild (pid 91188) getting to the point where it calls fork() (line 23862), but then it spins on sched_yield(): 23860 91188 s2n_fork_generation_num CALL write(2,0x7f7ffffd4f80,0x33) 23861 69848 ctest CALL kevent(4,0x7f7ffffe3d40,0,0x7f7ffffe3d40,1024,0x7f7ffffebd48) 23862 91188 s2n_fork_generation_num GIO fd 2 wrote 51 bytes 23863 "s2n_unit_test_fork: calling fork(); fork_level = 1 23864 " 23865 69848 ctest STRU struct timespec { 9.914000000 } 23866 91188 s2n_fork_generation_num RET write 51/0x33 23867 69848 ctest STRU struct kevent { ident=12, filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=51, udata=0x0 } 23868 91188 s2n_fork_generation_num CALL sched_yield() 23869 69848 ctest RET kevent 1 23870 69848 ctest CALL read(12,0x6eb25d1f000,0x10000) 23871 91188 s2n_fork_generation_num CALL __threxit(0xc296ca59e30) 23872 69848 ctest GIO fd 12 read 51 bytes 23873 "s2n_unit_test_fork: calling fork(); fork_level = 1 23874 " 23875 69848 ctest RET read 51/0x33 23876 91188 s2n_fork_generation_num CALL __threxit(0xc296ca59230) 23877 91188 s2n_fork_generation_num RET sched_yield 0 23878 91188 s2n_fork_generation_num CALL sched_yield() 23879 91188 s2n_fork_generation_num RET sched_yield 0 [...] btrace also doesn't show that the grandchild enters fork() 1668777953577037810 ctest (69848/127562) syscall:fork:entry (0) 1668777953577858328 ctest (69848/127562) syscall:fork:return (87140) # test process is pid 87140 1668777953577876672 ctest (87140/115691) syscall:fork:return (0) 1668777953586430021 s2n_fork_generation_num (87140/115691) syscall:fork:entry (0) 1668777953586746921 s2n_fork_generation_num (87140/115691) syscall:fork:return (62504) # child pid 62504 1668777953586831091 s2n_fork_generation_num (62504/463506) syscall:fork:return (0) 1668777953587202790 s2n_fork_generation_num (87140/115691) syscall:wait4:entry (62504) 1668777953658413274 s2n_fork_generation_num (62504/463506) syscall:fork:entry (0) 1668777953658911264 s2n_fork_generation_num (62504/463506) syscall:fork:return (91188) # grandchild pid 91188 1668777953658967606 s2n_fork_generation_num (91188/186182) syscall:fork:return (0) 1668777953660934467 s2n_fork_generation_num (62504/463506) syscall:wait4:entry (91188) # no entry showing 91188 entering fork() If I comment out the part of the test which creates the threads, the test completes successfully. So some sort of locking/contention issue in the threading code? [1] https://github.com/aws/s2n-tls [2] https://github.com/knightjoel/s2n-tls/tree/fix/build-on-openbsd [3] https://github.com/knightjoel/s2n-tls/blob/fix/build-on-openbsd/tests/unit/s2n_fork_generation_number_test.c [4] https://www.packetmischief.ca/files/fork_test2.txt.gz .joel