Hi.

I'm looking for guidance on how to troubleshoot a piece of software
which is spinning after calling fork(2).

I'm working on making the s2n-tls[1] library build on OpenBSD[2]. One
of the unit tests[3] does this:

1. The test framework forks the test (s2n_fork_generation_number_test)
2. The test process forks a child and calls waitpid()
3. The child creates 2 threads (pthread_create()), waits for the
threads to finish (pthread_join()), forks a grandchild, and calls
waitpid()
4. The grandchild is supposed to create 2 threads, wait for the
threads, fork a great grandchild, and call waitpid()
5. The great grandchild is supposed to create 2 threads, wait for the
threads, and exit
6. The processes exit in reverse order, and the test is finished.

During #3, the great grandchild never starts to execute and the
grandchild doesn't return from fork(). The test fails after a timeout.

ktrace[4] shows the grandchild (pid 91188) getting to the point where
it calls fork() (line 23862), but then it spins on sched_yield():

  23860  91188 s2n_fork_generation_num CALL  write(2,0x7f7ffffd4f80,0x33)
  23861  69848 ctest    CALL
kevent(4,0x7f7ffffe3d40,0,0x7f7ffffe3d40,1024,0x7f7ffffebd48)
  23862  91188 s2n_fork_generation_num GIO   fd 2 wrote 51 bytes
  23863        "s2n_unit_test_fork: calling fork(); fork_level = 1
  23864        "
  23865  69848 ctest    STRU  struct timespec { 9.914000000 }
  23866  91188 s2n_fork_generation_num RET   write 51/0x33
  23867  69848 ctest    STRU  struct kevent { ident=12,
filter=EVFILT_READ, flags=0x1<EV_ADD>, fflags=0<>, data=51, udata=0x0
}
  23868  91188 s2n_fork_generation_num CALL  sched_yield()
  23869  69848 ctest    RET   kevent 1
  23870  69848 ctest    CALL  read(12,0x6eb25d1f000,0x10000)
  23871  91188 s2n_fork_generation_num CALL  __threxit(0xc296ca59e30)
  23872  69848 ctest    GIO   fd 12 read 51 bytes
  23873        "s2n_unit_test_fork: calling fork(); fork_level = 1
  23874        "
  23875  69848 ctest    RET   read 51/0x33
  23876  91188 s2n_fork_generation_num CALL  __threxit(0xc296ca59230)
  23877  91188 s2n_fork_generation_num RET   sched_yield 0
  23878  91188 s2n_fork_generation_num CALL  sched_yield()
  23879  91188 s2n_fork_generation_num RET   sched_yield 0
[...]

btrace also doesn't show that the grandchild enters fork()

1668777953577037810 ctest (69848/127562) syscall:fork:entry (0)
1668777953577858328 ctest (69848/127562) syscall:fork:return (87140)
 # test process is pid 87140
1668777953577876672 ctest (87140/115691) syscall:fork:return (0)
1668777953586430021 s2n_fork_generation_num (87140/115691)
syscall:fork:entry (0)
1668777953586746921 s2n_fork_generation_num (87140/115691)
syscall:fork:return (62504)    # child pid 62504
1668777953586831091 s2n_fork_generation_num (62504/463506)
syscall:fork:return (0)
1668777953587202790 s2n_fork_generation_num (87140/115691)
syscall:wait4:entry (62504)
1668777953658413274 s2n_fork_generation_num (62504/463506)
syscall:fork:entry (0)
1668777953658911264 s2n_fork_generation_num (62504/463506)
syscall:fork:return (91188)    # grandchild pid 91188
1668777953658967606 s2n_fork_generation_num (91188/186182)
syscall:fork:return (0)
1668777953660934467 s2n_fork_generation_num (62504/463506)
syscall:wait4:entry (91188)
# no entry showing 91188 entering fork()

If I comment out the part of the test which creates the threads, the
test completes successfully. So some sort of locking/contention issue
in the threading code?

[1] https://github.com/aws/s2n-tls
[2] https://github.com/knightjoel/s2n-tls/tree/fix/build-on-openbsd
[3] 
https://github.com/knightjoel/s2n-tls/blob/fix/build-on-openbsd/tests/unit/s2n_fork_generation_number_test.c
[4] https://www.packetmischief.ca/files/fork_test2.txt.gz



.joel

Reply via email to