Hi misc@,

I'm currently working on an OCaml library[1] that runs tasks on
multiple cores. It spawns a bunch of worker threads, and the main
thread is responsible for scheduling tasks onto them. The tasks
usually yield to the main thread just before doing I/O, and the main
thread uses select(2) (for now) to know which pending tasks to
reschedule. IIUC, the main thread is also responsible for handling
signals.
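
For context, the wait step of such a main loop boils down to something
like the C sketch below. This is only an illustration of the select(2)
pattern, not code from the library (which is OCaml); wait_readable is a
hypothetical helper. A zero timeout polls without blocking, while a NULL
timeout blocks until a descriptor is ready or a signal interrupts the
call.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not part of any library mentioned here).
 * Waits for fd to become readable.
 *   block == 0: poll once with a zero timeout and return immediately.
 *   block != 0: pass a NULL timeout, i.e. block until fd is readable
 *               or the call is interrupted by a signal.
 * Returns 1 if fd is readable, 0 if nothing is ready (non-blocking
 * mode only), -1 if select(2) failed with EINTR, -2 on any other error.
 */
static int wait_readable(int fd, int block)
{
    fd_set readfds;
    struct timeval zero = {0, 0};
    int rc;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    rc = select(fd + 1, &readfds, NULL, NULL, block ? NULL : &zero);
    if (rc == -1)
        return errno == EINTR ? -1 : -2;
    return (rc > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

In a scheduler loop, a return value of -1 would typically be the point
where pending signal handlers get a chance to run before the wait is
retried.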

I have a problem when running a dummy HTTP server that uses this
library[2]: when I send it a SIGINT during the main loop, it should
die immediately. Instead, it waits for one more request and dies
only after answering it. I tried setting a breakpoint inside the
main loop, where select(2) is blocking:

$ egdb hurl.srv
# [...]
(gdb) run 127.0.0.1:8080
Starting program: /home/user/.opam/miou/bin/hurl.srv 127.0.0.1:8080
^C[New thread 460857 of process 5725]
[New thread 431086 of process 5725]
[New thread 184835 of process 5725]

Thread 2 received signal SIGINT, Interrupt.
[Switching to thread 460857 of process 5725]
futex () at /tmp/-:2
warning: 2      /tmp/-: No such file or directory

(gdb) info threads
  Id   Target Id                     Frame
  1    thread 338321 of process 5725 _thread_sys_select () at /tmp/-:2
* 2    thread 460857 of process 5725 futex () at /tmp/-:2
  3    thread 431086 of process 5725 futex () at /tmp/-:2
  4    thread 184835 of process 5725 futex () at /tmp/-:2

(gdb) thread apply 1 bt

Thread 1 (thread 338321 of process 5725):
#0  _thread_sys_select () at /tmp/-:2
#1  0xc4987c91079e9fd8 in ?? ()
#2  0x000004def4d99ce2 in _libc_select_cancel (nfds=8,
readfds=0x78483c5bd4a0, writefds=0x78483c5bd520,
exceptfds=0x4def4d8fb2b <_thread_sys_select+27>, timeout=0x0) at
/usr/src/lib/libc/sys/w_select.c:28
#3  0x000004dc1c6cc29d in caml_unix_select (readfds=5356133755072,
writefds=1, exceptfds=1, timeout=<optimized out>) at select_unix.c:91
#4  <signal handler called>
#5  0x000004dc1c5874ab in camlMiou_unix.select_1434 ()
#6  0x000004dc1c5931de in camlMiou.unblock_awaits_with_system_events_2039 ()
#7  0x000004dc1c593583 in camlMiou.run_2061 ()
#8  0x000004dc1c598ec5 in camlMiou.run_inner_5845 ()
#9  0x000004dc1c5ed488 in camlCmdliner_term.fun_662 ()
#10 0x000004dc1c5f19f9 in camlCmdliner_eval.run_parser_589 ()
#11 0x000004dc1c5f2bed in camlCmdliner_eval.eval_value_inner_1728 ()
#12 0x000004dc1c5f33d0 in camlCmdliner_eval.eval_1479 ()
#13 0x000004dc1c43f99e in camlDune__exe__Srv.entry ()
#14 0x000004dc1c437487 in caml_startup.code_begin ()
#15 <signal handler called>
#16 0x000004dc1c7121cc in caml_startup_common (argv=0x78483c5bd7c8,
pooling=<optimized out>) at runtime/startup_nat.c:127
#17 0x000004dc1c71227d in caml_startup_exn (argv=0x8) at
runtime/startup_nat.c:134
#18 caml_startup (argv=0x8) at runtime/startup_nat.c:139
#19 caml_main (argv=0x8) at runtime/startup_nat.c:146
#20 0x000004dc1c6f4f90 in main (argc=<optimized out>,
argv=0x78483c5bd4a0) at runtime/main.c:37

At startup, hurl.srv does a bunch of non-blocking calls to select(2)
(timeout set to {0}) which, after a bit of gdb stepping, do not
seem to switch the currently executing thread. It's on the last
select(2) call (which is blocking, timeout set to NULL, since it's
entering the main loop) that stepping past _thread_sys_select causes
thread switching (in the gdb trace above: Id 2, a worker thread).
pthreads(3) mentions that "Signal handlers are normally run on the
stack of the currently executing thread", so my understanding is
that my SIGINT isn't handled by the right thread (which should be
the main one).
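
To make that hypothesis concrete: POSIX allows a process-directed signal
such as SIGINT to be delivered to any thread that does not have it
blocked. The C sketch below shows the usual way to pin delivery to the
main thread, assuming the workers can be created with SIGINT masked. All
function names are hypothetical and belong neither to miou nor to the
OCaml runtime; whether the runtime's own threads can be set up this way
is a separate question.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler so the main loop can notice the signal. */
static volatile sig_atomic_t got_sigint = 0;

static void on_sigint(int signo)
{
    (void)signo;
    got_sigint = 1;
}

/* Hypothetical worker body: just sleeps, standing in for real work. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

/*
 * Install the SIGINT handler without SA_RESTART, so that a blocking
 * select(2) in the handling thread fails with EINTR instead of being
 * restarted.
 */
static int install_sigint_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGINT, &sa, NULL);
}

/*
 * Create a worker that has SIGINT blocked. A new thread inherits the
 * signal mask of its creator, so SIGINT is blocked in the caller just
 * for the duration of pthread_create() and the old mask is restored
 * afterwards. Returns 0 on success, an error number otherwise.
 */
static int spawn_worker_sigint_blocked(pthread_t *tid)
{
    sigset_t block, saved;
    int rc;

    sigemptyset(&block);
    sigaddset(&block, SIGINT);

    rc = pthread_sigmask(SIG_BLOCK, &block, &saved);
    if (rc != 0)
        return rc;
    rc = pthread_create(tid, NULL, worker, NULL);
    pthread_sigmask(SIG_SETMASK, &saved, NULL);
    return rc;
}
```

With every worker created this way, the main thread is the only one
left with SIGINT unblocked, so a ^C sent to the process has to be
delivered there.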

I'm getting the same kind of behavior when using versions of this
library that use poll(2), ppoll(2) or kqueue(2) instead of select(2):
the current thread switches from the main thread to a worker thread
when the main thread blocks on the syscall. On Linux, the same gdb
session (breaking inside the main loop) shows that the current thread
is still the main thread (waiting on select(2)), and it behaves as I
would expect (dies when receiving SIGINT).

I'm having a hard time understanding what's happening. I've looked
at src/lib/libc/sys/w_{poll,select}.c; maybe this has something to
do with {ENTER,LEAVE}_CANCEL_POINT / DEF_CANCEL? Any insight or
pointer would be greatly appreciated.

Cheers,

Léo

[1] https://github.com/robur-coop/miou
[2] https://github.com/robur-coop/hurl
