Hi Marc, First of all, thanks for the great library, the Mesos project has been using heavily in production for many years. Recently, we've encountered a rare issue in which SIGPIPE occurs during a call to ev_async_send.
The environment is CentOS 5 (no eventfd, and epolll backend). The call to write() against evpipe[1] in evpipe_write triggers a SIGPIPE, indicating that evpipe[0] is closed. The bug is rare (on the order of 10 occurrences per day on production clusters of tens of thousands of servers). We initially suspected that there was a rogue close in the non-libev code (e.g. double close on an fd) that was accidentally closing evpipe[0] due to fd reuse. However, having ruled out bad close calls within our code, I started digging into libev and noticed what appears to be the bug. Consider the following situation: Thread1 (in ev_run): backend_poll -> epoll_poll -> sets postfork = 1 Thread1 (in ev_run): return from backend_poll, but before setting pipe_write_wanted = 0 Thread2 (in ev_async_send): evpipe_write -> sees pipe_write_wanted = 1 and enters the if block Thread1 (in ev_run): comes back to top of the loop and checks postfork, calls loop_fork and closes evpipe[0] but does not yet call evpipe_init Thread2 (in ev_async_send): evpipe_write continues and calls write() -> SIGPIPE To test this theory, we inserted usleep calls in the three locations to increase the likelyhood of the race occurring during our test suite. Here is a diff against ev.c revision 1.477: $ diff ev.c ev.c.sleeps 2311a2312,2313 > usleep(1000); > 2930a2933,2934 > usleep(10000); > 3523a3528,3529 > usleep(10); > If easier, here is a gist in which you can find the injected sleeps by looking for "usleep": https://gist.github.com/bmahler/0dd5ade8083dbd9f94f7 Once these are injected, our test suite regularly trips the SIGPIPE: W0100 00:00:00.000000 41361 main.cpp:38] RAW: Received signal SIGPIPE; escalating to SIGABRT *** Aborted at 1444415504 (unix time) try "date -d @1444415504" if you are using GNU date *** PC: @ 0x7fb365a75b6d raise *** SIGABRT (@0xa191) received by PID 41361 (TID 0x7fb367984780) from PID 41361; stack trace: *** @ 0x7fb365a75ca0 (unknown) @ 0x7fb365a75b6d raise @ 0x4e51d9 handler() @ 0x7fb365a75ca0 (unknown) @ 0x7fb365a74a2b __libc_write @ 0x40ad3b evpipe_write.part.5 @ 0x77ce9d process::run_in_event_loop<>() This seems to confirm the issue! With eventfd enabled the write will silently fail rather than trigger SIGPIPE. This also seems problematic, yes? Looking forward to hearing your thoughts Marc! Ben
_______________________________________________ libev mailing list [email protected] http://lists.schmorp.de/mailman/listinfo/libev
