To the memcached community,

I have found a race condition bug in memcached. The details are below, along with a patch that fixes the problem. We would love to hear what the community thinks of this patch.
Under normal circumstances, each worker thread accesses only its own event_base, and the same goes for the main thread. We have found a specific situation in which a worker thread accesses the main thread's event_base (main_base), with unexpected results.

[Example Scenario]

1. Throw a lot of clients (well over the connection limit) at memcached.
2. memcached's file descriptors reach the configured maximum.
3. The main thread calls accept_new_conns(false) to stop polling sfd.
4. The main thread's event_base_loop stops accepting incoming requests.
5. The main thread stops accessing main_base at this point.
6. A client disconnects.
7. A worker thread calls accept_new_conns(true) to start polling sfd again.
8. accept_new_conns uses a mutex to protect main_base against race conditions.
9. The worker thread starts looping over listen_conn.
10. The worker thread calls update_event() with the first conn.
11. After the first update_event(), the main thread starts polling sfd and starts to access main_base. <- PROBLEM
12. The worker thread continues and calls update_event() with the second conn.

At this point, the worker thread and the main thread both access and modify main_base. The event_count becomes incorrect: it is set to zero while there is an actual event waiting. The result? memcached falls through event_base_loop() and the daemon quietly shuts down.

[Quick Fix]

Set memcached to listen on a single interface only.

Example memcached setting:

> memcached -U 0 -u nobody -p 11222 -t 4 -m 16000 -C -c 1000 -l 192.168.0.1 -v

[Reproducing]

Use the attack script (thanks to mala): http://gist.github.com/522741

- With the -l interface restriction: we have seen over 70 hours of stability. Yes, you will see "Too many open connections." but that is not the issue here.
- Without the -l interface restriction: memcached quits under the attack script.

Please give us some feedback on the attached patch. It should fix the race condition we have experienced.

Finally, we would like to thank the numerous contributors on Twitter who helped us nail this problem: http://togetter.com/li/41702

Shigeki
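
[Illustrative Sketch]

For readers who want to see the underlying idea in isolation: one common way to avoid this kind of race is to never let a worker thread touch the main thread's state at all, and instead have the worker send a request that the main thread handles itself. The sketch below shows that hand-off with a plain pipe and pthreads. It is not the attached patch and not memcached's actual code; struct listener_state, request_accept(), handle_notify() and demo_handoff() are hypothetical names made up for this example, and the accepting flag merely stands in for whatever main_base tracks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for state that only the main thread may modify. */
struct listener_state {
    pthread_t owner;     /* thread that owns this state (the "main thread") */
    bool accepting;      /* whether the listening sockets are being polled */
    int notify_fds[2];   /* pipe: [0] is read by the owner, [1] is written by workers */
};

/* Must be called on the thread that will own the state. */
int listener_init(struct listener_state *ls) {
    ls->owner = pthread_self();
    ls->accepting = false;
    return pipe(ls->notify_fds);
}

/* Safe to call from any worker thread: it only writes one byte to the pipe
 * and never modifies the owner's state directly. */
int request_accept(struct listener_state *ls) {
    char cmd = 'a';
    return write(ls->notify_fds[1], &cmd, 1) == 1 ? 0 : -1;
}

/* Must run on the owner thread, for example from a read event registered
 * on notify_fds[0]. This is the only place the state changes. */
int handle_notify(struct listener_state *ls) {
    char cmd;
    assert(pthread_equal(pthread_self(), ls->owner));
    if (read(ls->notify_fds[0], &cmd, 1) != 1)
        return -1;
    if (cmd == 'a')
        ls->accepting = true;
    return 0;
}

static void *worker(void *arg) {
    request_accept(arg);
    return NULL;
}

/* Spawns one worker that asks for accepting to be re-enabled, then lets the
 * owner thread apply the change. Returns 1 if accepting ended up enabled,
 * 0 if not, and -1 on error. */
int demo_handoff(void) {
    struct listener_state ls;
    pthread_t t;
    int result;

    if (listener_init(&ls) != 0)
        return -1;
    if (pthread_create(&t, NULL, worker, &ls) != 0)
        return -1;
    pthread_join(t, NULL);

    result = (handle_notify(&ls) == 0 && ls.accepting) ? 1 : 0;
    close(ls.notify_fds[0]);
    close(ls.notify_fds[1]);
    return result;
}
```

Calling demo_handoff() from a small main() and linking with -pthread exercises the hand-off once. The design point is that the worker only ever writes to the pipe, so the state it wants changed is modified from a single thread and no lock is needed around it.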