j...@acorntoolworks.com (J.T. Conklin) writes:

>> Anyone care to look to verify my observations?  If confirmed, I can
>> prepare a pull up request this afternoon.
>
> I took a deeper look last night, and there appear to be more locking
> issues in gmond.c.  I'm going to backport my changes to 3.6.1 (as the
> 3.7.X concurrencykit dependency is not something I want to deal with
> right now), and should have results soon.
We've been running a version of 3.6.1 with changes to gmond hash table locking for the last 5 weeks without problems. I think the changes are correct and still needed, but they probably aren't the source of the gmond lockup problem I reported back in October.

The gmond lockups that I've observed have been for three clusters on the other side of a WAN from gmetad. More often than not, the gmond daemons for all three clusters lock up at the same time. Clusters on this side of the WAN have never exhibited this behavior.

I've connected to gmond with gdb, and get a backtrace like:

(gdb) thread 2
[Switching to thread 2 (Thread 0xb7c6fb90 (LWP 7072))]#0  0x00d4b410 in __kernel_vsyscall ()
(gdb) bt
#0  0x00d4b410 in __kernel_vsyscall ()
#1  0x0092ea7b in write () from /lib/libpthread.so.0
#2  0x001300aa in apr_socket_send () from /usr/lib/libapr-1.so.0
#3  0x08050f1a in socket_send_raw ()
#4  0x0805115f in socket_send ()
#5  0x08051430 in print_host_metric ()
#6  0x08051c55 in process_tcp_accept_channel ()
#7  0x08051e60 in poll_tcp_listen_channels ()
#8  0x08051ed1 in tcp_listener ()
#9  0x00136bd6 in ?? () from /usr/lib/libapr-1.so.0
#10 0x00927832 in start_thread () from /lib/libpthread.so.0
#11 0x00866f6e in clone () from /lib/libc.so.6

The accept thread is blocked writing XML output while mutexes are held, which causes gmond to hang until the write is unblocked. On the system running gmond, there is a half-closed connection with quite a bit of data still in the output socket buffer. On the system running gmetad, the connection is not present at all. I suspect, but have not yet confirmed, that packet loss over the WAN (perhaps at a stateful device like a firewall that tracks individual flows) caused the connection to be considered closed by gmetad (so it continues to operate fine), but not by gmond.

While investigating this problem, I found that I could reliably hang gmond by establishing a connection and then sleeping.
The default gmond.conf, without tcp_accept_channel timeouts or ACLs, is vulnerable to DOS attacks against gmond. The DOS could even be inadvertent, for example paging through the XML output with "netcat <server> 8649 | more" and then taking a coffee break.

It may be worth considering changing the implementation so that locks aren't held while the XML is being written. A naive implementation would require a lot of memory to buffer the XML as the hash table is walked, writing only after the locks are released, but there are more clever approaches. This may be justified to support environments with huge clusters, so that gmetad polls themselves aren't responsible for causing gmond to drop UDP packets due to locks being held too long.

But for now, perhaps all that is needed is to change the default timeout such that only deployments that really need/want blocking behavior get it. Or maybe extra verbiage in the gmond.conf manual describing the risks of the default.

    --jtc

-- 
J.T. Conklin

_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers