j...@acorntoolworks.com (J.T. Conklin) writes:
>> Anyone care to look to verify my observations? If confirmed, I can
>> prepare a pull up request this afternoon.
>
> I took a deeper look last night, and there appear to be more locking
> issues in gmond.c. I'm going to backport my changes to 3.6.1 (as the
> 3.7.X concurrencykit dependency is not something I want to deal with
> right now), and should have results soon.

We've been running a version of 3.6.1 with changes to gmond hash table
locking for the last 5 weeks without problems. I think the changes are
correct and still needed, but probably are not the source of my gmond
lockup problem I reported back in October.

The gmond lockups that I've observed have been for three clusters on
the other side of a WAN from gmetad.  More often than not, the gmond
daemons for all three clusters lock up at the same time.  Clusters 
on this side of the WAN have never exhibited this behavior.

I've connected to gmond with gdb, and get a backtrace like:

(gdb) thread 2
[Switching to thread 2 (Thread 0xb7c6fb90 (LWP 7072))]#0  0x00d4b410 in 
__kernel_vsyscall ()
(gdb) bt
#0  0x00d4b410 in __kernel_vsyscall ()
#1  0x0092ea7b in write () from /lib/libpthread.so.0
#2  0x001300aa in apr_socket_send () from /usr/lib/libapr-1.so.0
#3  0x08050f1a in socket_send_raw ()
#4  0x0805115f in socket_send ()
#5  0x08051430 in print_host_metric ()
#6  0x08051c55 in process_tcp_accept_channel ()
#7  0x08051e60 in poll_tcp_listen_channels ()
#8  0x08051ed1 in tcp_listener ()
#9  0x00136bd6 in ?? () from /usr/lib/libapr-1.so.0
#10 0x00927832 in start_thread () from /lib/libpthread.so.0
#11 0x00866f6e in clone () from /lib/libc.so.6

The accept thread is blocked writing XML output while mutexes are held,
which causes gmond to hang until the write is unblocked.
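
For what it's worth, the failure mode reduces to something like the
sketch below.  This is simplified and the names are made up; the real
code walks the hosts hash table under its own read locks rather than a
single pthread mutex, and goes through the APR socket calls:

  #include <pthread.h>
  #include <unistd.h>

  static pthread_mutex_t host_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Simplified picture of the code path in the backtrace above: the
   * lock is taken before the XML is emitted, and write() on a blocking
   * TCP socket does not return until there is room in the send buffer,
   * which never happens if the client stops reading.  Every other
   * thread that needs host_lock then stalls behind this one. */
  static void emit_host_xml(int client_fd, const char *xml, size_t len)
  {
      pthread_mutex_lock(&host_lock);
      (void) write(client_fd, xml, len);   /* can block indefinitely */
      pthread_mutex_unlock(&host_lock);
  }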

On the system running gmond, there is a half-closed connection with
quite a bit of data still in the output socket buffer. On the system
running gmetad, the connection is not present at all. I suspect, but
have not yet confirmed, that packet loss over the WAN (perhaps from a
stateful device like a firewall that tracks individual flows) caused
the connection to be considered closed by gmetad (so it continues to
operate fine), but not by gmond.

While investigating this problem, I found that I could reliably hang 
gmond by establishing a connection and then sleeping.
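
In case anyone wants to reproduce it, a throwaway client along the
lines of the sketch below is enough (host taken from argv, default
port 8649 assumed): it connects and never reads, so the socket buffers
fill and gmond blocks in write() with its locks still held.

  /* Repro sketch: connect to gmond's XML port and never read.
   * Build: cc -o gmond-hang gmond-hang.c */
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      struct sockaddr_in sa;
      int fd = socket(AF_INET, SOCK_STREAM, 0);

      memset(&sa, 0, sizeof sa);
      sa.sin_family = AF_INET;
      sa.sin_port = htons(8649);       /* default tcp_accept_channel port */
      sa.sin_addr.s_addr = inet_addr(argc > 1 ? argv[1] : "127.0.0.1");

      if (fd < 0 || connect(fd, (struct sockaddr *) &sa, sizeof sa) < 0) {
          perror("connect");
          return 1;
      }

      for (;;)                /* hold the connection open, read nothing */
          sleep(60);
  }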

With the default gmond.conf, which has no tcp_accept_channel timeouts
or ACLs, gmond is vulnerable to DoS attacks.  The DoS could even be
inadvertent, for example paging through the XML output with "netcat
<server> 8649 | more" and then taking a coffee break.

It may be worth considering changing the implementation so that locks
aren't held while the XML is being written.  A naive implementation
would buffer the XML in memory as the hash table is walked, and only
write once the locks are released; that costs a lot of memory, but
there are more clever approaches.  This may be justified to support
environments with huge clusters, so that gmetad polls themselves aren't
responsible for causing gmond to drop UDP packets because locks are
held too long.
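
To make the idea concrete, the naive version might look like the sketch
below.  The buffer type and the lock/walk helpers are hypothetical
stand-ins for gmond's real hash table API, and plain write() stands in
for the APR socket calls:

  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  struct xml_buf { char *data; size_t len, cap; };

  static int xml_buf_append(struct xml_buf *b, const char *s, size_t n)
  {
      if (b->len + n > b->cap) {
          size_t cap = b->cap ? b->cap : 4096;
          while (cap < b->len + n)
              cap *= 2;
          char *p = realloc(b->data, cap);
          if (!p)
              return -1;
          b->data = p;
          b->cap = cap;
      }
      memcpy(b->data + b->len, s, n);
      b->len += n;
      return 0;
  }

  /* Hypothetical stand-ins for gmond's hash table walk and locking. */
  extern void lock_host_table(void);
  extern void unlock_host_table(void);
  extern void append_all_host_xml(struct xml_buf *b);

  static void send_cluster_xml(int client_fd)
  {
      struct xml_buf buf = { NULL, 0, 0 };
      size_t off = 0;

      lock_host_table();
      append_all_host_xml(&buf);       /* only appends, never writes */
      unlock_host_table();

      /* Locks are released, so a stalled client only ties up this
       * listener thread instead of wedging all of gmond. */
      while (off < buf.len) {
          ssize_t n = write(client_fd, buf.data + off, buf.len - off);
          if (n <= 0)
              break;
          off += (size_t) n;
      }
      free(buf.data);
  }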

But for now, perhaps all that is needed is to change the default
timeout such that only deployments that really need/want blocking
behavior get it. Or maybe extra verbiage in the gmond.conf manual
describing the risks of the default.
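
For anyone who wants to protect an existing deployment now, something
along these lines in gmond.conf should help.  I'm going from memory on
the attribute names and units (the timeout value in particular), so
please check the gmond.conf man page before copying this:

  tcp_accept_channel {
    port = 8649
    timeout = 3000000   # believed to be in microseconds; verify
    acl {
      default = "deny"
      access {
        ip = 192.168.0.0
        mask = 16
        action = "allow"
      }
    }
  }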

   --jtc

-- 
J.T. Conklin
