I've deployed ganglia 3.6.0 on ~200 servers. About once per week a
gmond becomes unresponsive.  When I attach to the gmond process with
gdb, I often see that it is blocked acquiring a lock. Unfortunately, 
I didn't install with debug symbols, so I don't have a lot to go on.

That being said, gmond.c:Ganglia_value_save() shows up in several of
the backtraces, and from code inspection it appears that there is a
missing host->mutex acquisition/release around apr_hash_get().  I'm
not sure this is the problem I'm observing, since the process is
blocked in the call to apr_thread_mutex_lock() near the end of the
function. On the other hand, missing locks are known to cause
corruption that leads to unexpected behavior.
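
To make the locking concern concrete, here is a minimal sketch of the
pattern, not the actual gmond code. It uses a pthread mutex and a small
array as stand-ins for host->mutex and the APR hash; the names
demo_host, demo_metric, find_metric and demo_save_value are
hypothetical. The point is only that the lookup and the update happen
under the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the real gmond structures. */
#define DEMO_MAX_METRICS 8

struct demo_metric {
    const char *name;   /* not copied: caller keeps the string alive */
    int value;
};

struct demo_host {
    pthread_mutex_t mutex;                        /* role of host->mutex */
    struct demo_metric metrics[DEMO_MAX_METRICS]; /* role of the APR hash */
    size_t n_metrics;
};

/* Unlocked lookup helper: the caller must already hold h->mutex. */
static struct demo_metric *find_metric(struct demo_host *h, const char *name)
{
    for (size_t i = 0; i < h->n_metrics; i++)
        if (strcmp(h->metrics[i].name, name) == 0)
            return &h->metrics[i];
    return NULL;
}

/*
 * Save a value while holding the lock across both the lookup and the
 * update, so no other thread can modify the table in between.
 * Returns 0 on success, -1 if the table is full.
 */
int demo_save_value(struct demo_host *h, const char *name, int value)
{
    pthread_mutex_lock(&h->mutex);

    struct demo_metric *m = find_metric(h, name);
    if (m == NULL && h->n_metrics < DEMO_MAX_METRICS) {
        m = &h->metrics[h->n_metrics++];  /* insert a new entry */
        m->name = name;
    }
    if (m != NULL)
        m->value = value;

    pthread_mutex_unlock(&h->mutex);
    return m != NULL ? 0 : -1;
}
```

If the lookup ran outside the lock, a concurrent insert could change
the table while it is being read, which is the kind of corruption
suspected above.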

Also, the same function tests that host and message are non-null and
returns early if either is null, but the test is pointless because
both have already been dereferenced by that point.
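
A minimal sketch of the ordering that would make such a test
meaningful, using hypothetical simplified types (msg_stub, host_stub)
and a hypothetical function name rather than the real gmond
definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the real gmond structures. */
struct msg_stub  { int id; };
struct host_stub { int last_id; };

/*
 * Check the pointers before touching them. If reads such as
 * message->id came first, a NULL argument would already have caused
 * a crash (or undefined behavior) before the test could run.
 */
int stub_save_value(struct host_stub *host, const struct msg_stub *message)
{
    if (host == NULL || message == NULL)
        return -1;                /* early return is reachable here */

    host->last_id = message->id;  /* safe: both pointers were checked */
    return 0;
}
```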

Would anyone care to take a look and verify my observations? If
confirmed, I can prepare a pull request this afternoon.

    --jtc

-- 
J.T. Conklin

------------------------------------------------------------------------------
_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
