I've deployed ganglia 3.6.0 on ~200 servers. About once per week a gmond becomes unresponsive. When I attach to the gmond process with gdb, I often see that it is blocked acquiring a lock. Unfortunately, I didn't install with debug symbols, so I don't have a lot to go on.
That being said, gmond.c:Ganglia_value_save() shows up in several of the backtraces, and from code inspection it appears that a host->mutex acquisition/release is missing around the call to apr_hash_get(). I'm not sure this is the problem I'm observing, since the process is blocked in the call to apr_thread_mutex_lock() near the end of the function. On the other hand, missing locks have been known to cause corruption that leads to unexpected behavior.

The same function also tests that host and message are non-null in order to return early, but the test is pointless because both have already been dereferenced by that point.

Would anyone care to take a look and verify my observations? If confirmed, I can prepare a pull request this afternoon.

    --jtc

--
J.T. Conklin
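
P.S. To make the two points concrete, here is a minimal sketch of the pattern I have in mind. It is not the gmond code: it uses pthreads rather than APR so that it stands alone, and host_t, message_t, find_metric() and value_save() are hypothetical stand-ins for the real structures, apr_hash_get() and Ganglia_value_save(). The arguments are checked before anything is dereferenced, and the lookup happens while the per-host mutex is held.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_METRICS 8

/* Hypothetical stand-ins; these are NOT the real gmond structures. */
typedef struct {
    const char *name;   /* borrowed pointer; the caller keeps ownership */
    double value;
} metric_t;

typedef struct {
    pthread_mutex_t mutex;          /* plays the role of host->mutex */
    metric_t metrics[MAX_METRICS];  /* plays the role of the per-host hash */
    size_t count;
} host_t;

typedef struct {
    const char *name;
    double value;
} message_t;

static void host_init(host_t *host)
{
    pthread_mutex_init(&host->mutex, NULL);
    host->count = 0;
}

/* Unlocked lookup (stand-in for apr_hash_get); caller must hold host->mutex. */
static metric_t *find_metric(host_t *host, const char *name)
{
    for (size_t i = 0; i < host->count; i++) {
        if (strcmp(host->metrics[i].name, name) == 0)
            return &host->metrics[i];
    }
    return NULL;
}

/* Returns 0 on success, -1 on bad arguments or a full table. */
static int value_save(host_t *host, const message_t *msg)
{
    int rc = 0;
    metric_t *m;

    /* Validate BEFORE the first dereference, not after. */
    if (host == NULL || msg == NULL || msg->name == NULL)
        return -1;

    /* Lookup and update both happen under the per-host lock. */
    pthread_mutex_lock(&host->mutex);
    m = find_metric(host, msg->name);
    if (m == NULL) {
        if (host->count < MAX_METRICS) {
            m = &host->metrics[host->count++];
            m->name = msg->name;
        } else {
            rc = -1;
        }
    }
    if (m != NULL)
        m->value = msg->value;
    pthread_mutex_unlock(&host->mutex);

    return rc;
}
```

Whether the real fix should take the lock around the lookup only, or around the whole function body as above, depends on what else touches the hash; I have not checked that yet.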