Background
==========
On a new cluster we are currently building, I moved from Ganglia 3.6.1 to
3.7.2. 3.6.1 has been rock-solid on previous clusters. After the 3.7.2 gmond
has been up for a short time, it begins emitting the error message:

    Incorrect format for spoof argument. exiting.

Debugging
=========
If I enable debugging (e.g. -d 4) I'm shown the parsed contents of the spoof
string, and they are non-empty garbage strings. Tracing in gdb with a
breakpoint on that error message shows that the metric_id passed to the
function has a non-zero .spoof and a .host value that is a garbage string.
In one trace, the .host was an empty string (""); the code in
Ganglia_host_get() assumes that if .spoof is non-zero, then .host is non-null
and a string with length > 0. So the subsequent code:
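
    spoof_info_len = strlen(metric_id->host);
    buff = malloc(spoof_info_len+1);
    strncpy(buff, metric_id->host, spoof_info_len + 1);
    spoofIP = buff;
    if( !(spoofName = strchr(buff+1,':')) ){

can read past the end of the buffer for a zero-length string: malloc() returns
a single byte (for the terminator), and strchr(buff+1, ':') starts scanning one
byte beyond it.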
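As a minimal sketch of a more defensive split (this is not the Ganglia code;
split_spoof() and its return convention are hypothetical names used only for
illustration), the length and the position of the ':' can be validated before
anything is copied or scanned:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Ganglia: split "ip:name" into two
 * NUL-terminated strings that share one allocation.
 * Returns 0 on success, -1 on malformed input or allocation failure.
 * On success the caller frees *ip_out; *name_out points into the same block. */
int split_spoof(const char *host, char **ip_out, char **name_out)
{
    assert(ip_out != NULL && name_out != NULL);

    /* Reject NULL and "" up front so nothing ever looks at host[1]. */
    if (host == NULL || host[0] == '\0')
        return -1;

    /* Require a ':' with at least one character on each side of it. */
    const char *colon = strchr(host, ':');
    if (colon == NULL || colon == host || colon[1] == '\0')
        return -1;

    size_t len = strlen(host);
    char *buff = malloc(len + 1);
    if (buff == NULL)
        return -1;
    memcpy(buff, host, len + 1);      /* copy including the terminator */

    size_t ip_len = (size_t)(colon - host);
    buff[ip_len] = '\0';              /* terminate the IP part */
    *ip_out = buff;
    *name_out = buff + ip_len + 1;    /* name starts just after the ':' */
    return 0;
}
```

With this shape, an empty or colon-less .host is reported as malformed rather
than being scanned past its allocation.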
To isolate possible reasons for the botched spoofing hostname I compared the
gmond/gmond.c source between 3.6.1 and 3.7.2. In
Ganglia_collection_group_send() the following code

    name = cb->msg.Ganglia_value_msg_u.gstr.metric_id.name;
    if (override_hostname != NULL)
      {
        cb->msg.Ganglia_value_msg_u.gstr.metric_id.host =
            apr_pstrcat(gm_pool, (char *)( override_ip != NULL ? override_ip :
                override_hostname ), ":", (char *) override_hostname, NULL);
        cb->msg.Ganglia_value_msg_u.gstr.metric_id.spoof = TRUE;
      }

is allocating the callback's .host field from the temporary metrics APR pool;
but the callback is external to this function and lives on beyond the
destruction of that temporary APR pool. Eventually the memory behind
cb->msg.Ganglia_value_msg_u.gstr.metric_id.host will be reused and overwritten,
yielding the "garbage string" condition that's being observed. In 3.6.1, the
.host field was allocated from global_context. If I modify the code cited
above to use global_context rather than gm_pool, gmond runs without throwing
"Incorrect format for spoof argument" errors.
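To illustrate the lifetime problem without depending on APR, here is a toy
bump allocator. It is only a stand-in for a pool (toy_pool, toy_strdup and
toy_clear are made-up names, and this is not how APR is implemented). The toy
keeps its storage alive so the overwrite can be observed safely; with a real
destroyed pool, reading the stale pointer is undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy bump allocator standing in for a short-lived pool (NOT APR). */
typedef struct {
    char   buf[64];
    size_t used;
} toy_pool;

/* Copy s into the pool; returns NULL if it does not fit. */
char *toy_strdup(toy_pool *p, const char *s)
{
    assert(p != NULL && s != NULL);
    size_t n = strlen(s) + 1;
    if (n > sizeof p->buf - p->used)
        return NULL;
    char *dst = p->buf + p->used;
    memcpy(dst, s, n);
    p->used += n;
    return dst;
}

/* "Destroy" the pool: its storage becomes available for reuse. */
void toy_clear(toy_pool *p)
{
    assert(p != NULL);
    p->used = 0;
}
```

If a string is allocated from a temporary toy_pool, the pool is cleared, and
something else is then allocated from it, the old pointer now sees the new
bytes. A string allocated from a second, long-lived toy_pool is unaffected,
which mirrors the gm_pool versus global_context difference described above.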
Also, in lib/libgmond.c the static global "myhost"

    static char myhost[APRMAXHOSTLEN+1];

is assumed by the rest of the code to start out as a zero-length string:

    if (myhost[0] == '\0')
        apr_gethostname( (char*)myhost, APRMAXHOSTLEN+1, gm_pool);

C does zero-initialize objects with static storage duration, so this works as
written, but it is probably best to be explicit about the initial value of
myhost rather than rely on it implicitly?

    static char myhost[APRMAXHOSTLEN+1] = "";

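A small check of that equivalence, as a sketch: HOSTLEN is a hypothetical
stand-in for APRMAXHOSTLEN, and the two buffers are illustrative, not the
variables in libgmond.c.

```c
#include <assert.h>
#include <string.h>

#define HOSTLEN 256  /* hypothetical stand-in for APRMAXHOSTLEN */

static char implicit_host[HOSTLEN + 1];       /* zero-initialized by the language */
static char explicit_host[HOSTLEN + 1] = "";  /* same contents, intent spelled out */

/* Aborts via assert() if either buffer does not start out as all zero bytes. */
void check_hosts_start_empty(void)
{
    static const char zeros[HOSTLEN + 1];     /* also static, so all zeros */
    assert(memcmp(implicit_host, zeros, sizeof zeros) == 0);
    assert(memcmp(explicit_host, zeros, sizeof zeros) == 0);
}
```

Both declarations produce the same initial contents, so the explicit
initializer is a readability change rather than a behavioural one.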
Happy to contribute patch files, etc.
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::
_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers