Background
==========

On a new cluster we are building right now, I moved from Ganglia 3.6.1 to 3.7.2.
3.6.1 has been rock-solid on previous clusters.  After the 3.7.2 gmond has been
up for a short period of time, it begins emitting the error message:


Incorrect format for spoof argument. exiting.



Debugging
=========

If I enable debugging (e.g. -d 4) I'm shown the parsed contents of the spoof 
string -- and they are non-zero garbage strings.  Tracing in gdb with a 
breakpoint on that error message shows that the metric_id passed to the 
function has a non-zero .spoof and a garbage string in .host.


In one trace, the .host was an empty string (""); the code in 
Ganglia_host_get() assumes that if .spoof is non-zero, then .host is non-null 
and a string with length > 0.  So the subsequent code:


      spoof_info_len = strlen(metric_id->host);
      buff = malloc(spoof_info_len+1);
      strncpy(buff, metric_id->host, spoof_info_len + 1);
      spoofIP = buff;
      if( !(spoofName = strchr(buff+1,':')) ){


can read past the end of the buffer for a zero-length string: buff is then a 
single byte holding only the terminator, and strchr() starts at buff+1.
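
One way to guard against this is to reject a NULL or empty string before doing 
any pointer arithmetic on it.  The following is only a minimal sketch, not the 
Ganglia code: split_spoof_arg() is a hypothetical helper name, and it assumes 
the spoof string has the form "ip:name" with no colon inside the IP part.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Ganglia): split "ip:name" into two
 * newly allocated strings.  Returns 0 on success, -1 on malformed input.
 * On success the caller owns *ip_out and *name_out and must free() them.
 */
int split_spoof_arg(const char *host, char **ip_out, char **name_out)
{
    const char *colon;
    size_t ip_len, name_len;
    char *ip, *name;

    /* Reject NULL and "" before touching anything past the first byte. */
    if (host == NULL || host[0] == '\0')
        return -1;

    /* Require a non-empty part on both sides of the first colon. */
    colon = strchr(host, ':');
    if (colon == NULL || colon == host || colon[1] == '\0')
        return -1;

    ip_len = (size_t)(colon - host);
    name_len = strlen(colon + 1);

    ip = malloc(ip_len + 1);
    name = malloc(name_len + 1);
    if (ip == NULL || name == NULL) {
        free(ip);
        free(name);
        return -1;
    }

    memcpy(ip, host, ip_len);
    ip[ip_len] = '\0';
    memcpy(name, colon + 1, name_len + 1);  /* copies the terminator too */

    *ip_out = ip;
    *name_out = name;
    return 0;
}
```

With this shape, an empty .host fails cleanly with -1 instead of being scanned 
beyond its terminator.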


To isolate possible reasons for the botched spoofing hostname I compared the 
gmond/gmond.c source between 3.6.1 and 3.7.2.  In 
Ganglia_collection_group_send() the following code


            name = cb->msg.Ganglia_value_msg_u.gstr.metric_id.name;
            if (override_hostname != NULL)
              {
                cb->msg.Ganglia_value_msg_u.gstr.metric_id.host =
                  apr_pstrcat(gm_pool,
                              (char *)( override_ip != NULL ? override_ip : override_hostname ),
                              ":", (char *) override_hostname, NULL);
                cb->msg.Ganglia_value_msg_u.gstr.metric_id.spoof = TRUE;
              }


allocates the callback's .host field from the temporary metrics APR pool, but 
the callback is external to this function and lives on beyond the 
destruction of that temporary APR pool.  Eventually the memory behind 
cb->msg.Ganglia_value_msg_u.gstr.metric_id.host is reused and overwritten, 
yielding the "garbage string" condition being observed.  In 3.6.1, the 
.host field was allocated from global_context.  If I modify the code cited 
above to use global_context rather than gm_pool, gmond runs without throwing 
"Incorrect format for spoof argument" errors.
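
To illustrate the lifetime problem in isolation, here is a toy sketch.  It does 
not use APR at all: toy_pool, toy_strdup(), toy_clear(), and toy_callback are 
made-up stand-ins for a short-lived pool and a long-lived callback record, and 
the buffer size is arbitrary.  It only shows why a pointer taken from a pool 
that is later cleared and reused ends up pointing at unrelated data.

```c
#include <stddef.h>
#include <string.h>

/* Toy bump allocator standing in for a memory pool.  NOT the APR API. */
typedef struct {
    char   buf[64];
    size_t used;
} toy_pool;

/* Copy s into the pool; returns NULL if the pool is full. */
static char *toy_strdup(toy_pool *p, const char *s)
{
    size_t n = strlen(s) + 1;
    char *dst;

    if (n > sizeof(p->buf) - p->used)
        return NULL;
    dst = p->buf + p->used;
    memcpy(dst, s, n);
    p->used += n;
    return dst;
}

/* Like clearing a pool: later allocations start again at the beginning. */
static void toy_clear(toy_pool *p)
{
    p->used = 0;
}

/* Stand-in for the long-lived callback record that outlives each send. */
typedef struct {
    const char *host;
    int         spoof;
} toy_callback;

/*
 * Store `expected` in a callback using memory from `alloc_from`, then clear
 * and reuse `temp` the way a per-collection temporary pool would be.
 * Returns 1 if the callback's host string is still intact, 0 otherwise.
 */
int host_survives_reuse(toy_pool *alloc_from, toy_pool *temp,
                        const char *expected)
{
    toy_callback cb;

    cb.host = toy_strdup(alloc_from, expected);
    cb.spoof = 1;
    if (cb.host == NULL)
        return 0;

    toy_clear(temp);                                 /* temp pool goes away */
    (void)toy_strdup(temp, "unrelated metric data"); /* and is reused       */

    return strcmp(cb.host, expected) == 0;
}
```

Calling host_survives_reuse() with the same pool for both arguments mirrors the 
3.7.2 behavior (the string is overwritten); passing a separate long-lived pool 
as alloc_from mirrors allocating from global_context (the string survives).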


Also, in lib/libgmond.c the static global "myhost"


static char myhost[APRMAXHOSTLEN+1];


is assumed by the rest of the code to have been initialized by the compiler to 
be a zero-length string:


  if (myhost[0] == '\0')
      apr_gethostname( (char*)myhost, APRMAXHOSTLEN+1, gm_pool);


The C standard does zero-initialize objects with static storage duration, so 
this works as written, but it is probably clearer to be explicit about the 
initial value of myhost:


static char myhost[APRMAXHOSTLEN+1] = "";


Happy to contribute patch files, etc.




::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::




_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
