Today, Steven A. DuChene wrote forth saying...

> It was not just one person. Our oscar core team seems to be able to
> reproduce this problem on a frequent basis. See the bug report at:
> 
> https://sourceforge.net/tracker/index.php?func=detail&aid=602940&group_id=9368&atid=109368

i've read the bug report on your web site.  it doesn't have many 
details, but i do know a good test that should give us useful information.

the send_all_metric_data() function lives in the 
./monitor-core/gmond/gmond.c file.  if the problem shows up when the entire 
cluster is rebooted, then this function is a good candidate for the cause.

without this function, new gmonds will have incomplete information until 
the longest time threshold has been passed (about an hour in the default 
configuration).  to fix this.. i have the gmond processes reset their 
thresholds when they get data from a new gmond.  this could cause problems 
if every gmond is a "new" gmond from a reboot.  
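
to show why a full reboot is the worrying case, here is a toy model in C.  
it is not gmond code: the function name and the rule "one full resend per 
peer that looks new" are assumptions made for illustration, and the real 
code may coalesce resets.

```c
/* Toy model, not taken from the ganglia source.
 * Assumption: a node resends its full metric set once for every peer
 * that looks "new" to it.  A peer looks new if that peer just rebooted,
 * or if the listening node itself just rebooted and knows nobody yet. */
unsigned long full_resends(unsigned long nodes, unsigned long rebooted)
{
    unsigned long resends = 0;
    unsigned long a, b;

    if (rebooted > nodes)
        rebooted = nodes;

    for (a = 0; a < nodes; a++) {
        int a_is_fresh = (a < rebooted);      /* nodes 0..rebooted-1 rebooted */
        for (b = 0; b < nodes; b++) {
            int b_is_fresh = (b < rebooted);
            if (a == b)
                continue;                     /* a node is never new to itself */
            if (a_is_fresh || b_is_fresh)
                resends++;                    /* a resets thresholds for b */
        }
    }
    return resends;
}
```

under this model, one rebooted node in a 64 node cluster triggers 126 full 
resends, while rebooting all 64 at once triggers 4032.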

to test this theory.. just comment out the body of the 
send_all_metric_data() function.  if you reboot and the problem doesn't 
show up then we've found our problem.
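
a minimal sketch of that test, assuming the function takes no arguments and 
returns nothing (check the real signature in gmond.c).  the DISABLE_SEND_ALL 
switch and the resend_count counter are hypothetical stand-ins, not part of 
the ganglia source.

```c
/* Sketch only: the real signature and body in gmond.c may differ. */
#define DISABLE_SEND_ALL 1      /* hypothetical switch; set to 0 to restore */

static int resend_count = 0;    /* stand-in for the real multicast work */

void send_all_metric_data(void)
{
#if !DISABLE_SEND_ALL
    /* the original body would go here, left untouched */
    resend_count++;
#endif
}
```

wrapping the body in a preprocessor guard instead of deleting it keeps the 
original code in place, so it is easy to turn back on after the test.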

in ganglia 3 (which we are coding up now)... the synchronization will 
happen over a tcp connection to the eldest gmond using XML instead of 
multicast XDR.

can you also check /var/log/messages?  do a kill -11 (SIGSEGV) on the 
gmond process?  or run gmond in debug mode to help diagnose the problem?

thanks for the help.  now that i know this is a bug with 2.5.0.. i want to 
make sure it doesn't exist in 2.5.1.

-matt