I believe that with a gmetad polling interval of 5 minutes you will probably end up seeing a lot of your nodes as dead. See the host_alive function in the ganglia.php file. The web frontend considers a host alive only if it last heard from it within the last 4*TMAX seconds, and I believe that TMAX is set to 20 seconds in the gmond code. That gives a window of 4*20 = 80 seconds, which is much shorter than your 300-second polling interval. Therefore, if you reload the web frontend shortly before gmetad is about to get fresh data, there is a good chance that most nodes will have TN greater than 4*TMAX. It looks like ganglia3 has TMAX hard coded to 20 seconds for hosts, see:
ganglia-3.0.1/gmond/gmond.c - line 960

I couldn't find it in the code for ganglia2, but with a running gmond it appears to be set to 70 seconds.

~Jason

On Thu, 2005-08-18 at 18:43, Utsav Agarwal wrote:
> Hello all,
>
> A quick response would help!
>
> Our cluster nodes send udp unicast packets to a gmond ‘collector’. The
> gmond.conf on all the nodes (compute and collector) has the following
> values:
>
> cleanup_threshold = 300 secs, heartbeat = 20 secs, collect_every = 300
> secs, time_threshold = 900 secs
>
> Now, the gmetad server polls the gmond ‘collector’ every 300 secs. (5
> minutes). What we see is that the nodes are shown up sometimes, and
> then down sometimes. They flap often. Generally, either all nodes are
> shown up or all nodes are shown down. While reporting the nodes are
> down, it also shows that it received a heartbeat within the last 20
> seconds.
>
> We need to know the exact reason this is happening.
>
> The gmetad.conf file has default values for rrd archives. Changing the
> gmetad server to poll every 120 seconds, does not seem to solve the
> problem either.
>
> Any suggestions or guidelines to follow for gmetad polling interval
> and gmond cleanup_threshold values will be appreciated.
>
> Thanks,
>
> ------------------------------------------------------------------------------------
> Utsav Agarwal
> Systems Analyst
> ------------------------------------------------------------------------------------
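
To make the 4*TMAX reasoning in the reply concrete, here is a rough sketch in Python (not the frontend's PHP). It is not the actual host_alive code from ganglia.php: the function names, and the assumption that the TN the frontend sees is the TN at poll time plus the age of gmetad's data, are illustrative only.

```python
def host_alive(tn, tmax=20):
    """Alive if the host was heard from within the last 4*TMAX seconds."""
    return tn <= 4 * tmax


def seconds_shown_alive(poll_interval, tn_at_poll=0, tmax=20):
    """Count the seconds of one gmetad poll cycle in which a host looks alive.

    Assumption (hypothetical model): the TN seen by the frontend equals the
    TN at poll time plus the number of seconds since gmetad last polled.
    """
    return sum(
        1
        for age in range(poll_interval)
        if host_alive(tn_at_poll + age, tmax)
    )


# Compare a few polling intervals against the 80-second window (TMAX = 20).
for interval in (300, 120, 60):
    alive = seconds_shown_alive(interval)
    print(f"poll every {interval}s: shown alive for {alive}s of each cycle")
```

Under these assumptions, a host looks dead for most of a 300-second cycle, and a 120-second interval still exceeds the 80-second window, which would be consistent with the flapping persisting after that change. Only an interval at or below 4*TMAX keeps the host shown as alive throughout.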