[Ganglia-general] gmond --without-kvm?

2003-03-26 Thread Lester Vecsey
I compiled gmond on FreeBSD 4.4-RELEASE and I'm running it under a non-root
account. /dev/mem on the machine isn't accessible from this account, so
there's a segfault on kvm_open when I run gmond. For now I just stripped out
the swap functions that call kvm_open in gmond; there were a couple of them.
I tried './configure --without-kvm', but it didn't seem to leave config.h,
which has a HAVE_LIBKVM define, set properly. Additionally, the gmond
sources don't use that define around the functions that I had to strip out.

I don't have a patch for this but I thought I'd submit this info and see if
others on FreeBSD have encountered something similar.




Re: [Ganglia-general] Display problem

2003-03-26 Thread Jason A. Smith
I am not sure if this is related, but ganglia doesn't seem to behave
very well if a whole cluster either stops reporting or is removed.  The
rrds no longer get updated and gmetad still keeps a copy of the xml data
from the last time it got it from the cluster, but the webfrontend makes
it appear like the whole cluster is still up and running.  It even says
the last heartbeat was only a few seconds ago.  The only clue that
something is wrong is the empty graphs.

I first noticed this when we installed ganglia on a new cluster, then
removed it a few days later.  I expected the webfrontend to show the
entire cluster as dead, but it didn't.  This could be dangerous if, for
example, you are using ganglia to monitor your cluster and have some
kind of network failure in the part of your cluster that is defined as
the data_source for gmetad, or if those nodes just die themselves.  Except
for the graphs, it will still look like it is up and reporting when it
really isn't.

I haven't had the time to investigate this more, but there must be some
sort of bug in the webfrontend scripts that makes it appear that the
nodes are still up and running, and were even heard from a few seconds
ago.  What about gmetad, though: should it expire any of its data if it
hasn't been updated after some time, or just keep it around so you have
to manually restart it if you want to flush out the old cluster's data?
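
To make the expiry question concrete, here is a minimal sketch of a
staleness check on a cached cluster snapshot.  It is not gmetad code:
ClusterSnapshot, cluster_state, and the max_age threshold are all
hypothetical names chosen for illustration, and the only assumption is
that the poller records when it last fetched data successfully.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical record of the last XML received from a data source."""
    name: str
    xml: str
    fetched_at: float  # time of the last successful poll, in seconds


def cluster_state(snapshot, now, max_age=120.0):
    """Return 'stale' if the snapshot is older than max_age seconds.

    max_age is an assumed, tunable threshold, not a real gmetad setting.
    """
    age = now - snapshot.fetched_at
    return "stale" if age > max_age else "fresh"


# Usage: a snapshot fetched at t=1000 is fresh 30 seconds later and
# stale long after the data source has stopped answering.
snap = ClusterSnapshot("farm1", "<CLUSTER/>", fetched_at=1000.0)
print(cluster_state(snap, now=1030.0))
print(cluster_state(snap, now=2000.0))
```

With something like this, a frontend could mark the whole cluster as
down instead of redrawing the cached data as if it were current.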

~Jason


On Wed, 2003-03-26 at 13:50, Steven Wagner wrote:
> matt massie wrote:
> > prashant-
> > 
> > so when a node in the cluster dies the cluster size changes but the dead 
> > node is not reported?
> > 
> > this is a new problem that i haven't heard of before.  did gmond get 
> > restarted after the node failed?  ganglia knows that a node has died when it 
> > stops getting heartbeats from a machine that it previously heard from.  if 
> > gmond is getting restarted somehow it wouldn't know about the dead node 
> > because it hasn't even received a single heartbeat from it (remember that 
> > everything in gmond is soft state).
> > 
> > is it possible that your gmond data source was restarted after the node 
> > died?
> > 
> > i'm sure if we walk through this we'll find the solution to the problem.
> 
> Now that I think about it, I seem to recall this happening to me in one of 
> the recent (but not current) 2.5.x frontend revisions.  There was a bug in 
> (I believe) ganglia.php which was not incrementing the dead node array.
> 
> I'm pretty sure the reason I didn't respond to the original message was 
> that he's using the most current version and still gets the same behavior, 
> so I was stumped.  But I just had that idea again and decided to throw it 
> out there in the hope of it being useful...
> 
> And I know none of the regular readers of this list believe me, but I 
> really *do* try not to go shooting off my mouth when I have no idea how to 
> fix the problem... :)
> 
> 
> 
> ---
> This SF.net email is sponsored by:
> The Definitive IT and Networking Event. Be There!
> NetWorld+Interop Las Vegas 2003 -- Register today!
> http://ads.sourceforge.net/cgi-bin/redirect.pl?keyn0001en
> ___
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
-- 
Jason A. Smith                          Email: [EMAIL PROTECTED]
Atlas Computing Facility, Bldg. 510M    Phone: (631)344-4226
Brookhaven National Lab, P.O. Box 5000  Fax:   (631)344-7616
Upton, NY 11973-5000




Re: [Ganglia-general] Display problem

2003-03-26 Thread Steven Wagner

matt massie wrote:
> prashant-
> 
> so when a node in the cluster dies the cluster size changes but the dead 
> node is not reported?
> 
> this is a new problem that i haven't heard of before.  did gmond get 
> restarted after the node failed?  ganglia knows that a node has died when it 
> stops getting heartbeats from a machine that it previously heard from.  if 
> gmond is getting restarted somehow it wouldn't know about the dead node 
> because it hasn't even received a single heartbeat from it (remember that 
> everything in gmond is soft state).
> 
> is it possible that your gmond data source was restarted after the node 
> died?
> 
> i'm sure if we walk through this we'll find the solution to the problem.

Now that I think about it, I seem to recall this happening to me in one of 
the recent (but not current) 2.5.x frontend revisions.  There was a bug in 
(I believe) ganglia.php which was not incrementing the dead node array.


I'm pretty sure the reason I didn't respond to the original message was 
that he's using the most current version and still gets the same behavior, 
so I was stumped.  But I just had that idea again and decided to throw it 
out there in the hope of it being useful...


And I know none of the regular readers of this list believe me, but I 
really *do* try not to go shooting off my mouth when I have no idea how to 
fix the problem... :)





Re: [Ganglia-general] Display problem

2003-03-26 Thread matt massie
prashant-

so when a node in the cluster dies the cluster size changes but the dead 
node is not reported?

this is a new problem that i haven't heard of before.  did gmond get 
restarted after the node failed?  ganglia knows that a node has died when it 
stops getting heartbeats from a machine that it previously heard from.  if 
gmond is getting restarted somehow it wouldn't know about the dead node 
because it hasn't even received a single heartbeat from it (remember that 
everything in gmond is soft state).
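
to illustrate the soft-state point, here is a minimal sketch of a
heartbeat table.  this is not gmond's implementation: HeartbeatTable and
the dead_after timeout are hypothetical names, and the only assumption is
that a host becomes known the first time a heartbeat from it is heard.

```python
import time


class HeartbeatTable:
    """Soft-state table: a host exists only once a heartbeat is heard."""

    def __init__(self, dead_after=60.0):
        self.dead_after = dead_after  # assumed timeout in seconds
        self.last_heard = {}          # host -> time of last heartbeat

    def heartbeat(self, host, now=None):
        """Record a heartbeat from host at time now (default: wall clock)."""
        self.last_heard[host] = time.time() if now is None else now

    def status(self, now=None):
        """Return (up, dead) lists of the hosts this table has heard from."""
        now = time.time() if now is None else now
        up, dead = [], []
        for host, heard in sorted(self.last_heard.items()):
            if now - heard > self.dead_after:
                dead.append(host)
            else:
                up.append(host)
        return up, dead


# a long-running table heard from 12 nodes, then node12 went silent.
table = HeartbeatTable(dead_after=60.0)
for i in range(1, 13):
    table.heartbeat(f"node{i}", now=0.0)
for i in range(1, 12):
    table.heartbeat(f"node{i}", now=100.0)
up, dead = table.status(now=100.0)
print(len(up), dead)

# a table created after node12 died never hears from it at all.
restarted = HeartbeatTable(dead_after=60.0)
for i in range(1, 12):
    restarted.heartbeat(f"node{i}", now=100.0)
up, dead = restarted.status(now=100.0)
print(len(up), dead)
```

the first table reports node12 as dead, while the restarted one reports
11 hosts and no dead nodes, which matches the symptom described in this
thread.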

is it possible that your gmond data source was restarted after the node 
died?

i'm sure if we walk through this we'll find the solution to the problem.
-- 
matt

Yesterday, Prashant Bhamidipati wrote forth saying...

> Hi Steven / Matt,
> 
> I have Ganglia up and running on two farms, and everything was
> working well until two days ago.
> 
> One of the machines on a farm was lost due to a network connection
> problem.
> 
> But ganglia still shows all nodes as up and running (???). How can
> I rectify this problem?
> 
> For example: if one of 12 nodes dies, ganglia tells me that there are 11
> nodes in the cluster and all are up and working, i.e. there are zero nodes
> down. Why does it not tell me that the 12th node is dead and that
> 11 nodes out of 12 are working instead?
> 
> -Prashant