Re: [Ganglia-developers] Re: Scaling Issues? and Memory SizeProblems (combined)

david Mon, 05 Jan 2004 09:44:41 -0800

----- Original Message -----
From: "Jason A. Smith" <[EMAIL PROTECTED]>
To: "Josh Durham" <[EMAIL PROTECTED]>
Cc: "Ganglia Developers" <ganglia-developers@lists.sourceforge.net>
Sent: Wednesday, December 31, 2003 5:55 PM
Subject: Re: [Ganglia-developers] Re: Scaling Issues? and Memory
SizeProblems (combined)



> I have noticed the same exact problem here, occasionally some nodes
> would get marked as down even though they are still up, and it appears
> to be the same timing issue.  Based on what you discovered below it
> appears that gmetad is using an unsigned int to store TN and gmond is
> using a signed int.

I didn't look at the code (I am writing from Switzerland on a slow, slow
connection), but you may be right. Gmetad should probably use a signed int
for tn.
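
To illustrate the suspected mismatch, here is a minimal standalone sketch. It
is not code from the Ganglia tree, and the function names are made up for the
example. It evaluates the liveness test quoted further down (tn < tmax * 4)
once with TN held in an unsigned int and once in a signed int, assuming
32-bit values on both sides:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: liveness test with TN stored as unsigned.
 * A TN of -1 arrives here as 4294967295, so the test fails. */
int alive_unsigned_tn(uint32_t tn, uint32_t tmax)
{
    return tn < tmax * 4u;
}

/* Hypothetical helper: the same test with TN stored as signed.
 * A TN of -1 stays -1 and compares below tmax * 4. */
int alive_signed_tn(int32_t tn, int32_t tmax)
{
    return tn < tmax * 4;
}

/* Print how one TN value looks and behaves under each representation. */
void show_tn(int32_t tn, int32_t tmax)
{
    printf("signed:   tn=%d alive=%d\n", (int)tn, alive_signed_tn(tn, tmax));
    printf("unsigned: tn=%u alive=%d\n", (unsigned)(uint32_t)tn,
           alive_unsigned_tn((uint32_t)tn, (uint32_t)tmax));
}
```

Calling show_tn(-1, 20) reports the host as alive in the signed case and as
down in the unsigned case, where -1 appears as 4294967295. That matches the
TN values in the gmetad output quoted below.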

>
> I think I remember that several months ago ganglia was patched to call
> the time system call a lot less to improve efficiency. I bet that is when
> this timing bug was introduced, which causes the webfrontend to mark some
> nodes as down if the condition you discovered occurs.  Any ideas on how
> to fix it without putting all the time system calls back in?
>

I don't think we need to put the time calls back in. Just fix the bug and we
should be fine.
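
One possible shape for such a fix, sketched under assumptions rather than
taken from the actual patch: compute TN in a signed type and clamp a
negative result to zero, so a report that arrives just after the timestamp
was recorded counts as fresh. The names clamp_tn and host_is_alive are
hypothetical, not gmetad functions.

```c
#include <stdint.h>

/* Hypothetical helper: age of a host report in seconds.
 * The subtraction is done in a signed 64-bit type, so a report stamped
 * slightly later than cluster_localtime yields 0 instead of wrapping. */
int32_t clamp_tn(int64_t cluster_localtime, int64_t reported)
{
    int64_t tn = cluster_localtime - reported;

    if (tn < 0)
        return 0;
    if (tn > INT32_MAX)
        return INT32_MAX;
    return (int32_t)tn;
}

/* Hypothetical helper: the liveness test (tn < tmax * 4) on signed values.
 * The multiplication is widened to 64 bits to avoid overflow. */
int host_is_alive(int32_t tn, int32_t tmax)
{
    return tn < (int64_t)tmax * 4;
}
```

With the values from the debug output quoted below
(cluster_localtime=1072825831, reported=1072825832, tmax=20), clamp_tn
returns 0 and host_is_alive returns 1, with no extra time system calls.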

-Federico

> ~Jason
>
>
> On Tue, 2003-12-30 at 18:38, Josh Durham wrote:
> > Thanks for your quick response.
> >
> > So, I've been playing around a bit with the TN thing.  Here is
> > something interesting.. Here is a larger sample of the output from
> > gmetad:
> > telnet localhost 8651:
> > ...
> > <GRID NAME="unspecified" AUTHORITY="http://blahblah/ganglia/"
> > LOCALTIME="1072822698">
> > <CLUSTER NAME="Cluster X" LOCALTIME="1072822524" OWNER="Terascale
> > Computing Facility" LATLONG="unspecified" URL="unspecified">
> > ...
> > <HOST NAME="n0603.tcf-int.vt.edu" IP="10.1.2.175" REPORTED="1072822628"
> > TN="0" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822099">
> > <HOST NAME="n0604.tcf-int.vt.edu" IP="10.1.2.176" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0605.tcf-int.vt.edu" IP="10.1.2.177" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0606.tcf-int.vt.edu" IP="10.1.2.178" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0607.tcf-int.vt.edu" IP="10.1.2.179" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0608.tcf-int.vt.edu" IP="10.1.2.180" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0609.tcf-int.vt.edu" IP="10.1.2.181" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0610.tcf-int.vt.edu" IP="10.1.2.182" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072797006">
> > <HOST NAME="n0611.tcf-int.vt.edu" IP="10.1.2.183" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0612.tcf-int.vt.edu" IP="10.1.2.184" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0613.tcf-int.vt.edu" IP="10.1.2.185" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0614.tcf-int.vt.edu" IP="10.1.2.186" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0615.tcf-int.vt.edu" IP="10.1.2.187" REPORTED="1072822629"
> > TN="4294967295" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0616.tcf-int.vt.edu" IP="10.1.3.11" REPORTED="1072822515"
> > TN="9" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> > <HOST NAME="n0617.tcf-int.vt.edu" IP="10.1.3.12" REPORTED="1072822616"
> > TN="12" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822100">
> >
> > Those that have the funky TNs were all reported at the same time.  I
> > have a feeling it's a timing issue.
> >
> > And actually, I caught gmond doing something similar.  I had to run it a
> > few times, but I got (from telnet localhost 8649):
> > ...
> > <GANGLIA_XML VERSION="2.5.5" SOURCE="gmond">
> > <CLUSTER NAME="Cluster X" LOCALTIME="1072823227" OWNER="Terascale
> > Computing Facility" LATLONG="unspecified" URL="unspecified">
> > ...
> > <HOST NAME="n0163.tcf-int.vt.edu" IP="10.1.1.173" REPORTED="1072823228"
> > TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822082">
> > <HOST NAME="n0164.tcf-int.vt.edu" IP="10.1.1.174" REPORTED="1072823228"
> > TN="-1" TMAX="20" DMAX="0" LOCATION="unspecified"
> > GMOND_STARTED="1072822082">
> >
> > Is it possible, that because this data is so big, that it is being
> > updated while it's being reported?  I'm not too familiar with the
> > source, but if the following is happening, this could be the problem:
> > 1. gmond receives XML request from gmetad.
> > 2. gmond records current time in client->timestamp.
> > 3. gmond starts to go through the host hash, reporting tn as
> > client->timestamp - node->timestamp (where node->timestamp is REPORTED)
> > 4. gmond receives an update from a computational node after 1 second of
> > the start of the XML request, reports a negative TN?
> >
> > Also, a note.  This is a Dual Processor 1.3GHz Apple G4 XServe.  I have
> > a feeling I could run this on a DP 2.0 GHz G5 without issue, but I'd
> > rather run it on my server platform.
> > So,  if I run just gmond, it takes about 0.8 seconds to pull the XML.
> > When I run gmetad (which is eating up some process cycles,) it goes up
> > to 1.2 seconds.
> >
> > What I don't understand is that gmetad should handle this. Its check to
> > see if a host is up is tn < tmax * 4 (-1 < 80).
> > So, I added this to process_xml.c, line 447:
> > debug_msg("XXXX Host alive: cluster_localtime=%d reported=%d expr=%d "
> >           "tn=%d tmax=%d host_alive=%d",
> >           xmldata->cluster_localtime, reported, (tn < tmax * 4),
> >           tn, tmax, xmldata->host_alive);
> >
> > And I get:
> > XXXX Host alive: cluster_localtime=1072825831 reported=1072825832
> > expr=0 tn=-1 tmax=20 host_alive=0
> >
> > Now I'm baffled.  Why isn't -1 < 20 * 4 coming out as 1?
> > Sorry for my rambling. Thinking out loud, in a way.
> >
> > Any ideas on this?
> >
> > Also, on the mem_total problem I'm having, I'm not sure xdr_hyper is an
> > option.  It doesn't exist in OS X's /usr/include/rpc/xdr.h.  I might be
> > able to use xdr_bytes, but I don't know a lot about RPC/XDR.  I was
> > thinking of cheating and having it report MB in the summary RRDs, but
> > that's not really a good solution.
> >
> > I am looking forward to Ganglia 3.  One of the problems I'm having with
> > the Darwin specific metrics is the cpu_*_funcs.  It would be easy if I
> > could return user, nice, system, and idle in one function as an array of
> > values (e.g. f(10.0 0.0 5.0 85.0)).  The trick is figuring out how to
> > split them up.
> >
> > Also, I haven't checked in a while, but I think my baseline network
> > usage was about 80KB/s while running Ganglia.  Reducing that would be
> > nice on the monitoring nodes.
> >
> > On Tuesday, December 30, 2003, at 08:44 AM, [EMAIL PROTECTED] wrote:
> >
> > > Sweet to hear you are running Ganglia on the G5 cluster. Strange about
> > > the TN figure, looks like a signed-unsigned int issue. I'll have a look
> > > at the code when I get back from my holiday vacation.
> > >
> > > Definitely send the patches when you get them in order.
> > >
> > > -Federico
> >
> >
> >
> > -------------------------------------------------------
> > This SF.net email is sponsored by: IBM Linux Tutorials.
> > Become an expert in LINUX or just sharpen your skills.  Sign up for IBM's
> > Free Linux Tutorials.  Learn everything from the bash shell to sys admin.
> > Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click
> > _______________________________________________
> > Ganglia-developers mailing list
> > Ganglia-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/ganglia-developers
> --
> /------------------------------------------------------------------\
> |  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
> |  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
> |  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
> |  Upton, NY 11973-5000                                            |
> \------------------------------------------------------------------/
