Richard, [adding ganglia-developers for comments]

 that is a pretty good explanation of what is likely happening, and of
what can go wrong. I sent Eli a patch I found useful a while ago, but
which is not in CVS yet (because I fixed the root problem of the
illegal updates). It should prevent gmetad from ignoring all
hosts/metrics if just one of them is corrupt. Somewhere in the code we
give up on everything else after an error return.

[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c       2005-03-15 19:11:33.000000000 +0100
+++ rrd_helpers.c-new   2006-03-30 11:28:26.000000000 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );

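To illustrate why the return value matters, here is a minimal Python sketch. It is not gmetad code: `rrd_update`, `process_hosts`, the `corrupt` set and the `patched` switch are all made up for illustration, and the caller that stops on any non-zero return is an assumption about the behaviour described in this thread.

```python
def rrd_update(host, corrupt, patched):
    """Stand-in for RRD_update(): fails for hosts in `corrupt`.

    Unpatched, a failure is reported as 1; patched, the error is only
    logged and 0 is returned, as in the diff above.
    """
    if host in corrupt:
        print(f"RRD_update ({host}): update failed")
        return 0 if patched else 1
    return 0


def process_hosts(hosts, corrupt, patched):
    """Hypothetical caller that abandons the pass on any error return."""
    visited = []
    for host in hosts:
        if rrd_update(host, corrupt, patched) != 0:
            break  # everything after the failing host is skipped
        visited.append(host)
    return visited


print(process_hosts(["a", "b", "c"], {"b"}, patched=False))
print(process_hosts(["a", "b", "c"], {"b"}, patched=True))
```

With the error swallowed, the loop reaches the hosts after the failing one; the failing host itself is still not updated.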

--- [EMAIL PROTECTED] wrote:

> Eli,
> 
> Martin is most surely right. If you are running an unpatched 3.0.2,
> let me share with you the many ways it can all go wrong.
> 
> gmond generates the hostnames found in the XML stream by reverse DNS
> lookup only. Its internal structures treat every different IP address
> it sees as a different host, regardless of what the reverse DNS entry
> is.
> 
> So, if you have
> 1) Incorrect reverse DNS entries such that 2 different hosts reverse
>    map to the same hostname,
> 2) Or 2 NICs on a host that are not teamed (i.e. 2 different
>    addresses) and the routing allows packets to exit either NIC,
>    hence either source address may be used.
> 3) Or a DHCP lease renewal that results in a host changing IP
>    addresses.
> 
> Then what will happen is that the XML stream from the cluster will
> contain 2 (or more) entries with different IP addrs, but the same
> name. Even in the DHCP case when only 1 source address is used at a
> time, gmond will keep the old IP address entry until a timeout, even
> though it is not being updated. So dups arise again.
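
One way to check for this condition is to look for HOST entries that share a NAME but differ in IP. The sketch below is only an illustration: `find_duplicate_hostnames` and the sample XML are made up, and it assumes the gmond output has been saved as a string whose HOST elements carry NAME and IP attributes, as in the dump further down this thread.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_duplicate_hostnames(xml_text):
    """Return {hostname: sorted IPs} for names seen with more than one IP."""
    ips_by_name = defaultdict(set)
    # iter() walks the whole tree, so the wrapping elements do not matter
    for host in ET.fromstring(xml_text).iter("HOST"):
        ips_by_name[host.get("NAME")].add(host.get("IP"))
    return {name: sorted(ips)
            for name, ips in ips_by_name.items() if len(ips) > 1}


# Invented sample: node1 shows up under two source addresses.
SAMPLE = """<GANGLIA_XML>
  <CLUSTER NAME="example">
    <HOST NAME="node1" IP="10.0.0.1"/>
    <HOST NAME="node1" IP="10.0.0.2"/>
    <HOST NAME="node2" IP="10.0.0.3"/>
  </CLUSTER>
</GANGLIA_XML>"""

print(find_duplicate_hostnames(SAMPLE))
```

Any name reported here is a candidate for the double-update problem described next.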
> 
> Now unfortunately, gmetad only uses the HOSTNAME for the RRD files
> and its own processing. So if there is a duplicated hostname in the
> XML stream, it will update the RRDs after parsing the first entry,
> and then again after parsing the second. As these 2 updates to the
> same RRD files will occur in less than one second, this results in
> an RRD update error.
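
The effect can be sketched in a few lines of Python. This is a toy model, not gmetad or rrdtool code: `apply_updates` and its data are invented, and the only rule modelled is that an RRD rejects an update whose timestamp is not newer than its last one at one-second resolution.

```python
def apply_updates(entries, now):
    """Apply one poll's worth of (hostname, ip) entries at time `now`.

    RRDs are keyed by hostname only, so a second entry with the same
    name hits the same file within the same second and is rejected.
    Returns the list of error messages.
    """
    last_update = {}  # hostname -> timestamp of last accepted update
    errors = []
    for hostname, ip in entries:
        if last_update.get(hostname, now - 1) >= now:
            errors.append(
                f"{hostname} ({ip}): update at {now} not newer than last update")
            continue
        last_update[hostname] = now
    return errors


entries = [("node1", "10.0.0.1"), ("node1", "10.0.0.2"), ("node2", "10.0.0.3")]
for line in apply_updates(entries, 1143682835):
    print(line)
```

Only the second node1 entry fails; what the caller does with that failure is the subject of the next paragraph.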
> 
> On unpatched 3.0.2, this then causes THE ENTIRE PROCESSING OF THE
> CLUSTER TO BE ABANDONED. So some hosts get updated, some not, and
> the cluster view does not get updated. If you patch this particular
> issue, you will still get double processing for duped hosts, which
> can result in them erroneously being reported as down (for example).
> 
> phew.
> long mail.
> 
> - richard
> 
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Martin Knoblauch
> Sent: 30 March 2006 08:05
> To: Eli Stair
> Cc: [EMAIL PROTECTED]
> Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts
> that are proper in gmond XML
> 
> 
> Eli,
> 
>  yup. That could definitely cause problems. Do you see anything in
> the /var/log/messages of the gmetad host?
> 
>  Hmm. You may have to restart *all* gmonds, as well as the gmetad.
> This is something that I usually do when my ganglia setup was hosed
> somehow. Definitely the case for multicast clusters. Not really sure
> about unicast.
> 
>  And yes - this is not optimal.
> 
> --- Eli Stair <[EMAIL PROTECTED]> wrote:
> 
> > 
> > The only issue I can find at all with this config is that the new
> > hosts have been deployed by someone with two PTR records, both the
> > proper one pointing to the A hostname, as well as all having an
> > improper PTR -> linux."FQDN".
> > 
> > Is there a potential that gmetad is doing a lookup of both the
> > forward and reverse entries for a host before populating it?
> > Unfortunately removing the invalid entry for a host and restarting
> > gmetad as well as the gmond aggregator and the host did not resolve
> > it.
> > 
> > /eli
> > 
> > Eli Stair wrote:
> > > 
> > > My installation started having an issue yesterday afternoon that I
> > > have yet to explain or remedy.  One cluster that I have unicasting
> > > has started "losing" hosts... the directory entries on disk never
> > > get created for newly deployed hosts, and gmond reports receiving
> > > messages for the host (and outputs metrics) but gmetad does not
> > > report an "updating host" message, and never creates the RRD's even
> > > though the host is up.
> > > 
> > > The critical problem is that the report graphs for this cluster
> > > have stopped being updated as well, which nix'es my ability to view
> > > cluster load/job level... in addition to not being able to alert on
> > > the RRD values for the individual hosts that are malfunctioning.
> > > Those hosts that are "good" continue to update their metric RRD's
> > > properly, their host reports are populated etc.  The bad ones I
> > > cannot explain...
> > > 
> > > The two questions, if anyone has insight:
> > > 
> > > 1) What is causing gmetad to stop acting on the gmond XML input
> > > that it has available?  I don't see any error or threshold it's
> > > hitting WRT the hosts, they just don't create/update the RRD.
> > > 
> > > 2) Why does the report stop being populated (the graph is still
> > > generated with past data, but not updated with new... not even the
> > > data from hosts that ARE functioning individually)?
> > > 
> > > I'm continuing on with this, will update with anything else I find
> > > awry.  Any suggestions on what to pursue beyond this are welcome...
> > > at this point it looks to me like a problem with the magic in
> > > gmetad's parsing of the gmond output, since the output is present
> > > and up-to-date but gmetad is not acting on it.
> > > 
> > > Cheers,
> > > 
> > > /eli
> > > 
> > > 
> > > Here are the details:
> > > 
> > > server:
> > > ganglia 3.0.2 (x86_64)
> > > 6 (six) multicast clusters polled by gmetad
> > > 1 (one) unicast cluster, reporting to a 'mute' gmond aggregating on
> > > the same host as gmetad.
> > > 
> > > clients:
> > > suse9.3 x86_64
> > > ganglia 3.0.2 (x86_64)
> > > 
> > > 
> > > Debug logged info (-d2):
> > > 
> > > Bad host:
> > > 
> > >   Apache error_log for bad host:
> > >     ERROR: opening
> > > '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd':
> > > No such file or directory
> > > 
> > >   gmond:
> > >     Processing a Ganglia_message from badhost
> > >   gmetad:
> > >     server_thread() received request
> > > "/Opteron_Production-Desktop_Droid_Cluster/badhost" from 127.0.0.1
> > > 
> > >   XML:
> > > <HOST NAME="badhost" IP="10.65.34.22" REPORTED="1143682835" TN="4" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143677550">
> > > <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="1688" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="disk_free" VAL="57.776" TYPE="double" UNITS="GB" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_speed" VAL="2612" TYPE="uint32" UNITS="MHz" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="part_max_used" VAL="52.7" TYPE="float" UNITS="" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="boottime" VAL="1143590767" TYPE="uint32" UNITS="s" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_user" VAL="93.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_system" VAL="0.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="load_one" VAL="2.03" TYPE="float" UNITS="" TN="68" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="8" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="proc_total" VAL="128" TYPE="uint32" UNITS="" TN="8" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_free" VAL="1328356" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_shared" VAL="0" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_buffers" VAL="199232" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_cached" VAL="4569200" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="swap_free" VAL="2101964" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="gexec" VAL="ON" TYPE="string" UNITS="" TN="188" TMAX="300" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="bytes_out" VAL="6066.85" TYPE="float" UNITS="bytes/sec" TN="8" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="bytes_in" VAL="203006.30" TYPE="float" UNITS="bytes/sec" TN="8" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="numthreads" VAL="2" TYPE="int8" UNITS="" TN="324" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > > <METRIC NAME="numjobs" VAL="2" TYPE="int8" UNITS="" TN="324" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > > </HOST>
> > > 
> > > 
> > > Good host:
> > > 
> > >   gmond:
> > >     Processing a Ganglia_message from goodhost
> > >   gmetad:
> > >     Updating host goodhost, metric numjobs
> > >     server_thread() received request
> > > "/Opteron_Production-Desktop_Droid_Cluster/goodhost" from 127.0.0.1
> > >   XML:
> > > <HOST NAME="goodhost" IP="10.73.16.225" REPORTED="1143682838" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143137198">
> > > <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="2039" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="disk_free" VAL="46.667" TYPE="double" UNITS="GB" TN="178" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_speed" VAL="2411" TYPE="uint32" UNITS="MHz" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="part_max_used" VAL="70.5" TYPE="float" UNITS="" TN="178" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="boottime" VAL="1142553979" TYPE="uint32" UNITS="s" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_user" VAL="73.1" TYPE="float" UNITS="%" TN="8" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_system" VAL="3.9" TYPE="float" UNITS="%" TN="8" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="load_one" VAL="1.99" TYPE="float" UNITS="" TN="9" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="149" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="proc_total" VAL="156" TYPE="uint32" UNITS="" TN="149" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_free" VAL="2359176" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_shared" VAL="0" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_buffers" VAL="36384" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_cached" VAL="4162056" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="swap_free" VAL="1786428" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="gexec" VAL="ON" TYPE="string" UNITS="" TN="229" TMAX="300" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="bytes_out" VAL="305162.19" TYPE="float" UNITS="bytes/sec" TN="28" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="bytes_in" VAL="40802.30" TYPE="float" UNITS="bytes/sec" TN="28" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="numthreads" VAL="1" TYPE="int8" UNITS="" TN="844" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > > <METRIC NAME="numjobs" VAL="1" TYPE="int8" UNITS="" TN="844" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > > </HOST>
> > > 
> > > 
> > 
> > 
> > 
> > _______________________________________________
> > Ganglia-general mailing list [EMAIL PROTECTED]
> > https://lists.sourceforge.net/lists/listinfo/ganglia-general
> > 
> > 
> 
> 
> ------------------------------------------------------
> Martin Knoblauch
> email: k n o b i AT knobisoft DOT de
> www:   http://www.knobisoft.de
> 
> 
> 
> 
> 
> 
> 


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
