Richard,

[adding ganglia-developers for comments]

That is a pretty good explanation of what is likely happening, or at least of what can go wrong. I sent Eli a patch a while ago that I found useful, but it is not in CVS yet (because I had fixed the root problem of the illegal updates on my side). It should prevent gmetad from ignoring all hosts/metrics when just one of them is corrupt; somewhere in the code we bail out of the whole cluster on a single error return.
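To make the failure mode concrete, here is a minimal sketch of the pattern. This is *not* the real gmetad source; process_cluster(), update_stub() and host_t are made-up names, and the host list just reuses the badhost/goodhost examples from further down the thread. The point is only that if the per-cluster loop treats a non-zero return from the RRD update as fatal, one corrupt or duplicated host stops the updates for every host after it, while logging the error and carrying on keeps the rest of the cluster alive - which is all the one-line change in the diff below really buys you:

/* Minimal sketch, NOT the actual gmetad code. Names are invented. */
#include <stdio.h>

typedef struct {
    const char *name;   /* hostname as it appears in the XML stream  */
    const char *rrd;    /* RRD file that would be updated            */
    const char *val;    /* metric value; NULL simulates a bad update */
} host_t;

/* stand-in for RRD_update(): non-zero means the update failed */
static int update_stub(const host_t *h)
{
    if (h->val == NULL) {
        fprintf(stderr, "RRD_update (%s): illegal attempt to update\n", h->rrd);
        return 1;
    }
    printf("updated %s for %s\n", h->rrd, h->name);
    return 0;
}

static void process_cluster(const host_t *hosts, int n)
{
    for (int i = 0; i < n; i++) {
        if (update_stub(&hosts[i]) != 0) {
            /* "fatal" style: returning here abandons hosts i+1..n-1,
             * which is what unpatched 3.0.2 effectively does         */
            /* return; */

            /* "tolerant" style: note the error and carry on, so one
             * duplicated/corrupt host cannot starve the whole cluster */
            continue;
        }
    }
}

int main(void)
{
    host_t cluster[] = {
        { "goodhost",  "load_one.rrd",  "1.99" },
        { "badhost",   "swap_free.rrd", NULL   },  /* the duplicated/corrupt one */
        { "goodhost2", "load_one.rrd",  "0.42" },
    };
    process_cluster(cluster, 3);
    return 0;
}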
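And since Richard's mail below points at duplicated or bogus reverse-DNS entries as the usual root cause, a quick way to see what name the resolver actually hands back for each address (which is the name gmond will end up using) is something like the following sketch. The two addresses are just the ones from Eli's XML snippets; it only reports the single name the resolver returns, so for the full set of PTR records you would still check with "host" or "dig -x":

/* revcheck.c - print the reverse-DNS name for each address given on
 * the command line. A sketch only.
 *
 *   gcc -o revcheck revcheck.c
 *   ./revcheck 10.65.34.22 10.73.16.225
 *
 * Two addresses coming back with the same name, or a name like
 * linux.<domain>, is exactly the situation Richard describes.
 */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        struct sockaddr_in sa;
        char host[256];

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        if (inet_pton(AF_INET, argv[i], &sa.sin_addr) != 1) {
            fprintf(stderr, "%s: not a valid IPv4 address\n", argv[i]);
            continue;
        }

        /* NI_NAMEREQD: fail instead of returning the numeric address
         * when there is no PTR record at all */
        int rc = getnameinfo((struct sockaddr *)&sa, sizeof(sa),
                             host, sizeof(host), NULL, 0, NI_NAMEREQD);
        if (rc != 0)
            printf("%-15s -> no PTR record (%s)\n", argv[i], gai_strerror(rc));
        else
            printf("%-15s -> %s\n", argv[i], host);
    }
    return 0;
}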
[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c       2005-03-15 19:11:33.000000000 +0100
+++ rrd_helpers.c-new   2006-03-30 11:28:26.000000000 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );

--- [EMAIL PROTECTED] wrote:
> Eli,
>
> Martin is most surely right. If you are running an unpatched 3.0.2,
> let me share with you the many ways it can all go wrong.
>
> gmond generates the hostnames found in the XML stream by reverse DNS
> lookup only. Its internal structures treat every different IP address
> it sees as a different host, regardless of what the reverse DNS entry is.
>
> So, if you have
> 1) Incorrect reverse DNS entries such that 2 different hosts reverse map
>    to the same hostname,
> 2) Or 2 NICs on a host that are not teamed (i.e. 2 different addresses)
>    and the routing allows packets to exit either NIC, hence either
>    source address may be used,
> 3) Or a DHCP lease renewal that results in a host changing IP addresses.
>
> Then what will happen is that the XML stream from the cluster will
> contain 2 (or more) entries with different IP addrs, but the same name.
> Even in the DHCP case, when only 1 source address is used at a time,
> gmond will keep the old IP address entry until a timeout, even though
> it is not being updated. So dups arise again.
>
> Now unfortunately, gmetad only uses the HOSTNAME for the RRD files and
> its own processing. So if there is a duplicated hostname in the XML
> stream, it will update the RRDs after parsing the first entry, and then
> again after parsing the second. As these 2 updates to the same RRD
> files will occur in less than one second, this results in an RRD
> update error.
>
> On unpatched 3.0.2, this then causes THE ENTIRE PROCESSING OF THE
> CLUSTER TO BE ABANDONED. So some hosts get updated, some not, and the
> cluster view does not get updated. If you patch this particular issue,
> you will still get double processing for duped hosts, which can result
> in them erroneously being reported as down (for example).
>
> phew.
> long mail.
>
> - richard
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Martin Knoblauch
> Sent: 30 March 2006 08:05
> To: Eli Stair
> Cc: [EMAIL PROTECTED]
> Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts
> that are proper in gmond XML
>
> Eli,
>
> yup. That could definitely cause problems. Do you see anything in the
> /var/log/messages of the gmetad host?
>
> Hmm. You may have to restart *all* gmonds, as well as the gmetad. This
> is something that I usually do when my ganglia setup got hosed somehow.
> That is definitely the case for multicast clusters; not really sure
> about unicast.
>
> And yes - this is not optimal.
>
> --- Eli Stair <[EMAIL PROTECTED]> wrote:
> >
> > The only issue I can find at all with this config is that the new
> > hosts have been deployed by someone with two PTR records: both the
> > proper one pointing to the A hostname, as well as all having an
> > improper PTR -> linux."FQDN".
> >
> > Is there a potential that gmetad is doing a lookup of both the
> > forward and reverse entries for a host before populating it?
> > Unfortunately, removing the invalid entry for a host and restarting
> > gmetad as well as the gmond aggregator and the host did not resolve it.
> >
> > /eli
> >
> > Eli Stair wrote:
> > >
> > > My installation started having an issue yesterday afternoon that I
> > > have yet to explain or remedy. One cluster that I have unicasting
> > > has started "losing" hosts... the directory entries on disk never
> > > get created for newly deployed hosts, and gmond reports receiving
> > > messages for the host (and outputs metrics), but gmetad does not
> > > report an "updating host" message and never creates the RRDs, even
> > > though the host is up.
> > >
> > > The critical problem is that the report graphs for this cluster
> > > have stopped being updated as well, which nixes my ability to view
> > > cluster load/job level... in addition to not being able to alert on
> > > the RRD values for the individual hosts that are malfunctioning.
> > > Those hosts that are "good" continue to update their metric RRDs
> > > properly, their host reports are populated, etc. The bad ones I
> > > cannot explain...
> > >
> > > The two questions, if anyone has insight:
> > >
> > > 1) What is causing gmetad to stop acting on the gmond XML input
> > >    that it has available? I don't see any error or threshold it's
> > >    hitting WRT the hosts; they just don't create/update the RRDs.
> > >
> > > 2) Why does the report stop being populated? (The graph is still
> > >    generated with past data, but not updated with new... not even
> > >    the data from hosts that ARE functioning individually.)
> > >
> > > I'm continuing on with this and will update with anything else I
> > > find awry. Any suggestions on what to pursue beyond this are
> > > welcome... at this point it looks to me like a problem with the
> > > magic in gmetad's parsing of the gmond output, since the output is
> > > present and up-to-date but not acted on.
> > >
> > > Cheers,
> > >
> > > /eli
> > >
> > >
> > > Here are the details:
> > >
> > > server:
> > > ganglia 3.0.2 (x86_64)
> > > 6 (six) multicast clusters polled by gmetad
> > > 1 (one) unicast cluster, reporting to a 'mute' gmond aggregating on
> > > the same host as gmetad.
> > > clients:
> > > suse9.3 x86_64
> > > ganglia 3.0.2 (x86_64)
> > >
> > >
> > > Debug logged info (-d2):
> > >
> > > Bad host:
> > >
> > > Apache error_log for bad host:
> > > ERROR: opening
> > > '/var/lib/ganglia/rrds/Opteron_Production-Desktop_Droid_Cluster/frankenstein.lucasfilm.com/swap_free.rrd':
> > > No such file or directory
> > >
> > > gmond:
> > > Processing a Ganglia_message from badhost
> > >
> > > gmetad:
> > > server_thread() received request
> > > "/Opteron_Production-Desktop_Droid_Cluster/badhost" from 127.0.0.1
> > >
> > > XML:
> > > <HOST NAME="badhost" IP="10.65.34.22" REPORTED="1143682835" TN="4" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143677550">
> > > <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="1688" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="disk_free" VAL="57.776" TYPE="double" UNITS="GB" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_speed" VAL="2612" TYPE="uint32" UNITS="MHz" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="part_max_used" VAL="52.7" TYPE="float" UNITS="" TN="128" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="boottime" VAL="1143590767" TYPE="uint32" UNITS="s" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="488" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_user" VAL="93.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_system" VAL="0.6" TYPE="float" UNITS="%" TN="27" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="load_one" VAL="2.03" TYPE="float" UNITS="" TN="68" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="8" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="proc_total" VAL="128" TYPE="uint32" UNITS="" TN="8" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_free" VAL="1328356" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_shared" VAL="0" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_buffers" VAL="199232" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_cached" VAL="4569200" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="swap_free" VAL="2101964" TYPE="uint32" UNITS="KB" TN="8" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="gexec" VAL="ON" TYPE="string" UNITS="" TN="188" TMAX="300" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="bytes_out" VAL="6066.85" TYPE="float" UNITS="bytes/sec" TN="8" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="bytes_in" VAL="203006.30" TYPE="float" UNITS="bytes/sec" TN="8" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="numthreads" VAL="2" TYPE="int8" UNITS="" TN="324" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > > <METRIC NAME="numjobs" VAL="2" TYPE="int8" UNITS="" TN="324" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > > </HOST>
> > >
> > >
> > > Good host:
> > >
> > > gmond:
> > > Processing a Ganglia_message from goodhost
> > >
> > > gmetad:
> > > Updating host goodhost, metric numjobs
> > > server_thread() received request
> > > "/Opteron_Production-Desktop_Droid_Cluster/goodhost" from 127.0.0.1
> > >
> > > XML:
> > > <HOST NAME="goodhost" IP="10.73.16.225" REPORTED="1143682838" TN="1" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1143137198">
> > > <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="disk_total" VAL="71.047" TYPE="double" UNITS="GB" TN="2039" TMAX="1200" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="disk_free" VAL="46.667" TYPE="double" UNITS="GB" TN="178" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_speed" VAL="2411" TYPE="uint32" UNITS="MHz" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="part_max_used" VAL="70.5" TYPE="float" UNITS="" TN="178" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_total" VAL="8147640" TYPE="uint32" UNITS="KB" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="swap_total" VAL="2104504" TYPE="uint32" UNITS="KB" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="boottime" VAL="1142553979" TYPE="uint32" UNITS="s" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="machine_type" VAL="x86_64" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="os_release" VAL="2.6.13.4_K8+NUMA+NV" TYPE="string" UNITS="" TN="838" TMAX="1200" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_user" VAL="73.1" TYPE="float" UNITS="%" TN="8" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="cpu_system" VAL="3.9" TYPE="float" UNITS="%" TN="8" TMAX="90" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="load_one" VAL="1.99" TYPE="float" UNITS="" TN="9" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="proc_run" VAL="2" TYPE="uint32" UNITS="" TN="149" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="proc_total" VAL="156" TYPE="uint32" UNITS="" TN="149" TMAX="950" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_free" VAL="2359176" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_shared" VAL="0" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_buffers" VAL="36384" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="mem_cached" VAL="4162056" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="swap_free" VAL="1786428" TYPE="uint32" UNITS="KB" TN="28" TMAX="180" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="gexec" VAL="ON" TYPE="string" UNITS="" TN="229" TMAX="300" DMAX="0" SLOPE="zero" SOURCE="gmond"/>
> > > <METRIC NAME="bytes_out" VAL="305162.19" TYPE="float" UNITS="bytes/sec" TN="28" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="bytes_in" VAL="40802.30" TYPE="float" UNITS="bytes/sec" TN="28" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
> > > <METRIC NAME="numthreads" VAL="1" TYPE="int8" UNITS="" TN="844" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > > <METRIC NAME="numjobs" VAL="1" TYPE="int8" UNITS="" TN="844" TMAX="60" DMAX="0" SLOPE="both" SOURCE="gmetric"/>
> > > </HOST>


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de