On Tue, Aug 26, 2008 at 04:35:51PM +0100, [EMAIL PROTECTED] wrote:
> 
> Has anyone seen a crash like this in gmetad:
> 
> (gdb) bt
> #0  0x08056e74 in write_RRA_row ()
> #1  0x0805808e in _rrd_update ()
> #2  0x08059602 in rrd_update_r ()
> #3  0x080596d3 in rrd_update ()

Everything from here up to the crash is inside rrdtool (statically linked
into gmetad). Which version of rrdtool was used to build this gmetad, and
could you rebuild it against a newer version to see if the problem goes away?

> Problem observed in gmetad 3.0.4.  I've seen this more than once, and
> wouldn't mind finding out what causes it.

gmetad 3.1.x has several fixes that could be useful too (some of them are
scheduled for backport in 3.0.8 when it is released), so if you are going to
rebuild gmetad anyway, it might be a good idea to upgrade it as well.

> Examining it with gdb, I found the XML being processed:
> 
> ...
> <HOST NAME=\"testserver\" IP=\"x.x.x.x\" REPORTED=\"1219754167\"
> TN=\"0\" TMAX=\"20\" DMAX=\"172800\" LOCATION=\"\"
> GMOND_STARTED=\"1219495176\">
> ...
> <METRIC NAME=\"proc_total\" VAL=\"62\" TYPE=\"uint32\" UNITS=\"\"
> TN=\"343\" TMAX=\"950\" DMAX=\"0\" SLOPE=\"both\" SOURCE=\"gmond\"/>

An empty UNITS attribute. That is no longer the case starting with 3.0.5,
although AFAIK it was never tied to a gmetad crash and was changed mainly for
aesthetic reasons in the frontend.

> I then looked at the last write of the metric file:
> 
> date -r proc_total.rrd +%s
> 1219754157
> 
> and the most recent successful write to disk before the crash:
> 
>  find /var/lib/ganglia/rrds/ -type f -name '*rrd' -exec date -r '{}' +%s
> \; | sort | tail -1
> 1219754168
> 
> Therefore, the time of the crash was at or after 1219754168

That is 11 seconds after proc_total.rrd was last written, so that metric is
very likely not the direct cause of the failure. Which file was the last one
to be updated?
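
A quick way to answer that is to print the path next to each timestamp
instead of only the timestamp. This is just a sketch: it assumes GNU find
(for -printf), and newest_rrds is a made-up helper name, not something that
ships with ganglia.

```shell
# newest_rrds DIR [COUNT]: print "mtime path" for the COUNT most recently
# modified *.rrd files under DIR, oldest first (hypothetical helper).
newest_rrds() {
    dir=$1
    count=${2:-5}
    # %T@ is the mtime in seconds since the epoch (with a fractional part),
    # %p is the file path; sort -n orders the lines by that timestamp.
    find "$dir" -type f -name '*.rrd' -printf '%T@ %p\n' |
        sort -n | tail -n "$count"
}
```

Running "newest_rrds /var/lib/ganglia/rrds 5" should then show the last few
files gmetad managed to write before it died, with the newest on the last
line.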

> I'll keep a core file and logs around in case anyone has suggestions to
> investigate this crash.

If you can reproduce the crash easily, it would probably be a good idea to
get an unstripped gmetad so that we get more information out of the core (you
can also attach symbols to a stripped binary, but you have to extract the
symbols while building it).
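
For the second option, something along these lines usually works with GNU
binutils. Treat it as a sketch: it assumes gmetad was compiled with -g, and
split_debug is only an illustrative name.

```shell
# split_debug BINARY: save BINARY's debug info to BINARY.debug, strip the
# binary itself, and record a link so gdb can find the symbols again
# (hypothetical helper; needs objcopy and strip from GNU binutils).
split_debug() {
    bin=$1
    # Copy only the debugging sections into a separate file.
    objcopy --only-keep-debug "$bin" "$bin.debug" &&
    # Remove the debugging sections from the binary that gets installed.
    strip --strip-debug "$bin" &&
    # Store the name of the debug file in a .gnu_debuglink section.
    objcopy --add-gnu-debuglink="$bin.debug" "$bin"
}
```

With gmetad.debug kept next to the binary, "gdb gmetad core" should pick the
symbols up on its own. Otherwise they can be loaded by hand with
"symbol-file gmetad.debug" inside gdb. The debug file has to come from the
same build as the binary that produced the core.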

Since the problem is happening inside rrdtool, in code that has changed a lot
since then, it might also be a good idea to see if you can reproduce it with
the latest released code, as it might already have been fixed there
accidentally.

Carlo

_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
