I am very hungry and am going to go get a burrito.
I suspect some state is not being reset between loop iterations, because
check out what my copious print statements in gmetad tell me:
RRD_update(): error expected 1 data source readings (got 0) from
/www/gmetad/rrds/SOME_CLUSTER/SOME_HOST/cpu_idle.rrd:... updating
/www/gmetad/rrds/SOME_CLUSTER/SOME_HOST/cpu_idle.rrd with value N:0.6
process_xml.c: Call to
write_data_to_rrd(SOME_CLUSTER,SOME_HOST,cpu_idle,0.6) was nonzero ...
RRD_update(): error expected 1 data source readings (got 0) from
/www/gmetad/rrds/SOME_CLUSTER/SOME_HOST/cpu_idle.rrd:... updating
/www/gmetad/rrds/SOME_OTHER_CLUSTER/SOME_OTHER_HOST/mem_cached.rrd with
value N:0
process_xml.c: Call to
write_data_to_rrd(SOME_OTHER_CLUSTER,SOME_OTHER_HOST,mem_cached,0) was
nonzero ...
So first of all, 0.6 is not a zero value. That's a little freaky. Second,
the second value passed *is* zero, but RRD_update() is returning the same
error message, still naming cpu_idle.rrd. So either the error string wasn't
reset, or the string passed to rrd_update() wasn't updated. This data
appears to be shared between the two threads. And that just doesn't make
sense.
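
To illustrate the suspicion above, here is a minimal sketch of how a stale
error message could show up on a later, successful call. It ASSUMES the
error text lives in one process-wide buffer that nobody clears between
calls; that has not been confirmed for the RRD library. Every name in it
(fake_update, fake_get_error, fake_clear_error, write_unsafe,
write_guarded) is a hypothetical stand-in, not gmetad or librrd code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a library that keeps ONE process-wide error
 * buffer. This is not librrd; it only mimics the suspected failure mode. */
static char g_error[256];

static void fake_clear_error(void) { g_error[0] = '\0'; }
static const char *fake_get_error(void) { return g_error; }

/* Hypothetical update call: fails (returns -1) when no value is supplied. */
static int fake_update(const char *path, const char *value)
{
    assert(path != NULL);
    if (value == NULL || value[0] == '\0') {
        snprintf(g_error, sizeof g_error,
                 "expected 1 data source readings (got 0) from %s", path);
        return -1;
    }
    return 0; /* success: note the error buffer is left untouched */
}

/* Suspected buggy pattern: decide success by looking at the error buffer
 * without clearing it first, so old text counts as a new failure. */
static int write_unsafe(const char *path, const char *value)
{
    fake_update(path, value);
    return fake_get_error()[0] != '\0';
}

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

/* Guarded pattern: serialize callers, clear the buffer before the call,
 * and copy the message out while still holding the lock. */
static int write_guarded(const char *path, const char *value,
                         char *err, size_t errlen)
{
    int rc;

    pthread_mutex_lock(&g_lock);
    fake_clear_error();
    rc = fake_update(path, value);
    snprintf(err, errlen, "%s", fake_get_error());
    pthread_mutex_unlock(&g_lock);
    return rc != 0;
}
```

With write_unsafe(), a failed cpu_idle write followed by a perfectly good
mem_cached write reports the cpu_idle message both times, which matches
the shape of the output above. Whether that is what is really going on in
gmetad is still a guess.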
I'm also noticing a couple of RRD writes failing due to locking issues.
But those are fairly few and far between.
This error, btw, trips the xml_data.rval flag and causes no further RRDs to
be updated in this pass. Bummer. But not necessarily the same thing as
what's causing ALL my sources to intermittently "die."
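
A rough sketch of the effect described above: once the flag is set, every
remaining metric in the pass is skipped. Only the name rval comes from
gmetad; the struct layout, the counters, write_metric() and process_pass()
are hypothetical and exist only to show the control flow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for gmetad's xml_data. */
struct xml_data_sketch {
    int rval;     /* nonzero once any write has failed */
    int written;  /* writes that succeeded */
    int skipped;  /* metrics never attempted after the first failure */
};

/* Hypothetical writer: fails (returns nonzero) for negative values. */
static int write_metric(double value)
{
    return value < 0.0;
}

/* One pass over the metrics: the first failure sets rval, and every
 * later metric in the same pass is skipped. */
static void process_pass(struct xml_data_sketch *xd,
                         const double *values, size_t n)
{
    size_t i;

    assert(xd != NULL && values != NULL);
    for (i = 0; i < n; i++) {
        if (xd->rval) {
            xd->skipped++;
            continue;
        }
        if (write_metric(values[i]))
            xd->rval = 1;
        else
            xd->written++;
    }
}
```

So one bogus failure early in a pass is enough to starve every RRD that
comes after it, at least under these assumptions.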
Anyway, I'm just sharing here before I leave today. If someone wants to
take this up before I do in the morning, have at it. :)