I am seeing two major problems:

First, what exactly is the purpose of the new GRID tag?  It seems to be
a mandatory part of the new gmetad system and, more importantly, its
authority attribute is mandatory as well.  Is this correct?  It appears
that if you set up one gmetad to collect data from another gmetad that
adds this GRID tag and authority attribute, then the second gmetad only
writes summary rrds, correct?  This is a problem for us because of the
way we have been using ganglia at our computer facility and would like
to continue using it.  We would like to have an internal ganglia
monitoring host that monitors our whole facility, which spans 5 separate
experiment clusters.  The individual experiments would then get the xml
data from our gmetad collector and reproduce it on their own webservers,
without links redirecting them to our main webfrontend.  They might also
add clusters from outside of BNL that are part of their experiment.  Our
main facility gmetad would monitor everything, but its webfrontend would
not be visible outside of the BNL firewall, so passing around the
authority attribute would not work and it should be optional.  I would
suggest adding the ability to turn this new authority option off.  If
the authority is present, then the polling gmetad should assume that it
can redirect you there to find the rrd graphs, but if it is off, then
the second gmetad that polls it should reproduce all of the rrds
locally.  Does this make sense and sound reasonable?
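
To make the suggestion concrete, here is a rough Python sketch of the
decision I am proposing.  It is not how gmetad is implemented (gmetad is
written in C), and the function name, the sample XML and the URL in it
are made up for illustration.  I am assuming only that the authority
shows up as an AUTHORITY attribute on the GRID tag.

```python
import xml.etree.ElementTree as ET


def plan_rrd_writes(xml_text):
    """Map each GRID name to 'summary' or 'full' (hypothetical helper).

    'summary': an authority is advertised, so the polling gmetad keeps
               only summary rrds and links back to that authority.
    'full':    no authority, so the polling gmetad should reproduce
               all of the rrds locally.
    """
    root = ET.fromstring(xml_text)
    plan = {}
    for grid in root.iter("GRID"):
        authority = grid.get("AUTHORITY", "").strip()
        name = grid.get("NAME", "unnamed")
        plan[name] = "summary" if authority else "full"
    return plan


# Made-up sample input; the names and URL are placeholders.
SAMPLE = """<GANGLIA_XML VERSION="2.5.3" SOURCE="gmetad">
  <GRID NAME="facility" AUTHORITY="http://internal.example/ganglia/">
    <CLUSTER NAME="exp1"/>
  </GRID>
  <GRID NAME="offsite">
    <CLUSTER NAME="exp2"/>
  </GRID>
</GANGLIA_XML>"""

if __name__ == "__main__":
    print(plan_rrd_writes(SAMPLE))
```

With the authority option turned off on our facility gmetad, every grid
an experiment polls from us would fall into the "full" case.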

The second problem shows up when I do a quick hack to turn off the GRID
tag in the gmetad xml output.  In this case I think the new timestamp
patch is breaking the second collector when it tries to write to the
rrds.  When it is first started, I get a lot of errors creating and
updating the rrds.  Each time I get an error, it looks like it aborts
the xml parsing and stops creating the rrds.  After a long wait, all of
the rrds finally get created, but I still get frequent errors like:

Mar 14 11:36:51 www /usr/sbin/gmetad[32213]: RRD_update: illegal attempt
to update using time 1047659775 when last update time is 1047659775
(minimum one second step) 

These errors cause the now-famous gaps in the rrd graphs when looking at
the hour resolution.  The 2.5.2 version of gmetad does not have a
problem getting data from a gmetad with the GRID tag removed, but 2.5.3
does, so I can only assume it is related to the new timestamp patch.
Does anyone have an idea what might be wrong?  I probably won't have
time to investigate this more until next week.
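
For reference, the rule that the error message states (each update must
be at least one second later than the last one for that rrd) can be
sketched in Python as below.  This does not diagnose what 2.5.3 is doing
wrong; the class and the file name are hypothetical and only illustrate
the kind of check that would avoid sending the duplicate update.

```python
class UpdateGuard:
    """Remember the last update time per rrd file (hypothetical helper)."""

    def __init__(self):
        self._last = {}  # rrd path -> last accepted timestamp (seconds)

    def should_update(self, rrd_path, timestamp):
        """Return True only if timestamp is strictly newer than the last one."""
        last = self._last.get(rrd_path)
        if last is not None and timestamp <= last:
            # Same or older second: this is the case rejected in the log above.
            return False
        self._last[rrd_path] = timestamp
        return True


if __name__ == "__main__":
    guard = UpdateGuard()
    path = "load_one.rrd"  # placeholder file name
    for ts in (1047659775, 1047659775, 1047659776):
        print(ts, guard.should_update(path, ts))
```

Skipping the update this way would hide the error, not explain why the
same timestamp arrives twice, so it is only a starting point.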

Sorry for the long email,
~Jason

-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/

