Because of a still-unexplained bug in the kernel we are running, the
numbers reported in /proc/stat suddenly jumped the other day.  This
exposed a bug in gmond that caused it to report bogus cpu_* numbers.
The problem is that total_jiffies_func adds up the 4 cpu states (32-bit
longs) and returns the result as a plain 32-bit long.  Even without the
problem in our kernel, this sum would eventually have overflowed after
248 days of uptime (497 days on a single-cpu system).  This isn't a big
bug, but it is not unreasonable for some nodes in a cluster to have an
uptime of 248 days.  I would suggest at least changing the return type
of total_jiffies_func to long long.  That would work now, but it
wouldn't be good enough on recent 2.5 kernels and above, since they
have started using 64-bit jiffies and this function returns the sum of
four of these numbers.  Maybe it should sum them and return a double.
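
To make the overflow concrete, here is a minimal sketch, not the actual
gmond code.  The struct and both function names are made up for
illustration, and I use unsigned 32-bit counters so that the wraparound
is well defined in C (overflow of a signed long is not).

```c
#include <stdint.h>

/* Hypothetical stand-in for the four per-state counters from /proc/stat. */
struct cpu_jiffies {
    uint32_t user, nice, system, idle;
};

/* Sum kept in 32 bits: wraps modulo 2^32 once the true total exceeds it. */
uint32_t total_jiffies_32(const struct cpu_jiffies *c)
{
    return c->user + c->nice + c->system + c->idle;
}

/* Operands widened before adding: four 32-bit values always fit in 64 bits. */
uint64_t total_jiffies_64(const struct cpu_jiffies *c)
{
    return (uint64_t)c->user + c->nice + c->system + c->idle;
}
```

With counters of 3000000000, 0, 1000000000 and 1000000000, the 64-bit
version returns 5000000000 while the 32-bit version returns 705032704.
The widened sum only helps while the individual counters are 32-bit.  A
double would not overflow with 64-bit jiffies, but it represents
integers exactly only up to 2^53.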

~Jason


-- 
/------------------------------------------------------------------\
|  Jason A. Smith                          Email:  [EMAIL PROTECTED] |
|  Atlas Computing Facility, Bldg. 510M    Phone:  (631)344-4226   |
|  Brookhaven National Lab, P.O. Box 5000  Fax:    (631)344-7616   |
|  Upton, NY 11973-5000                                            |
\------------------------------------------------------------------/

