Because of a still-unexplained bug in the kernel we are running, the counters reported in /proc/stat suddenly jumped the other day. This exposed a bug in gmond that causes it to report bogus cpu_* numbers. The cause is that total_jiffies_func adds up the four CPU-state counters (32-bit longs) and returns the result as a plain 32-bit long, so the sum can overflow. Even without the problem in our kernel, it would eventually have overflowed after 248 days of uptime (497 days on a single-CPU system).

This isn't a big bug, but it is not unreasonable for some nodes in a cluster to reach an uptime of 248 days. I would suggest at least changing the return type of total_jiffies_func to long long. That would work for now, but it wouldn't be good enough on recent 2.5 kernels and above, since they have started using 64-bit jiffies and this function returns the sum of four of those numbers. Maybe it should sum them and return a double.
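To illustrate the overflow, here is a minimal sketch in C. It is not gmond's actual code: struct cpu_jiffies, total_jiffies_32 and total_jiffies_64 are names made up for this example, and fixed-width types stand in for the 32-bit longs so the behaviour is the same on any platform.

```c
#include <stdint.h>

/* Hypothetical stand-in for the four CPU-state counters read from
 * /proc/stat; the field names are assumptions for this example. */
struct cpu_jiffies {
    uint32_t user;
    uint32_t nice;
    uint32_t system;
    uint32_t idle;
};

/* Sum carried out and returned in 32 bits: the result wraps
 * around modulo 2^32 once the true total exceeds 4294967295. */
uint32_t total_jiffies_32(const struct cpu_jiffies *c)
{
    return c->user + c->nice + c->system + c->idle;
}

/* Widen the first operand to 64 bits before adding, so the whole
 * sum is computed in 64 bits and four 32-bit counters cannot
 * overflow it. */
uint64_t total_jiffies_64(const struct cpu_jiffies *c)
{
    return (uint64_t)c->user + c->nice + c->system + c->idle;
}
```

With user = 4000000000, nice = 100000000, system = 100000000 and idle = 200000000, the true total is 4400000000. total_jiffies_32 returns 105032704 (the total modulo 2^32), while total_jiffies_64 returns the full value. The same widening trick does not help once the individual counters are themselves 64-bit, and a double holds integers exactly only up to 2^53, so returning a double would trade exactness for range.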
~Jason

--
/------------------------------------------------------------------\
| Jason A. Smith                          Email: [EMAIL PROTECTED] |
| Atlas Computing Facility, Bldg. 510M    Phone: (631)344-4226     |
| Brookhaven National Lab, P.O. Box 5000  Fax:   (631)344-7616     |
| Upton, NY 11973-5000                                             |
\------------------------------------------------------------------/