I just discovered the conversation about collectl in a list archive and thought I'd jump in. When I first wrote collectl over 10 years ago, we felt we needed a more powerful and flexible tool than sar to work with our High Performance customers at HP. For example, we needed to record many more types of information than sar does, such as InfiniBand and Lustre file system statistics. How about IPMI data such as temperatures or fan speeds? Power consumption? Anybody remember the Quadrics interconnect? Collectl does that too, but there's a whole lot more to collectl than just the types of data it collects.
Rather than repeating what's on the website, http://collectl.sourceforge.net/, you can read about some of the features yourselves. Suffice it to say it runs on some of the world's largest clusters, sampling hundreds of data points every 10 seconds while using < 0.1% of a CPU. Beyond that, there are 2 utilities that make it even more useful: http://collectl-utils.sourceforge.net/. colplot lets you produce high-resolution plots for dozens (or more) of nodes via a browser. colmux lets you monitor hundreds of nodes in real time from a single window, much like top. But unlike top, which only shows top processes, colmux can do that as well as show top-anything! At least anything collectl can report. For example, if you had dozens of servers, each with dozens of disks, you could use colmux to find the disks with the longest wait times. Or how about the systems with the highest temps? Anyhow, check it out and see for yourself.
-mark