Hi,
I am a colleague of Mike Hom, who recently submitted a fix for "kstat
cpu_info".
I've recently started playing around with Ganglia, and I must say it's
really nice!
On Solaris, we've run into a case where gmond dumps core when the
"on-line" CPUs are not numbered sequentially.
In the code segment below (file: gmond/machines/solaris.c), you can see that
it calls the function "p_online" to retrieve the status of each CPU.
    /* Modified by Robert Petkus <[EMAIL PROTECTED]>
     * Get stats only for online CPUs. Previously, gmond segfaulted if
     * the CPUs were not numbered sequentially; i.e., cpu0, cpu2, etc.
     * Tested on 64 bit Solaris 8 and 9 with GCC 3.3 and 3.3.2
     */
    for (i = 0; cpu_id > 0; i++)
    {
        if (p_online(i, P_STATUS) == -1 && errno == EINVAL) continue;
It skips to the next iteration if the CPU id does not exist on the machine
(return val == -1 with errno set to EINVAL).
This check guards against referencing a non-existent CPU id.
A simple scenario for this would be CPUs installed in slots that are
non-sequential (e.g. Slot 1 and then Slot 3, skipping Slot 2).
However, this check only covers the scenario where the CPU is unavailable
(not installed).
More recent hardware has a feature called "EscOD control" that hot-swaps
CPUs into the OS under high load. It simply toggles CPUs online and
offline under appropriate circumstances.
The problem is that this toggling sometimes leaves gaps in the sequence of
CPUs that are online, so the IDs of the on-line CPUs are no longer
contiguous.
Even if a CPU is offline, it is still reported as being available, since
the hardware actually exists: it's just not being used! So we've just hit
another case where the CPU configuration breaks the gmond daemon.
Here is the CPU config that broke gmond on Solaris with 16 CPUs, out of
which just 12 are online.
[EMAIL PROTECTED]:/opt/ganglia/bin> psrinfo
0 on-line since 04/13/2004 00:29:06
1 on-line since 04/13/2004 00:29:08
2 on-line since 04/13/2004 00:29:08
3 on-line since 04/13/2004 00:29:08
4 on-line since 04/13/2004 00:29:08
5 on-line since 04/13/2004 00:29:08
6 on-line since 04/13/2004 00:29:08
7 on-line since 04/13/2004 00:29:08
8 off-line since 04/20/2004 03:30:32 << problem
9 on-line since 05/13/2004 08:10:28
10 on-line since 04/30/2004 09:10:23
11 on-line since 04/22/2004 12:55:59
12 on-line since 04/22/2004 11:20:13
13 off-line since 05/13/2004 12:40:27
14 off-line since 05/19/2004 15:10:32
15 off-line since 05/20/2004 08:35:19
Notice that the sequence of on-line CPUs is broken at CPU id 8.
The for loop will iterate exactly 12 times, since its counter is the number
of CPUs that are on-line.
When the loop reaches CPU id 8, it will segfault because the CPU is
available but its stats cannot be retrieved.
I've made a small change to the code to skip the iteration if the CPU is
offline and move on to the next CPU id. This fixes my problem for the time
being, but there might be other cases where a different CPU status causes
problems.
    /* Modified by Robert Petkus <[EMAIL PROTECTED]>
     * Get stats only for online CPUs. Previously, gmond segfaulted if
     * the CPUs were not numbered sequentially; i.e., cpu0, cpu2, etc.
     * Tested on 64 bit Solaris 8 and 9 with GCC 3.3 and 3.3.2
     */
    for (i = 0; cpu_id > 0; i++)
    {

        /**
         * this checked for unavailable CPU's but offline CPU's
         * need to be considered as well.
         * Status id 1 means P_OFFLINE
         */
        int n = p_online(i, P_STATUS);
        if (n == 1) continue; // skip if CPU is offline
        if (n == -1 && errno == EINVAL) continue;
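To make the effect of the extra check easier to see without a Solaris box,
here is a rough sketch that replays the psrinfo listing above against the
same skip logic. Everything in it is a stand-in of mine: sim_status() plays
the role of p_online(i, P_STATUS), the SIM_ constants copy the status values,
and the upper bound on the id is an extra guard that is not in the gmond
code. It is an illustration, not the gmond source.

```c
/* Status values copied from the identifiers quoted in this message. */
#define SIM_P_OFFLINE 1
#define SIM_P_ONLINE  2

/* The 16-CPU configuration from the psrinfo output, indexed by CPU id. */
static const int psrinfo_table[16] = {
    SIM_P_ONLINE,  SIM_P_ONLINE,  SIM_P_ONLINE,  SIM_P_ONLINE,   /* 0-3   */
    SIM_P_ONLINE,  SIM_P_ONLINE,  SIM_P_ONLINE,  SIM_P_ONLINE,   /* 4-7   */
    SIM_P_OFFLINE, SIM_P_ONLINE,  SIM_P_ONLINE,  SIM_P_ONLINE,   /* 8-11  */
    SIM_P_ONLINE,  SIM_P_OFFLINE, SIM_P_OFFLINE, SIM_P_OFFLINE   /* 12-15 */
};

/* Hypothetical stand-in for p_online(id, P_STATUS): reads the status from
 * a table instead of asking the kernel, and returns -1 for unknown ids. */
static int sim_status(const int *table, int table_len, int id)
{
    if (id < 0 || id >= table_len)
        return -1;
    return table[id];
}

/* Walk CPU ids upward from 0, skipping ids that are missing or offline,
 * until `wanted` online CPUs have been recorded in `out`.
 * Returns how many ids were recorded. */
static int collect_online_ids(const int *table, int table_len,
                              int wanted, int *out)
{
    int found = 0;
    int id;

    for (id = 0; found < wanted && id < table_len; id++) {
        int n = sim_status(table, table_len, id);
        if (n == -1)
            continue;               /* CPU id does not exist */
        if (n == SIM_P_OFFLINE)
            continue;               /* CPU exists but is offline */
        out[found++] = id;          /* gmond would read the stats here */
    }
    return found;
}
```

Calling collect_online_ids(psrinfo_table, 16, 12, out) fills out with ids
0 through 7 and 9 through 12, so id 8 is never touched.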
FYI, here are the CPU status identifiers:
/usr/lib/sys/processor.h
#define P_OFFLINE  1 /* processor is offline, as quiet as possible */
#define P_ONLINE   2 /* processor online */
#define P_STATUS   3 /* value passed to p_online to request status */
#define P_BAD      4 /* unused so far but defined by USL */
#define P_POWEROFF 5 /* processor is powered off */
#define P_NOINTR   6 /* processor online, but no I/O interrupts */
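Given these identifiers, one possible way to cover the "other cases" I
mentioned would be to test for the statuses we do want instead of the ones
we don't. The sketch below is only an idea, not something I have run
against gmond: the function name and the SIM_ enum are made up for the
example, and treating P_NOINTR as a CPU whose stats can be read is my
assumption, based only on its comment saying the processor is online.

```c
/* Status values copied from the identifiers listed above; the SIM_
 * prefix marks them as local copies, not the real header. */
enum sim_cpu_status {
    SIM_OFFLINE  = 1,
    SIM_ONLINE   = 2,
    SIM_STATUS   = 3,
    SIM_BAD      = 4,
    SIM_POWEROFF = 5,
    SIM_NOINTR   = 6
};

/* Hypothetical helper: returns 1 if a CPU with this status should be
 * polled for stats, 0 otherwise. Anything that is not explicitly an
 * online state (including -1 from a failed lookup) is skipped. */
static int sim_cpu_has_stats(int status)
{
    switch (status) {
    case SIM_ONLINE:
    case SIM_NOINTR:   /* assumption: online, just not taking interrupts */
        return 1;
    default:
        return 0;
    }
}
```

With something like this, the loop would only need a single check of the
form "if (!sim_cpu_has_stats(n)) continue;", and a powered-off CPU would be
skipped the same way as an offline one.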
Yes, as you can see, we are deploying Ganglia on big machines with
not-so-friendly CPU configs. =)
Thanks!
JB