Hi,

I am a colleague of Mike Hom, who recently submitted a fix for "kstat cpu_info". I've recently started playing around with Ganglia, and I must say it's really nice!

On Solaris, we've run into a case where gmond would core when we had non-sequential "on-line" CPUs.

In the code segment below (file: gmond/machines/solaris.c), you can see that it calls the function "p_online" to retrieve the status of the CPUs.

  /* Modified by Robert Petkus <[EMAIL PROTECTED]>
   * Get stats only for online CPUs. Previously, gmond segfaulted if
   * the CPUs were not numbered sequentially; i.e., cpu0, cpu2, etc.
   * Tested on 64 bit Solaris 8 and 9 with GCC 3.3 and 3.3.2
   */
     for (i = 0; cpu_id > 0; i++)
     {
        if (p_online(i, P_STATUS) == -1 && errno == EINVAL) continue;

It skips to the next iteration if the CPU id it asked about doesn't exist on the machine (p_online returns -1 with errno set to EINVAL).
This check guards against referencing a non-existent CPU id.
A simple scenario for this would be CPUs installed in non-sequential slots (e.g. Slot 1 and then Slot 3, skipping Slot 2).

However, this check only caters to the scenario where a CPU is unavailable (not installed). On more recent hardware, we have this thing called "EscOD control" that hot swaps CPUs into the OS under high load. It simply toggles CPUs online and offline under appropriate circumstances. Now, the problem is that this toggling sometimes breaks up the sequence of CPUs that are online, so the IDs of the on-line CPUs are no longer contiguous.

Even if a CPU is offline, it is still reported as being available, since the hardware actually exists: it's just not being used! So we've now hit another case where the CPU configuration breaks the gmond daemon. Here is the CPU config that broke gmond on a Solaris machine with 16 CPUs, of which just 12 are online.

[EMAIL PROTECTED]:/opt/ganglia/bin> psrinfo
0       on-line   since 04/13/2004 00:29:06
1       on-line   since 04/13/2004 00:29:08
2       on-line   since 04/13/2004 00:29:08
3       on-line   since 04/13/2004 00:29:08
4       on-line   since 04/13/2004 00:29:08
5       on-line   since 04/13/2004 00:29:08
6       on-line   since 04/13/2004 00:29:08
7       on-line   since 04/13/2004 00:29:08
8       off-line  since 04/20/2004 03:30:32    << problem
9       on-line   since 05/13/2004 08:10:28
10      on-line   since 04/30/2004 09:10:23
11      on-line   since 04/22/2004 12:55:59
12      on-line   since 04/22/2004 11:20:13
13      off-line  since 05/13/2004 12:40:27
14      off-line  since 05/19/2004 15:10:32
15      off-line  since 05/20/2004 08:35:19

Notice that the sequence of on-line CPUs is broken at CPU id 8.
The for loop will iterate exactly 12 times, since its count is the number of CPUs that are on-line. When the loop reaches CPU id 8, it will segfault because the CPU is available (so the p_online check passes) but its stats cannot be retrieved.
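
To make the failure concrete, here is a small stand-alone C sketch that replays the loop against the psrinfo table above. It is only an illustration: it does not call the real p_online or kstat, and fake_p_online, collect_ids, and the status table are hypothetical stand-ins. It also assumes that cpu_id is decremented once for every CPU whose stats are read (the excerpt does not show that part), and it adds an upper bound on i so the sketch always terminates.

```c
#include <errno.h>
#include <stddef.h>

/* Status values as defined in sys/processor.h. */
#define P_OFFLINE 1
#define P_ONLINE  2
#define P_STATUS  3

#define NCPU 16

/* Mirrors the psrinfo output: ids 8, 13, 14 and 15 are off-line. */
static const int cpu_status[NCPU] = {
    P_ONLINE,  P_ONLINE,  P_ONLINE,  P_ONLINE,
    P_ONLINE,  P_ONLINE,  P_ONLINE,  P_ONLINE,
    P_OFFLINE, P_ONLINE,  P_ONLINE,  P_ONLINE,
    P_ONLINE,  P_OFFLINE, P_OFFLINE, P_OFFLINE
};

/* Hypothetical stand-in for p_online(id, P_STATUS). */
static int fake_p_online(int id, int flag)
{
    (void)flag;
    if (id < 0 || id >= NCPU) {
        errno = EINVAL;          /* no such processor */
        return -1;
    }
    return cpu_status[id];
}

/*
 * Replays the shape of the gmond loop. online_count plays the role of
 * cpu_id; every id that stats would be read for is stored in out[].
 * Returns the number of ids stored.
 */
size_t collect_ids(int online_count, int skip_offline, int *out, size_t cap)
{
    size_t n = 0;
    int cpu_id = online_count;
    int i;

    /* "i < NCPU" is an extra bound added for this sketch only. */
    for (i = 0; cpu_id > 0 && i < NCPU; i++) {
        int st = fake_p_online(i, P_STATUS);

        if (skip_offline && st == P_OFFLINE) continue;
        if (st == -1 && errno == EINVAL) continue;

        if (n < cap) out[n++] = i;   /* stats would be read for CPU i */
        cpu_id--;
    }
    return n;
}
```

Calling collect_ids(12, 0, ids, 16) stores ids 0 through 11, which includes the off-line CPU 8. Calling collect_ids(12, 1, ids, 16) stores 0-7 and 9-12, i.e. exactly the twelve on-line CPUs.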

I've made a one-line change to the code so that it skips to the next CPU id if the current CPU is offline. This fixes my problem for the time being, but there might be other cases where varying CPU status causes problems.

  /* Modified by Robert Petkus <[EMAIL PROTECTED]>
   * Get stats only for online CPUs. Previously, gmond segfaulted if
   * the CPUs were not numbered sequentially; i.e., cpu0, cpu2, etc.
   * Tested on 64 bit Solaris 8 and 9 with GCC 3.3 and 3.3.2
   */
     for (i = 0; cpu_id > 0; i++)
     {

        /**
         * this checked for unavailable CPU's but offline CPU's
         * need to be considered as well.
         * Status id 1 means P_OFFLINE
         */
        int n = p_online(i, P_STATUS);
        if (n == 1) continue;   /* skip if CPU is offline */
        if (n == -1 && errno == EINVAL) continue;


FYI, here are the CPU status identifiers:

/usr/include/sys/processor.h
#define P_OFFLINE       1       /* processor is offline, as quiet as possible */
#define P_ONLINE        2       /* processor online */
#define P_STATUS        3       /* value passed to p_online to request status */
#define P_BAD           4       /* unused so far but defined by USL */
#define P_POWEROFF      5       /* processor is powered off */
#define P_NOINTR        6       /* processor online, but no I/O interrupts */
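
Given those identifiers, one way to cover the remaining states (P_BAD, P_POWEROFF) would be to test for the states that are usable rather than the ones that are not. The helper below is only a sketch under assumptions: cpu_has_stats is a hypothetical name that does not exist in gmond, the constants are redefined locally so the snippet compiles anywhere, and it assumes P_ONLINE and P_NOINTR are the only states for which per-CPU stats should be read.

```c
/* Status values as defined in sys/processor.h. */
#define P_OFFLINE  1
#define P_ONLINE   2
#define P_STATUS   3
#define P_BAD      4
#define P_POWEROFF 5
#define P_NOINTR   6

/*
 * Hypothetical helper: given the value returned by
 * p_online(id, P_STATUS), return 1 if stats should be read for that
 * CPU and 0 otherwise. A return value of -1 (for example EINVAL for a
 * non-existent id) falls through to the default case and is skipped.
 */
int cpu_has_stats(int status)
{
    switch (status) {
    case P_ONLINE:
    case P_NOINTR:
        return 1;
    default:
        return 0;
    }
}
```

With such a helper, the two checks in the loop could collapse into a single "if (!cpu_has_stats(p_online(i, P_STATUS))) continue;". Note that this skips on any -1 return, not only when errno is EINVAL.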


Yes, as you can see, we are deploying Ganglia on big machines with not-so-friendly CPU configs. =)

Thanks!

JB


