[Ganglia-developers] Solaris CPU (p_online) report bug.

Jeong Bae Kim Tue, 01 Jun 2004 15:06:29 -0700

Hi,

I am a colleague of Mike Hom, who recently submitted a fix for "kstat cpu_info". I've recently started playing around with Ganglia, and I must say it's really nice!

On Solaris, we've run into a case where gmond would dump core when we had non-sequential "on-line" CPUs.

In the code segment below (file: gmond/machines/solaris.c), you can see that it calls the function "p_online" to retrieve the status of the CPUs.


  /* Modified by Robert Petkus <[EMAIL PROTECTED]>
   * Get stats only for online CPUs. Previously, gmond segfaulted if
   * the CPUs were not numbered sequentially; i.e., cpu0, cpu2, etc.
   * Tested on 64 bit Solaris 8 and 9 with GCC 3.3 and 3.3.2
   */
     for (i = 0; cpu_id > 0; i++)
     {
        if (p_online(i, P_STATUS) == -1 && errno == EINVAL) continue;

It skips to the next iteration if the CPU id referred to doesn't exist on the machine (return value == -1 with errno set to EINVAL).

This check guards against referencing a non-existent CPU id.

A simple scenario for this would be CPUs installed in slots that are non-sequential (e.g. Slot 1 and then Slot 3, skipping Slot 2).

However, this check only caters to the scenario where a CPU is unavailable (not installed). In more recent hardware, we have this thing called "EscOD control" that hot-swaps CPUs into the OS under high load. It simply toggles CPUs online and offline under appropriate circumstances. Now, the problem is that this toggling of CPUs sometimes messes up the order of the CPUs that are online, thereby making the IDs of these on-line CPUs non-sequential.

Even if a CPU is offline, it is still reported as being available, since the hardware actually exists: it's just not being used! Now, we've just hit another case where the CPU configuration breaks the gmond daemon. Here is the CPU config that broke gmond on a Solaris machine with 16 CPUs, out of which just 12 are online.


[EMAIL PROTECTED]:/opt/ganglia/bin> psrinfo
0       on-line   since 04/13/2004 00:29:06
1       on-line   since 04/13/2004 00:29:08
2       on-line   since 04/13/2004 00:29:08
3       on-line   since 04/13/2004 00:29:08
4       on-line   since 04/13/2004 00:29:08
5       on-line   since 04/13/2004 00:29:08
6       on-line   since 04/13/2004 00:29:08
7       on-line   since 04/13/2004 00:29:08
8       off-line  since 04/20/2004 03:30:32    << problem
9       on-line   since 05/13/2004 08:10:28
10      on-line   since 04/30/2004 09:10:23
11      on-line   since 04/22/2004 12:55:59
12      on-line   since 04/22/2004 11:20:13
13      off-line  since 05/13/2004 12:40:27
14      off-line  since 05/19/2004 15:10:32
15      off-line  since 05/20/2004 08:35:19

Notice that the sequence of on-line CPUs is broken at id 8.

The for loop will iterate exactly 12 times, since it is driven by the number of CPUs that are on-line. When the loop reaches CPU id 8, it will segfault because the CPU is available but its stats cannot be retrieved.

I've made a one-line change to the code to skip the iteration if the CPU is offline and move on to the next CPU id. This fixes my problem for the time being, but there might be other cases where varying CPU status might cause problems.


  /* Modified by Robert Petkus <[EMAIL PROTECTED]>
   * Get stats only for online CPUs. Previously, gmond segfaulted if
   * the CPUs were not numbered sequentially; i.e., cpu0, cpu2, etc.
   * Tested on 64 bit Solaris 8 and 9 with GCC 3.3 and 3.3.2
   */
     for (i = 0; cpu_id > 0; i++)
     {

        /**
         * this checked for unavailable CPU's but offline CPU's
         * need to be considered as well.
         * Status id 1 means P_OFFLINE
         */
        int n = p_online(i, P_STATUS);
        if (n == 1) continue;   // skip if CPU is offline
        if (n == -1 && errno == EINVAL) continue;
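
To make the effect of the one-line change easier to see, here is a small simulation of the id selection using the psrinfo listing above. It is only a sketch: sim_p_online() is a hypothetical stand-in for p_online(i, P_STATUS) so that it compiles off Solaris, select_cpu_ids() and the SIM_ names are made up for illustration, and the "i < SIM_NCPU" bound is an addition that is not in the gmond loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Status values copied from the sys/processor.h excerpt in this message. */
enum { SIM_P_OFFLINE = 1, SIM_P_ONLINE = 2 };

#define SIM_NCPU 16

/* CPU states from the psrinfo listing: ids 8, 13, 14 and 15 are off-line. */
static const int sim_status[SIM_NCPU] = {
    2, 2, 2, 2, 2, 2, 2, 2,   /* ids 0-7   on-line  */
    1,                        /* id  8     off-line */
    2, 2, 2, 2,               /* ids 9-12  on-line  */
    1, 1, 1                   /* ids 13-15 off-line */
};

/* Hypothetical stand-in for p_online(id, P_STATUS): returns the status,
 * or -1 with errno set to EINVAL for an id that does not exist. */
static int sim_p_online(int id)
{
    if (id < 0 || id >= SIM_NCPU) {
        errno = EINVAL;
        return -1;
    }
    return sim_status[id];
}

/* Walk CPU ids the way the gmond loop does, until online_count CPUs have
 * been selected. With skip_offline == 0 this mimics the old check, with
 * skip_offline != 0 the patched one. Selected ids are stored in out and
 * the number of stored ids is returned. */
static size_t select_cpu_ids(int online_count, int skip_offline,
                             int *out, size_t out_len)
{
    size_t n_out = 0;
    int remaining = online_count;
    int i;

    assert(out != NULL);

    /* The upper bound on i is an extra safety net for this sketch only. */
    for (i = 0; remaining > 0 && i < SIM_NCPU; i++) {
        int n = sim_p_online(i);

        if (skip_offline && n == SIM_P_OFFLINE) continue;
        if (n == -1 && errno == EINVAL) continue;

        if (n_out < out_len) out[n_out++] = i;
        remaining--;
    }
    return n_out;
}
```

With online_count set to 12, the old check selects ids 0 through 11 and therefore tries to read stats for the off-line CPU 8, while the patched check selects ids 0-7 and 9-12.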


FYI, here are the CPU status identifiers:

/usr/include/sys/processor.h

#define P_OFFLINE       1       /* processor is offline, as quiet as possible */
#define P_ONLINE        2       /* processor online */
#define P_STATUS        3       /* value passed to p_online to request status */
#define P_BAD           4       /* unused so far but defined by USL */
#define P_POWEROFF      5       /* processor is powered off */
#define P_NOINTR        6       /* processor online, but no I/O interrupts */
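
Given those identifiers, one way to cover the "other cases" would be to accept only the states known to be usable instead of skipping only P_OFFLINE. The following is just a sketch of that idea, not what gmond does: the helper names and the EX_ constants are made up so it compiles without the Solaris header, and it assumes that a P_NOINTR CPU still has stats to read, which the header comment suggests but I have not verified.

```c
/* Values copied from the sys/processor.h excerpt above. The EX_ prefix is
 * only there so this sketch does not depend on the Solaris header. */
enum {
    EX_P_OFFLINE  = 1,
    EX_P_ONLINE   = 2,
    EX_P_STATUS   = 3,
    EX_P_BAD      = 4,
    EX_P_POWEROFF = 5,
    EX_P_NOINTR   = 6
};

/* Hypothetical helper: readable name for a p_online() status value. */
static const char *cpu_status_name(int status)
{
    switch (status) {
    case EX_P_OFFLINE:  return "P_OFFLINE";
    case EX_P_ONLINE:   return "P_ONLINE";
    case EX_P_BAD:      return "P_BAD";
    case EX_P_POWEROFF: return "P_POWEROFF";
    case EX_P_NOINTR:   return "P_NOINTR";
    default:            return "unknown";
    }
}

/* Hypothetical helper: non-zero only for states in which the CPU is
 * running. Everything else, including -1 from a failed p_online() call,
 * is treated as "skip this id". */
static int cpu_should_be_sampled(int status)
{
    return status == EX_P_ONLINE || status == EX_P_NOINTR;
}
```

In the loop this would replace the two "continue" checks with a single test on cpu_should_be_sampled(n), so a powered-off CPU would be skipped as well.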

Yes, as you can see, we are deploying Ganglia on big machines with not-so-friendly CPU configs. =)


Thanks!

JB
