Bug#460331: analysis

Joey Hess Tue, 11 Mar 2008 13:38:38 -0700

Here's a good explanation of what procps is doing:
http://lkml.org/lkml/2002/2/18/187
However, the problem I'm seeing is not due to an overflow.


Just after a boot:

adsdebian:~# ps >/dev/null
Unknown HZ value! (67) Assume 100.
adsdebian:~# cat /proc/uptime  /proc/stat
92.52 64.55
cpu  1336 0 1458 3773 2681 4 0 0

If I sum all the cpu values and divide by the uptime, I get 100 every time.
Meanwhile, procps warns about unknown Hz values that trend toward 100
as the uptime increases. After enough uptime, the warning disappears.

adsdebian:~# ps >/dev/null
Unknown HZ value! (89) Assume 100.
adsdebian:~# cat /proc/uptime  /proc/stat
271.05 242.41
cpu  1367 0 1494 21521 2716 6 1 0

adsdebian:~# ps >/dev/null
Unknown HZ value! (91) Assume 100.
adsdebian:~# cat /proc/uptime  /proc/stat
336.21 307.28
cpu  1380 0 1510 27984 2740 6 1 0

adsdebian:/tmp# ps >/dev/null
adsdebian:/tmp# cat /proc/uptime /proc/stat
1195.29 1155.56
cpu  2319 0 1651 109945 5596 11 7 0
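
As a rough cross-check of the samples above, here is a minimal sketch of
the arithmetic: sum some of the counters from the cpu line and divide by
the uptime. estimate_hz is a hypothetical helper written for this
message, not a procps function, and rounding to the nearest integer is
my assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of procps): estimate Hz as
 * (sum of the first n counters) / (uptime in seconds),
 * rounded to the nearest integer. Returns 0 if uptime is not positive.
 */
static unsigned estimate_hz(const unsigned long long *fields, int n,
                            double uptime)
{
    unsigned long long jiffies = 0;
    int i;

    assert(fields != NULL);
    if (uptime <= 0.0)
        return 0;

    for (i = 0; i < n; i++)
        jiffies += fields[i];

    /* Adding 0.5 before truncating rounds to the nearest integer. */
    return (unsigned)((double)jiffies / uptime + 0.5);
}
```

Fed the first sample (uptime 92.52), this gives 100 when all eight
counters are summed and 71 when only the first four are.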


Now, looking at the code:

    sscanf(buf, "cpu %Lu %Lu %Lu %Lu", &user_j, &nice_j, &sys_j, &other_j);

Why are only 4 of the numbers extracted? All of them seem to be needed.
Especially on slow and disk-bound systems, the current code only
gets a number between 95 and 105 some time after boot, once the time
the system has spent in user+nice+sys+idle swamps the
iowait+irq+softirq+steal numbers.

       /proc/stat
              kernel/system  statistics.   Varies  with  architecture.  Common
              entries include:

              cpu  3357 0 4313 1362393
                     The  amount  of  time,  measured  in  units  of   USER_HZ
                     (1/100ths  of  a  second on most architectures), that the
                     system spent in user mode, user mode  with  low  priority
                     (nice),  system  mode,  and  the idle task, respectively.
                     The last value should be USER_HZ times the  second  entry
                     in the uptime pseudo-file.

                     In Linux 2.6 this line includes three additional columns:
                     iowait - time waiting for I/O to complete (since 2.5.41);
                     irq  -  time  servicing  interrupts  (since 2.6.0-test4);
                     softirq - time servicing softirqs (since 2.6.0-test4).

                     Since Linux 2.6.11, there is an eighth  column,  steal  -
                     stolen  time,  which is the time spent in other operating
                     systems when running in a virtualized environment

Based on this, it seems right to add up all of the values that are
available. (For values of "right" that assume this gross approach is the
right way to get the Hz value in the first place.)
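
A minimal sketch of what "add up whatever is available" could look like,
using the return value of sscanf to tell how many columns were present.
sum_cpu_jiffies is a hypothetical name, this is not a patch against
procps, and it uses the standard %llu conversion rather than %Lu. It
stops at the eight columns the man page excerpt describes and ignores
anything after them.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not the procps code): parse the aggregate "cpu"
 * line of /proc/stat and sum every counter that is present, up to eight
 * (user, nice, system, idle, iowait, irq, softirq, steal).
 * Returns the number of counters summed, or 0 if fewer than four parsed.
 */
static int sum_cpu_jiffies(const char *line, unsigned long long *total)
{
    unsigned long long v[8] = {0};
    int n, i;

    assert(line != NULL && total != NULL);

    /* sscanf returns how many conversions succeeded (or EOF). */
    n = sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    if (n < 4)
        return 0;   /* older kernels still provide at least four columns */

    *total = 0;
    for (i = 0; i < n; i++)
        *total += v[i];
    return n;
}
```

On the four-column line from the man page this sums four values; on a
2.6 line it picks up iowait, irq, softirq and steal as well.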



With that said, on my laptop, I have:

2677820.60 1205073.58
cpu  17764500 386487 3214308 117025796 3022994 318693 296809 0 0

Using the first 4 numbers yields 52, while adding all yields 53, which
would be an unknown Hz value with the current code.

-- 
see shy jo
