Hi Tom,

you're absolutely correct - there was a bug in the CPU usage calculation which 
incorrectly capped the CPU usage at the fraction equivalent to a single CPU 
core. As you mentioned, the problem could occur when monitoring the CPU usage 
of multi-threaded processes on multi-core machines.

Thanks for the patch, it will be part of the next release.

Best regards,
Martin



--- monit/trunk/src/process.c (original)
+++ monit/trunk/src/process.c Wed Jan 11 19:55:27 2012
@@ -233,8 +233,8 @@
       /* The cpu_percent may be set already (for example by HPUX module) */
       if (pt[i].cpu_percent  == 0 && pt[i].cputime_prev != 0 && pt[i].cputime != 0 && pt[i].cputime > pt[i].cputime_prev) {
         pt[i].cpu_percent = (int)((1000 * (double)(pt[i].cputime - pt[i].cputime_prev) / (pt[i].time - pt[i].time_prev)) / systeminfo.cpus);
-        if (pt[i].cpu_percent > 1000 / systeminfo.cpus)
-          pt[i].cpu_percent = 1000 / systeminfo.cpus;
+        if (pt[i].cpu_percent > 1000)
+          pt[i].cpu_percent = 1000;
       }
     } else {
       pt[i].cputime_prev = 0;




On Jan 6, 2012, at 9:32 PM, Tom Pepper wrote:

> Hi, Martin:
> 
> Can you clarify what exactly these two lines do in process.c's cpu percentage 
> calculation?
> 
>         if (pt[i].cpu_percent > 1000 / systeminfo.cpus)
>           pt[i].cpu_percent = 1000 / systeminfo.cpus;
> 
> They're causing total cpu to be misreported when processes use a large amount 
> of CPU and many cores are present.  Shouldn't the "/ systeminfo.cpus" be 
> dropped in both cases?  I assume it's meant to keep any strange math from 
> causing process cpu percentage to ever exceed 100%.
> 
> For example, with a 120s query delay, a process I have on a 24 core box 
> calculates with process.c's logic as:
> 
> cputime = 4809915 cputime_prev = 4803601 (delta 6314)
> time = 13258814089.516930 time_prev = 13258812889.395201 (delta 1200)
> 
> (cputime - cputime_prev) / (time - time_prev) = 6314/1200 = 5.26
> 1000 * 5.26 / 24 cpus = 219 "pt[i].cpu_percent" (which appears to represent 
> 21.9% in monitese), which is accurate.
> 
> 1000 / num_cpus is 41.6 on my box.  since 219 >> 41.6 it gets cut back to 
> 41.6.
> 
> Thanks,
> -t
> 
> 
> On Jan 5, 2012, at 4:33 AM, Martin Pala wrote:
> 
>> Yes, Wayne is correct and the usage is computed exactly as he described. 
>> Monit takes the summary of all CPU cores as 100%.
>> 
>> Regards,
>> Martin
>> 
>> 
>> 
>> On Jan 5, 2012, at 10:54 AM, Lawrence, Wayne wrote:
>> 
>>> I may be wrong, and I am sure someone will correct me if I am, but it 
>>> appears that the way the CPU usage is worked out across the multiple cores 
>>> is why you are getting this output.
>>>  
>>> The way I worked it out below is, I believe, the way monit works it out, 
>>> and the maths sort of makes sense.
>>>  
>>> 24 cores  24 x 100% = 2400
>>>  
>>> so if you divide 2400 by your usage from top
>>>  
>>> 2400 / 578 = 4.2
>>>  
>>> which would give you your percentage shown in monit.
>>>  
>>> Regards
>>>  
>>> Wayne
>>>  
>>> 
>>> 
>>>  
>>> On 5 January 2012 08:13, Tom Pepper <[email protected]> wrote:
>>> Hello:
>>> 
>>> I have a number of high-CPU processes that run on 24-core boxes configured 
>>> e.g.:
>>> 
>>> check process emr-enc01-01 with pidfile 
>>> /var/run/tada_liveenc_emr-enc01-01.pid
>>>   start program = "/usr/local/tada/launch.sh -c emr-enc01-01"
>>>   stop program = "/bin/bash -c 'kill -s SIGTERM `/bin/cat 
>>> /var/run/tada_liveenc_emr-enc01-01.pid`'"
>>>   if totalmem > 80% then alert
>>>   if totalmem > 90% then restart
>>>   if totalcpu < 10% for 10 cycles then alert
>>> 
>>> These processes create pidfiles which match correctly in top as:
>>> 
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND        
>>>                                                     
>>>  1710 root      20   0 3064m 1.2g 7808 S  578 15.8  47:31.53 tada_liveenc   
>>>                                                      
>>>  1866 root      20   0 2954m 1.3g 7804 S  545 16.7  45:18.52 tada_liveenc   
>>>   
>>> 
>>> However, monit sees these as a completely different total CPU usage:
>>> 
>>> Process 'emr-enc01-01'
>>>   status                            Running
>>>   monitoring status                 Monitored
>>>   pid                               1710
>>>   parent pid                        1
>>>   uptime                            8m 
>>>   children                          0
>>>   memory kilobytes                  1372300
>>>   memory kilobytes total            1372300
>>>   memory percent                    16.7%
>>>   memory percent total              16.7%
>>>   cpu percent                       4.1%
>>>   cpu percent total                 4.1%
>>>   data collected                    Thu, 05 Jan 2012 00:05:49
>>> 
>>> Process 'emr-enc01-02'
>>>   status                            Running
>>>   monitoring status                 Monitored
>>>   pid                               1866
>>>   parent pid                        1
>>>   uptime                            8m 
>>>   children                          0
>>>   memory kilobytes                  1362240
>>>   memory kilobytes total            1362240
>>>   memory percent                    16.6%
>>>   memory percent total              16.6%
>>>   cpu percent                       4.1%
>>>   cpu percent total                 4.1%
>>>   data collected                    Thu, 05 Jan 2012 00:05:49
>>> 
>>> Any thoughts on why this might be happening?  Hosts are ubuntu natty.  The 
>>> master processes themselves spawn about 150 threads (not forks).
>>> 
>>> FYI:
>>> 
>>> 662 root@enc01[tada]: $ uname -m
>>> x86_64
>>> 
>>> 663 root@enc01[tada]: $ file `which monit`
>>> /usr/local/bin/monit: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), 
>>> dynamically linked (uses shared libs), for GNU/Linux 2.6.0, not stripped
>>> 
>>> 664 root@enc01[tada]: $ monit -V
>>> This is Monit version 5.3.2
>>> Copyright (C) 2000-2011 Tildeslash Ltd. All Rights Reserved.
>>> 
>>> Thanks in advance,
>>> -Tom
>>> 
>>> --
>>> To unsubscribe:
>>> https://lists.nongnu.org/mailman/listinfo/monit-general
>>> 
>> 
> 
