Hi Tom,

you're absolutely correct - there was a bug in the CPU usage calculation which
incorrectly capped the CPU usage at the fraction equivalent to a single CPU
core. As you mentioned, the problem could occur when monitoring the CPU usage
of multi-threaded processes on multi-core machines.

Thanks for the patch, it will be part of the next release.

Best regards,
Martin
--- monit/trunk/src/process.c (original)
+++ monit/trunk/src/process.c Wed Jan 11 19:55:27 2012
@@ -233,8 +233,8 @@
         /* The cpu_percent may be set already (for example by HPUX module) */
         if (pt[i].cpu_percent == 0 && pt[i].cputime_prev != 0 && pt[i].cputime != 0 && pt[i].cputime > pt[i].cputime_prev) {
                 pt[i].cpu_percent = (int)((1000 * (double)(pt[i].cputime - pt[i].cputime_prev) / (pt[i].time - pt[i].time_prev)) / systeminfo.cpus);
-                if (pt[i].cpu_percent > 1000 / systeminfo.cpus)
-                        pt[i].cpu_percent = 1000 / systeminfo.cpus;
+                if (pt[i].cpu_percent > 1000)
+                        pt[i].cpu_percent = 1000;
         }
 } else {
         pt[i].cputime_prev = 0;
On Jan 6, 2012, at 9:32 PM, Tom Pepper wrote:
> Hi, Martin:
>
> Can you clarify what exactly these two lines do in process.c's cpu percentage
> calculation?
>
> if (pt[i].cpu_percent > 1000 / systeminfo.cpus)
> pt[i].cpu_percent = 1000 / systeminfo.cpus;
>
> They're causing total cpu to be misreported when processes use a large amount
> of CPU and many cores are present. Shouldn't the "/ systeminfo.cpus" be
> dropped in both cases? I assume it's meant to keep any strange math from
> causing process cpu percentage to ever exceed 100%.
>
> For example, with a 120s query delay, a process I have on a 24 core box
> calculates with process.c's logic as:
>
> cputime = 4809915 cputime_prev = 4803601 (delta 6314)
> time = 13258814089.516930 time_prev = 13258812889.395201 (delta 1200)
>
> (cputime - cputime_prev) / (time - time_prev) = 6314/1200 = 5.26
> 1000 * 5.26 / 24 cpus = 219 "pt[i].cpu_percent" (which appears to represent
> 21.9% in monit-ese), which is accurate.
>
> 1000 / num_cpus is 41.6 on my box. Since 219 >> 41.6, it gets cut back to
> 41.6.
>
> Thanks,
> -t
>
>
> On Jan 5, 2012, at 4:33 AM, Martin Pala wrote:
>
>> Yes, Wayne is correct and the usage is computed exactly as he described.
>> Monit takes the summary of all CPU cores as 100%.
>>
>> Regards,
>> Martin
>>
>>
>>
>> On Jan 5, 2012, at 10:54 AM, Lawrence, Wayne wrote:
>>
>>> I may be wrong, and I am sure someone will correct me if I am, but it
>>> appears that the way the CPU usage is worked out across the multiple cores
>>> is why you are getting this output.
>>>
>>> The way I worked it out is the way I believe monit works it out, and the
>>> maths sort of make sense:
>>>
>>> 24 cores: 24 x 100% = 2400
>>>
>>> so if you divide 2400 by your usage from top:
>>>
>>> 2400 / 578 = 4.2
>>>
>>> which would give you the percentage shown in monit.
>>>
>>> Regards
>>>
>>> Wayne
>>>
>>>
>>>
>>>
>>> On 5 January 2012 08:13, Tom Pepper <[email protected]> wrote:
>>> Hello:
>>>
>>> I have a number of high-CPU processes that run on 24-core boxes configured
>>> e.g.:
>>>
>>> check process emr-enc01-01 with pidfile
>>> /var/run/tada_liveenc_emr-enc01-01.pid
>>> start program = "/usr/local/tada/launch.sh -c emr-enc01-01"
>>> stop program = "/bin/bash -c 'kill -s SIGTERM `/bin/cat
>>> /var/run/tada_liveenc_emr-enc01-01.pid`'"
>>> if totalmem > 80% then alert
>>> if totalmem > 90% then restart
>>> if totalcpu < 10% for 10 cycles then alert
>>>
>>> These processes create pidfiles which match correctly in top as:
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>
>>> 1710 root 20 0 3064m 1.2g 7808 S 578 15.8 47:31.53 tada_liveenc
>>>
>>> 1866 root 20 0 2954m 1.3g 7804 S 545 16.7 45:18.52 tada_liveenc
>>>
>>>
>>> However, monit sees these as a completely different total CPU usage:
>>>
>>> Process 'emr-enc01-01'
>>> status Running
>>> monitoring status Monitored
>>> pid 1710
>>> parent pid 1
>>> uptime 8m
>>> children 0
>>> memory kilobytes 1372300
>>> memory kilobytes total 1372300
>>> memory percent 16.7%
>>> memory percent total 16.7%
>>> cpu percent 4.1%
>>> cpu percent total 4.1%
>>> data collected Thu, 05 Jan 2012 00:05:49
>>>
>>> Process 'emr-enc01-02'
>>> status Running
>>> monitoring status Monitored
>>> pid 1866
>>> parent pid 1
>>> uptime 8m
>>> children 0
>>> memory kilobytes 1362240
>>> memory kilobytes total 1362240
>>> memory percent 16.6%
>>> memory percent total 16.6%
>>> cpu percent 4.1%
>>> cpu percent total 4.1%
>>> data collected Thu, 05 Jan 2012 00:05:49
>>>
>>> Any thoughts on why this might be happening? Hosts are ubuntu natty. The
>>> master processes themselves spawn about 150 threads (not forks).
>>>
>>> FYI:
>>>
>>> 662 root@enc01[tada]: $ uname -m
>>> x86_64
>>>
>>> 663 root@enc01[tada]: $ file `which monit`
>>> /usr/local/bin/monit: ELF 64-bit LSB executable, x86-64, version 1 (SYSV),
>>> dynamically linked (uses shared libs), for GNU/Linux 2.6.0, not stripped
>>>
>>> 664 root@enc01[tada]: $ monit -V
>>> This is Monit version 5.3.2
>>> Copyright (C) 2000-2011 Tildeslash Ltd. All Rights Reserved.
>>>
>>> Thanks in advance,
>>> -Tom
>>>
>>> --
>>> To unsubscribe:
>>> https://lists.nongnu.org/mailman/listinfo/monit-general
>>>
>>
>