Re: CPU usage in Monit

Martin Pala Thu, 08 Dec 2011 07:10:26 -0800

I'm sorry, the hint won't work: when a manual action is performed, its
handling is prioritized, and the remaining services are handled only after
the action has completed.


A fix for the problem is available here; could you please try it?
http://www.mmonit.com/tmp/monit-5.3.1p3.tar.gz

Regards,
Martin



On Dec 8, 2011, at 3:09 PM, Lawrence, Wayne wrote:

> Hi Martin,
>  
> Actually, the system check comes right after the web server setup and before
> the apache checks. I basically modified the default monitrc file and did not
> change the order of the checks, so my check order is as follows:
>  
> check system
> check apache_bin
> check httpd
> check postfix
> check other services
>  
> If there is a change to the config I can make to remedy this, I will be
> happy to try it and report the results.
>  
> Regards
>  
> Wayne
> 
> 
>  
> On 8 December 2011 13:50, Martin Pala <[email protected]> wrote:
> Thanks, the root cause is clear now.
> 
> It seems that your configuration (most probably the apache check) uses the
> pattern-based process check, and the system check is most probably defined
> behind the apache check in monitrc. When you restart such a service, monit
> waits for it to start and refreshes the process list to see whether it has
> started. The process list refresh also refreshes the cpu usage statistics;
> this happens every 5 milliseconds until the process starts or the action
> times out.
> 
> The CPU usage reported by monit (for example system 50%) is thus true, but
> the value comes from a very short timeframe (cpu usage over the last 5
> milliseconds) instead of the full cycle (for example cpu usage over the last
> 5 seconds). If the system check is defined first (in front of the apache
> check), this won't happen, as it will take the initial values (from the
> cycle start) before the apache action occurs.
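
The effect of the sampling window can be reproduced with the counters from the
CPUDEBUG log quoted further down in this thread. The following is only a
sketch in Python, not monit's actual code; the helper name cpu_percent is made
up for illustration. It computes a percentage as the delta of one counter
divided by the delta of the total counter, and returns None when the total has
not advanced.

```python
from typing import Optional


def cpu_percent(old_value: int, new_value: int,
                old_total: int, new_total: int) -> Optional[float]:
    """Share of the elapsed ticks spent in one state, in percent.

    Returns None if the total counter did not advance, since the
    ratio is undefined for an empty window.
    """
    total_delta = new_total - old_total
    if total_delta <= 0:
        return None
    return 100.0 * (new_value - old_value) / total_delta


# Full cycle (cpu_syst counters logged at time=1323349142): 3987 ticks elapsed.
full_cycle = cpu_percent(194433, 194469, 59512589, 59516576)

# Refresh during the restart: only 1 tick elapsed, and it was system time.
short_window = cpu_percent(194472, 194473, 59516582, 59516583)

# Refresh with no tick elapsed at all: nothing to divide by.
empty_window = cpu_percent(194469, 194469, 59516576, 59516576)

print(f"full cycle:   {full_cycle:.2f}%")
print(f"short window: {short_window:.2f}%")
print(f"empty window: {empty_window}")
```

With a single elapsed tick, each state can only come out as 0% or 100%, which
is why a reading taken over such a short window looks nothing like the
per-cycle average.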
> 
> => it is a bug limited to a specific configuration:
> 1.) "check process myproc matching …" is used
> 2.) "check system" is defined after the myproc check
> 3.) the myproc service is restarted
> 
> Workaround:
> move "check system" ahead of the other services in your monit configuration
> file
> 
> 
> We'll fix the problem.
> 
> Thanks for help with the testing and data :)
> 
> Regards,
> Martin
> 
> 
> 
> On Dec 8, 2011, at 2:03 PM, Lawrence, Wayne wrote:
> 
>> Hi Martin,
>>  
>> I did as you instructed; here is the output.
>> To my untrained eye there is a serious miscalculation in the 4th CPUDEBUG
>> statement; I haven't a clue how it arrives at that figure.
>>  
>> CPUDEBUG: used_system_memory_sysdep: time=1323349102: cpu_user=293199 
>> (-1.00%), cpu_nice=547, cpu_syst=194433 (-1.00%), cpu_idle=58991209, 
>> cpu_wait=31605 (-1.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59512589 
>> -- old_cpu_user=0, old_cpu_syst=0, old_cpu_wait=0, old_cpu_total=0
>> CPUDEBUG: check_system: time=1323349102: 
>> systeminfo.total_cpu_user_percent=-1.00%, 
>> systeminfo.total_cpu_syst_percent=-1.00%, 
>> systeminfo.total_cpu_wait_percent=-1.00%
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 
>> (0.70%), cpu_nice=547, cpu_syst=194469 (0.90%), cpu_idle=58995131, 
>> cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516576 -- 
>> old_cpu_user=293199, old_cpu_syst=194433, old_cpu_wait=31605, 
>> old_cpu_total=59512589
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293227 
>> (-214748364.80%), cpu_nice=547, cpu_syst=194469 (-214748364.80%), 
>> cpu_idle=58995131, cpu_wait=31606 (-214748364.80%), cpu_irq=153, 
>> cpu_softirq=1990, cpu_total=59516576 -- old_cpu_user=293227, 
>> old_cpu_syst=194469, old_cpu_wait=31606, old_cpu_total=59516576
>> CPUDEBUG: used_system_memory_sysdep: time=1323349142: cpu_user=293229 
>> (0.00%), cpu_nice=547, cpu_syst=194473 (100.00%), cpu_idle=58995132, 
>> cpu_wait=31606 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59516583 -- 
>> old_cpu_user=293229, old_cpu_syst=194472, old_cpu_wait=31606, 
>> old_cpu_total=59516582
>> CPUDEBUG: check_system: time=1323349142: 
>> systeminfo.total_cpu_user_percent=0.00%, 
>> systeminfo.total_cpu_syst_percent=100.00%, 
>> systeminfo.total_cpu_wait_percent=0.00%
>> CPUDEBUG: used_system_memory_sysdep: time=1323349202: cpu_user=293307 
>> (0.90%), cpu_nice=547, cpu_syst=194542 (0.90%), cpu_idle=59001021, 
>> cpu_wait=31610 (0.00%), cpu_irq=153, cpu_softirq=1990, cpu_total=59522623 -- 
>> old_cpu_user=293252, old_cpu_syst=194483, old_cpu_wait=31606, 
>> old_cpu_total=59516616
>> CPUDEBUG: check_system: time=1323349202: 
>> systeminfo.total_cpu_user_percent=0.90%, 
>> systeminfo.total_cpu_syst_percent=0.90%, 
>> systeminfo.total_cpu_wait_percent=0.00%
>>  
>> Regards
>>  
>> Wayne
>> 
>> On 8 December 2011 12:50, Martin Pala <[email protected]> wrote:
>> Hi,
>> 
>> thanks for the update. I have prepared a debug version which logs the
>> values computed from /proc/stat right when they are ready and once again
>> before they are checked, so we can see whether the values were read and
>> computed correctly and confirm that no memory corruption occurred before
>> they were compared by the validation engine => there are two "CPUDEBUG" log
>> entries per cycle.
>> 
>> You can get it here: http://www.mmonit.com/tmp/monit-5.3.1p2.tar.gz
>> 
>> To compile:
>> tar -xzf monit-5.3.1p2.tar.gz
>> cd monit-5.3.1p2
>> ./configure
>> make
>> 
>> Then stop the existing monit instance and run the new monit binary:
>> ./monit -vI  2>&1 | grep CPUDEBUG
>> 
>> After you have replicated the problem, terminate monit with ^C and send the
>> whole CPUDEBUG output since monit start.
>> 
>> Regards,
>> Martin
>> 
>> 
>> 
>> On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote:
>> 
>>> Hi Martin, just as a side note: I disabled the cpu system test and tried
>>> again, and it seems that the issue is present with all the cpu monitoring.
>>>  
>>> I used the restart of httpd, as I knew it would trigger an alert, and
>>> these were the results.
>>>  
>>> Date:        Thu, 08 Dec 2011 10:27:59
>>>       Action:      alert
>>>       Host:        <hostname removed>
>>>       Description: cpu user usage of 100.0% matches resource limit [cpu 
>>> user usage>70.0%]
>>>  
>>> I ran vmstat 1 10 at the same time; as you can see, it's the 4th line.
>>>  
>>> 
>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>>  0  0      0 739220 142536 973532    0    0     4     7   10    6  0  0 99  0  0
>>>  0  0      0 739088 142536 973532    0    0     0     0  114  160  0  1 99  0  0
>>>  3  0      0 739088 142536 973536    0    0     0     0  126  169  1  2 97  0  0
>>>  0  0      0 737336 142536 973544    0    0     0   168  721  796 35 14 50  1  0
>>>  1  0      0 736964 142536 973544    0    0     0     0  109  160  1  1 98  0  0
>>>  
>>> And just to make it a little simpler, I ran sar 1 10 as well, as it is more
>>> human-readable.
>>>  
>>> 10:27:55        CPU     %user     %nice   %system   %iowait    %steal     %idle
>>> 10:27:56        all      1.01      0.00      1.01      0.00      0.00     97.98
>>> 10:27:57        all      0.00      0.00      1.00      0.00      0.00     99.00
>>> 10:27:58        all      3.96      0.00      3.96      0.00      0.00     92.08
>>> 10:27:59        all     32.00      0.00     12.00      1.00      0.00     55.00
>>>  
>>> Something struck me as odd while testing this: yesterday's result, a
>>> reported 50% system usage against an actual 15.84%, means the reported
>>> usage is about 3.2 times the actual (50/15.84). Today's reported user usage
>>> of 100% is about 3.1 times the actual 32% (100/32). So it seems we just need
>>> to work out why it is multiplying the results.
>>>  
>>> Regards
>>>  
>>> Wayne
>>> 
>>> On 7 December 2011 11:43, Lawrence, Wayne <[email protected]> 
>>> wrote:
>>> Hi Martin,
>>>  
>>> I downloaded the source from the Monit website and compiled it on the 
>>> server.
>>> I have started monit in verbose mode and this is the relevant information 
>>> it outputs when the event occurs.
>>>  
>>>  cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>> -------------------------------------------------------------------------------
>>>    ../tools/bin/monit() [0x41a533]
>>>  ../tools/bin/monit(LogError+0x9f) [0x41ad2f]
>>>    ../tools/bin/monit(Event_post+0x328) [0x417ba8]
>>>     ..t/tools/bin/monit() [0x428071]
>>>     ../tools/bin/monit(check_system+0x2b) [0x4285bb]
>>>     ../tools/bin/monit(validate+0x226) [0x42ad16]
>>>    ../tools/bin/monit() [0x41422d]
>>>     ../tools/bin/monit(main+0x511) [0x4149e1]
>>>     /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd]
>>>     ../tools/bin/monit() [0x40b179]
>>> -------------------------------------------------------------------------------
>>> Unfortunately remote access is not an option but I will happily run a debug 
>>> version to try and track down this problem as I really would like to use 
>>> Monit for my current build.
>>>  
>>> Regards
>>>  
>>> Wayne
>>> On 7 December 2011 11:17, Martin Pala <[email protected]> wrote:
>>> Thanks for data.
>>> 
>>> The /proc/stat format is this: 
>>> 
>>>     cpu <user> <nice> <system> <idle> <wait> <irq> <softirq>
>>> 
>>> The values are cumulative counters of cpu time (in clock ticks), so if we
>>> subtract the corresponding values from your output, we get this:
>>> 
>>>             user  nice  system  idle  wait  irq  softirq  |  total
>>> 09:57:35       1     0       1    99     0    0        0  |    101
>>> 09:57:36       1     0       0    98     0    0        0  |     99
>>> 09:57:37      25     0      16    59     1    0        0  |    101
>>> 09:57:38       1     0       2    98     0    0        0  |    101
>>> 
>>> => at 09:57:37 the cpu usage was:
>>> 
>>> user   = 24.75%
>>> system = 15.84%
>>> wait   =  0.99%
>>> 
>>> This corresponds to the previous vmstat output. Monit counts the cpu usage 
>>> the same way as above and doesn't modify these values => your monit really 
>>> reports strange cpu usage (reported 50% vs. real ~ 16%).
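
The subtraction above can be sketched as follows. This is an illustrative
Python snippet, not monit's implementation; the function names are
hypothetical, and the field names follow the /proc/stat layout given above.
Any extra trailing fields in the sample lines are ignored, which is an
assumption of this sketch.

```python
FIELDS = ("user", "nice", "system", "idle", "wait", "irq", "softirq")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate "cpu" line of /proc/stat into named counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Only the first seven counters are used; later fields are ignored.
    return dict(zip(FIELDS, (int(p) for p in parts[1:1 + len(FIELDS)])))


def usage_percent(old: dict, new: dict) -> dict:
    """Per-state share of the ticks elapsed between two samples."""
    deltas = {name: new[name] - old[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no ticks elapsed between the samples")
    return {name: 100.0 * d / total for name, d in deltas.items()}


# Samples from the thread, taken at 09:57:36 and 09:57:37.
before = parse_cpu_line("cpu  207062 501 103543 49452451 25303 83 1569 0 0")
after = parse_cpu_line("cpu  207087 501 103559 49452510 25304 83 1569 0 0")

usage = usage_percent(before, after)
for name in ("user", "system", "wait"):
    print(f"{name:<6} = {usage[name]:5.2f}%")
```

Run against the two samples, this gives the same user, system and wait
figures as the manual calculation for 09:57:37.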
>>> 
>>> What's the origin of your monit binary? Did you compile it from the
>>> original source code or from some 3rd party source code distribution (such
>>> as the RHEL or Fedora repository)? Or do you use the pre-compiled binaries
>>> from www.mmonit.com? Or some 3rd party binary, patches or source code from
>>> another site?
>>> 
>>> Could you please run monit in verbose mode and provide the full output?
>>> 
>>>    1.) stop monit
>>>    2.) run monit in foreground with verbose mode enabled:
>>>        ./monit -vI
>>>    3.) after the problem happens, stop monit with "^C" and send output
>>> 
>>> I can also prepare a debug version which will dump the cpu usage related
>>> information, or if you can provide remote access to the system, I can
>>> troubleshoot the problem remotely.
>>> 
>>> 
>>> Regards,
>>> Martin
>>> 
>>> 
>>> 
>>> On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote:
>>> 
>>>> Hi Martin,
>>>>  
>>>> this is the output of the commands you requested.
>>>>  
>>>> 1.) uname -m
>>>>  
>>>> x86_64
>>>>  
>>>> 2.) file `which monit`
>>>>  
>>>>  ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically 
>>>> linked (uses shared libs), for GNU/Linux 2.6.18, not stripped
>>>> I also ran the command you supplied to get the cpu usage directly while
>>>> restarting the httpd service, as I know this will generate an alert.
>>>>  
>>>>  
>>>>       Date:        Wed, 07 Dec 2011 09:57:37
>>>>       Action:      exec
>>>>       Host:        <hostname removed>
>>>>       Description: cpu system usage of 50.0% matches resource limit [cpu 
>>>> system usage>30.0%]
>>>> 
>>>> Wed Dec  7 09:57:34 GMT 2011
>>>> cpu  207060 501 103542 49452254 25303 83 1569 0 0
>>>> Wed Dec  7 09:57:35 GMT 2011
>>>> cpu  207061 501 103543 49452353 25303 83 1569 0 0
>>>> Wed Dec  7 09:57:36 GMT 2011
>>>> cpu  207062 501 103543 49452451 25303 83 1569 0 0
>>>> Wed Dec  7 09:57:37 GMT 2011
>>>> cpu  207087 501 103559 49452510 25304 83 1569 0 0
>>>> Wed Dec  7 09:57:38 GMT 2011
>>>> cpu  207088 501 103561 49452608 25304 83 1569 0 0
>>>> Wed Dec  7 09:57:40 GMT 2011
>>>> If my understanding of /proc/stat is correct, this still doesn't make any
>>>> sense, but I may be wrong.
>>>>  
>>>> Regards
>>>>  
>>>> Wayne
>>>>  
>>>> 
>>>>  
>>>> On 7 December 2011 09:37, Martin Pala <[email protected]> wrote:
>>>> Could you please check that your monit binary matches the system
>>>> architecture (for example a 64-bit monit binary on a 64-bit system, not
>>>> 32-bit monit on a 64-bit system)?
>>>> 
>>>> To verify, please provide the output of the following commands:
>>>> 1.) uname -m
>>>> 2.) file `which monit`
>>>> 
>>>> Monit takes the statistics from the /proc/stat kernel interface. You can 
>>>> collect the statistics manually like this - for example to fetch the state 
>>>> in 1 second intervals (30 samples):
>>>> 
>>>> $ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1; done
>>>> 
>>>> Note: monit takes the first /proc/stat line ("cpu"), which contains the
>>>> overall cpu usage in the system (summary of all cpus). /proc/stat also
>>>> contains per-cpu statistics; if you want to collect all the statistics,
>>>> simply replace the "grep 'cpu '" with "cat".
>>>> 
>>>> Regards,
>>>> Martin
>>>> 
>>>> 
>>>> On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote:
>>>> 
>>>>> Hi Martin,
>>>>>  
>>>>> I have tried various methods to identify the cause of this, and took your
>>>>> advice and used vmstat. I simply restarted the httpd process from the
>>>>> monit web interface while the command was running and got the following
>>>>> warning.
>>>>>  
>>>>>       Description: cpu system usage of 50.0% matches resource limit [cpu 
>>>>> system usage>30.0%]
>>>>>  
>>>>> But vmstat doesn't show that level of usage at the point of the alert. As
>>>>> you can see, there is some usage in the 3rd line of the output, when I
>>>>> restarted the httpd service, but it doesn't seem enough to trigger an
>>>>> alert.
>>>>>  
>>>>> vmstat 1 10
>>>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>>>>  0  0      0 859596 114684 856908    0    0     4     6   81   77  0  0 99  0  0
>>>>>  0  0      0 859448 114684 856916    0    0     0     0  100   94  1  0 99  0  0
>>>>>  0  0      0 898352 114692 815600    0    0     0   168  555  605 23 15 61  1  0
>>>>>  
>>>>> Not sure if there are any other tests I can run to narrow this down a bit
>>>>> further, as it still isn't making sense.
>>>>>  
>>>>> Regards
>>>>>  
>>>>> Wayne
>>>>>  
>>>>>  
>>>>> 
>>>>> 
>>>>>  
>>>>> On 7 December 2011 08:27, Martin Pala <[email protected]> wrote:
>>>>> Hi Lawrence,
>>>>> 
>>>>> the test which triggers the alert is "system" cpu => it's the time the
>>>>> system spends in kernel mode. The cpu usage could be triggered by some
>>>>> background kernel task. To verify that the monit report matches the
>>>>> system cpu usage, you should use either "vmstat" or "top" instead of "ps".
>>>>> 
>>>>> Best regards,
>>>>> Martin 
>>>>> 
>>>>> 
>>>>> On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote:
>>>>> 
>>>>>> Hi Igor,
>>>>>>  
>>>>>> the operating system is RHEL6 and monit version is 5.3.1
>>>>>>  
>>>>>> this is what i have in my config
>>>>>>  
>>>>>>     if cpu usage (user) > 70% then alert
>>>>>>     if cpu usage (system) > 30% then alert
>>>>>>     if cpu usage (wait) > 20% then alert
>>>>>> 
>>>>>> this is one of the errors
>>>>>> Description: cpu system usage of 50.0% matches resource limit [cpu 
>>>>>> system usage>30.0%]
>>>>>>  
>>>>>> this is what i get in /var/log/messages
>>>>>> Dec  6 12:01:29 <hostname-removed> monit[864]: <hostname-removed> cpu 
>>>>>> system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>>>> Dec  6 12:02:29 <hostname-removed> monit[864]: 
>>>>>> <hostname-removed><hostname-removed>' cpu system usage check succeeded 
>>>>>> [current cpu system usage=0.9%]
>>>>>>  
>>>>>> this is the output of ps --no-headers -A -o "%cpu sz ucomm" | sort -k1nr 
>>>>>> | head -20
>>>>>>  
>>>>>>  12:01:29 up 4 days, 20:24,  2 users,  load average: 0.04, 0.01, 0.00
>>>>>>              total       used       free     shared    buffers     cached
>>>>>> Mem:       2055108    1092176     962932          0      53156     811864
>>>>>> -/+ buffers/cache:     227156    1827952
>>>>>> Swap:      4128760          0    4128760
>>>>>>  1.2 44308 perl
>>>>>>  0.0     0 aio/0
>>>>>>  0.0     0 async/mgr
>>>>>>  0.0     0 ata/0
>>>>>>  0.0     0 ata_aux
>>>>>>  0.0     0 bdi-default
>>>>>>  0.0     0 cpuset
>>>>>>  0.0     0 crypto/0
>>>>>>  0.0     0 events/0
>>>>>>  0.0     0 ext4-dio-unwrit
>>>>>>  0.0     0 flush-253:0
>>>>>>  0.0     0 jbd2/dm-0-8
>>>>>>  0.0     0 kacpi_hotplug
>>>>>>  0.0     0 kacpi_notify
>>>>>>  0.0     0 kacpid
>>>>>>  0.0     0 kauditd
>>>>>>  0.0     0 kblockd/0
>>>>>>  0.0     0 kdmflush
>>>>>>  0.0     0 khelper
>>>>>>  0.0     0 khubd
>>>>>> 
>>>>>> I have to say I am at a total loss, as there is no way the usage figures
>>>>>> are accurate.
>>>>>> If there is any other info I can supply that will be useful, please let
>>>>>> me know.
>>>>>>  
>>>>>> Regards
>>>>>>  
>>>>>> Wayne
>>>>>> 
>>>>>> 
>>>>>> On 6 December 2011 12:03, Igor Homyakov 
>>>>>> <[email protected]> wrote:
>>>>>> Hi Lawrence,
>>>>>> 
>>>>>> Could you be a little bit more specific? Please provide information
>>>>>> about your operating system, the monit version on which the problem
>>>>>> occurred, and so on.
>>>>>> 
>>>>>> Regards
>>>>>> Igor Homyakov
>>>>>> 
>>>>>> On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne
>>>>>> <[email protected]> wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > I have a few CPU usage checks in my monitrc but it seems monit is
>>>>>> > misreporting the usage.
>>>>>> >
>>>>>> > I have run several tests and it seems that monit is multiplying the
>>>>>> > actual usage by 10.
>>>>>> >
>>>>>> > I ran a process with top running in another shell, and CPU usage for
>>>>>> > the user was never above 10%, yet monit informed me that there was
>>>>>> > 100% cpu usage.
>>>>>> >
>>>>>> > I have tried various configurations, including the one that came with
>>>>>> > the default config for system cpu monitoring, and all seem to
>>>>>> > demonstrate the same issue.
>>>>>> >
>>>>>> > Any advice welcomed on this
>>>>>> >
>>>>>> > Regards
>>>>>> >
>>>>>> > Wayne Lawrence
>>> 
>>> 

--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general
