Re: CPU usage in Monit

Martin Pala Thu, 08 Dec 2011 04:50:56 -0800

Hi,

thanks for the update. I have prepared a debug version which logs the values 
computed from /proc/stat twice: right when they are ready, and once again just 
before they are checked. This lets us see whether the values were read and 
computed correctly, and whether any memory corruption occurred before the 
validation engine compared them => there are two "CPUDEBUG" log entries per cycle.


You can get it here: http://www.mmonit.com/tmp/monit-5.3.1p2.tar.gz

To compile:
        tar -xzf monit-5.3.1p2.tar.gz
        cd monit-5.3.1p2
        ./configure
        make

Then stop the existing monit instance and run the new monit binary:
        ./monit -vI  2>&1 | grep CPUDEBUG

After you have reproduced the problem, terminate monit with ^C and send the whole 
CPUDEBUG output since monit start.

Regards,
Martin


On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote:

> Hi Martin, just as a side note: I disabled the cpu system test and tried 
> again, and it seems the issue is present with all of the cpu monitoring.
>  
> I used a restart of httpd as I knew it would trigger an alert, and these 
> were the results.
>  
>       Date:        Thu, 08 Dec 2011 10:27:59
>       Action:      alert
>       Host:        <hostname removed>
>       Description: cpu user usage of 100.0% matches resource limit [cpu user usage>70.0%]
>  
> I ran vmstat 1 10 at the same time; as you can see, the spike is in the 4th line.
>  
> 
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  0  0      0 739220 142536 973532    0    0     4     7   10    6  0  0 99  0  0
>  0  0      0 739088 142536 973532    0    0     0     0  114  160  0  1 99  0  0
>  3  0      0 739088 142536 973536    0    0     0     0  126  169  1  2 97  0  0
>  0  0      0 737336 142536 973544    0    0     0   168  721  796 35 14 50  1  0
>  1  0      0 736964 142536 973544    0    0     0     0  109  160  1  1 98  0  0
>  
> To make it a little simpler I also ran sar 1 10, as its output is more 
> human-readable.
>  
> 10:27:55        CPU     %user     %nice   %system   %iowait    %steal     %idle
> 10:27:56        all      1.01      0.00      1.01      0.00      0.00     97.98
> 10:27:57        all      0.00      0.00      1.00      0.00      0.00     99.00
> 10:27:58        all      3.96      0.00      3.96      0.00      0.00     92.08
> 10:27:59        all     32.00      0.00     12.00      1.00      0.00     55.00
>  
> Something struck me as odd while testing this. Yesterday's result, reporting 
> 50% system usage against 15.84% actual, means the reported usage is about 3.2 
> times the actual. Today's reported user usage of 100% is about 3.1 times the 
> actual 32%. So it seems we just need to work out why it is multiplying the results.
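
The factor mentioned above can be checked with two divisions. The following is a minimal Python sketch that uses only the figures quoted in this thread; the variable names and labels are illustrative, not anything Monit itself uses.

```python
# (value reported by monit, value measured with sar or /proc/stat), in percent.
observations = {
    "system, 7 Dec": (50.0, 15.84),
    "user, 8 Dec": (100.0, 32.0),
}

ratios = {}
for label, (reported, actual) in observations.items():
    # Ratio of the alert value to the independently measured value.
    ratios[label] = reported / actual
    print(f"{label}: reported/actual = {ratios[label]:.3f}")
```

Both ratios come out a little above 3.1, which is compatible with a roughly constant factor, although two samples are not enough to establish one. A reported value of 100% may also be clipped at the maximum, in which case the second ratio would only be a lower bound.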
>  
> Regards
>  
> Wayne
> 
> On 7 December 2011 11:43, Lawrence, Wayne <[email protected]> wrote:
> Hi Martin,
>  
> I downloaded the source from the Monit website and compiled it on the server.
> I have started monit in verbose mode and this is the relevant information it 
> outputs when the event occurs.
>  
>  cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
> -------------------------------------------------------------------------------
>    ../tools/bin/monit() [0x41a533]
>  ../tools/bin/monit(LogError+0x9f) [0x41ad2f]
>    ../tools/bin/monit(Event_post+0x328) [0x417ba8]
>     ..t/tools/bin/monit() [0x428071]
>     ../tools/bin/monit(check_system+0x2b) [0x4285bb]
>     ../tools/bin/monit(validate+0x226) [0x42ad16]
>    ../tools/bin/monit() [0x41422d]
>     ../tools/bin/monit(main+0x511) [0x4149e1]
>     /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd]
>     ../tools/bin/monit() [0x40b179]
> -------------------------------------------------------------------------------
> Unfortunately remote access is not an option but I will happily run a debug 
> version to try and track down this problem as I really would like to use 
> Monit for my current build.
>  
> Regards
>  
> Wayne
> On 7 December 2011 11:17, Martin Pala <[email protected]> wrote:
> Thanks for the data.
> 
> The /proc/stat format is this: 
> 
>     cpu <user> <nice> <system> <idle> <wait> <irq> <softirq>
> 
> The values count the cpu cycles, so if we subtract the corresponding values 
> from your output, we get this:
> 
>            user   nice   system   idle   wait   irq   softirq   |   total
> 09:57:35      1      0        1     99      0     0         0   |     101
> 09:57:36      1      0        0     98      0     0         0   |      99
> 09:57:37     25      0       16     59      1     0         0   |     101
> 09:57:38      1      0        2     98      0     0         0   |     101
> 
> => at  09:57:37 the cpu usage was:
> 
> user      = 24.75%
> system =  15.84%
> wait      =   0.99%
> 
> This corresponds to the previous vmstat output. Monit counts the cpu usage 
> the same way as above and doesn't modify these values => your monit really 
> reports strange cpu usage (reported 50% vs. real ~ 16%).
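
The subtraction described above can be reproduced with a short script. This is a minimal Python sketch, not Monit's actual code: the function name `cpu_percentages` and the field labels are assumptions made for illustration. It takes the two "cpu" lines sampled at 09:57:36 and 09:57:37 (from the /proc/stat output posted in this thread) and computes each field's share of the total increase over the first seven counters.

```python
# Field order of the first seven counters on the aggregate "cpu" line.
FIELDS = ("user", "nice", "system", "idle", "wait", "irq", "softirq")


def cpu_percentages(prev_line, curr_line):
    """Return each field's share (in percent) of the counter increase
    between two "cpu" lines taken from /proc/stat."""
    prev = [int(v) for v in prev_line.split()[1:1 + len(FIELDS)]]
    curr = [int(v) for v in curr_line.split()[1:1 + len(FIELDS)]]
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total == 0:
        # No ticks elapsed between the samples; avoid dividing by zero.
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}


# Samples from the thread, taken at 09:57:36 and 09:57:37.
before = "cpu  207062 501 103543 49452451 25303 83 1569 0 0"
after = "cpu  207087 501 103559 49452510 25304 83 1569 0 0"

usage = cpu_percentages(before, after)
for name in ("user", "system", "wait"):
    print(f"{name:<7}= {usage[name]:.2f}%")
```

With these two samples the deltas sum to 101 ticks, and the sketch yields the same 24.75% user, 15.84% system and 0.99% wait figures listed above.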
> 
> What's the origin of your monit binary? Did you compile it from the original 
> source code or from a 3rd party source distribution (such as a RHEL or 
> Fedora repository)? Or do you use the pre-compiled binaries from 
> www.mmonit.com? Or some 3rd party binary, patches or source code from another 
> site?
> 
> Can you please run monit in verbose mode and provide the full output:
> 
>    1.) stop monit
>    2.) run monit in foreground with verbose mode enabled:
>        ./monit -vI
>    3.) after the problem happens, stop monit with "^C" and send output
> 
> I can also prepare a debug version which will dump the cpu usage related 
> information, or if you can provide remote access to the system, I can 
> troubleshoot the problem remotely.
> 
> 
> Regards,
> Martin
> 
> 
> 
> On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote:
> 
>> Hi Martin,
>>  
>> this is the output of the commands you requested.
>>  
>> 1.) uname -m
>>  
>> x86_64
>>  
>> 2.) file `which monit`
>>  
>>  ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically 
>> linked (uses shared libs), for GNU/Linux 2.6.18, not stripped
>>  
>> I also ran the command you supplied to get the cpu usage directly while 
>> restarting the httpd service, as I know this will generate an alert.
>>  
>>  
>>       Date:        Wed, 07 Dec 2011 09:57:37
>>       Action:      exec
>>       Host:        <hostname removed>
>>       Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>> 
>> Wed Dec  7 09:57:34 GMT 2011
>> cpu  207060 501 103542 49452254 25303 83 1569 0 0
>> Wed Dec  7 09:57:35 GMT 2011
>> cpu  207061 501 103543 49452353 25303 83 1569 0 0
>> Wed Dec  7 09:57:36 GMT 2011
>> cpu  207062 501 103543 49452451 25303 83 1569 0 0
>> Wed Dec  7 09:57:37 GMT 2011
>> cpu  207087 501 103559 49452510 25304 83 1569 0 0
>> Wed Dec  7 09:57:38 GMT 2011
>> cpu  207088 501 103561 49452608 25304 83 1569 0 0
>> Wed Dec  7 09:57:40 GMT 2011
>> If my understanding of /proc/stat is correct this still doesn't make any 
>> sense, but I may be wrong.
>>  
>> Regards
>>  
>> Wayne
>>  
>> 
>>  
>> On 7 December 2011 09:37, Martin Pala <[email protected]> wrote:
>> Please can you check that your monit binary matches the system architecture? 
>> (i.e. for example 64-bit monit binary on 64-bit system - not 32-bit monit on 
>> 64-bit system) 
>> 
>> To verify, please provide the output of the following commands:
>> 1.) uname -m
>> 2.) file `which monit`
>> 
>> Monit takes the statistics from the /proc/stat kernel interface. You can 
>> collect the statistics manually like this - for example to fetch the state 
>> in 1 second intervals (30 samples):
>> 
>> $ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1; done
>> 
>> Note: monit takes the first /proc/stat line ("cpu") which contains the 
>> overall cpu usage in the system (summary of all cpus). /proc/stat also 
>> contains per-cpu statistics; if you want to collect all the statistics, 
>> simply replace the "grep 'cpu '" with "cat".
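
To illustrate the difference between the aggregate "cpu" line and the per-cpu lines, here is a minimal Python sketch. The helper name `aggregate_cpu_counters` is hypothetical, and only the first line of the sample text comes from the data posted in this thread; the "cpu0" and "intr" lines are placeholders added for illustration.

```python
def aggregate_cpu_counters(stat_text):
    """Return the counters from the aggregate "cpu" line of /proc/stat text.

    The aggregate line has "cpu" as its first token; per-cpu lines start
    with "cpu0", "cpu1", and so on, and are skipped here.
    """
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(v) for v in parts[1:]]
    raise ValueError('no aggregate "cpu" line found')


# The first line is from the thread; the "cpu0" and "intr" lines are
# illustrative placeholders, not real output from the affected host.
sample = """\
cpu  207060 501 103542 49452254 25303 83 1569 0 0
cpu0 207060 501 103542 49452254 25303 83 1569 0 0
intr 0
"""

counters = aggregate_cpu_counters(sample)
print(counters)
```

On a live Linux system the text could instead be read with `open("/proc/stat").read()`.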
>> 
>> Regards,
>> Martin
>> 
>> 
>> On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote:
>> 
>>> Hi Martin,
>>>  
>>> I have tried various methods to identify the cause of this and took your 
>>> advice and used vmstat. I simply restarted the httpd process from the monit 
>>> web interface while the command was running and got the following warning.
>>>  
>>>       Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>  
>>> But vmstat doesn't show that level of usage at the point of the alert. As you 
>>> can see, there is some usage in the 3rd line of the output, when I restarted 
>>> the httpd service, but it doesn't seem enough to trigger an alert.
>>>  
>>> vmstat 1 10
>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>>  0  0      0 859596 114684 856908    0    0     4     6   81   77  0  0 99  0  0
>>>  0  0      0 859448 114684 856916    0    0     0     0  100   94  1  0 99  0  0
>>>  0  0      0 898352 114692 815600    0    0     0   168  555  605 23 15 61  1  0
>>>  
>>> Not sure if there are any other tests i can run to narrow this down a bit 
>>> further as it still isn't making sense.
>>>  
>>> Regards
>>>  
>>> Wayne
>>>  
>>>  
>>> 
>>> 
>>>  
>>> On 7 December 2011 08:27, Martin Pala <[email protected]> wrote:
>>> Hi Lawrence,
>>> 
>>> the test which triggers the alert is "system" cpu => it's the time the 
>>> system spends in kernel mode. The cpu usage could be triggered by some 
>>> background kernel task. To verify that the monit report matches the system cpu 
>>> usage, you should use either "vmstat" or "top" instead of "ps".
>>> 
>>> Best regards,
>>> Martin 
>>> 
>>> 
>>> On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote:
>>> 
>>>> Hi Igor,
>>>>  
>>>> the operating system is RHEL6 and monit version is 5.3.1
>>>>  
>>>> this is what i have in my config
>>>>  
>>>>     if cpu usage (user) > 70% then alert
>>>>     if cpu usage (system) > 30% then alert
>>>>     if cpu usage (wait) > 20% then alert
>>>> 
>>>> this is one of the errors
>>>> Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>>  
>>>> this is what i get in /var/log/messages
>>>> Dec  6 12:01:29 <hostname-removed> monit[864]: <hostname-removed> cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>> Dec  6 12:02:29 <hostname-removed> monit[864]: <hostname-removed><hostname-removed>' cpu system usage check succeeded [current cpu system usage=0.9%]
>>>>  
>>>> this is the output of ps --no-headers -A -o "%cpu sz ucomm" | sort -k1nr | 
>>>> head -20
>>>>  
>>>>  12:01:29 up 4 days, 20:24,  2 users,  load average: 0.04, 0.01, 0.00
>>>>              total       used       free     shared    buffers     cached
>>>> Mem:       2055108    1092176     962932          0      53156     811864
>>>> -/+ buffers/cache:     227156    1827952
>>>> Swap:      4128760          0    4128760
>>>>  1.2 44308 perl
>>>>  0.0     0 aio/0
>>>>  0.0     0 async/mgr
>>>>  0.0     0 ata/0
>>>>  0.0     0 ata_aux
>>>>  0.0     0 bdi-default
>>>>  0.0     0 cpuset
>>>>  0.0     0 crypto/0
>>>>  0.0     0 events/0
>>>>  0.0     0 ext4-dio-unwrit
>>>>  0.0     0 flush-253:0
>>>>  0.0     0 jbd2/dm-0-8
>>>>  0.0     0 kacpi_hotplug
>>>>  0.0     0 kacpi_notify
>>>>  0.0     0 kacpid
>>>>  0.0     0 kauditd
>>>>  0.0     0 kblockd/0
>>>>  0.0     0 kdmflush
>>>>  0.0     0 khelper
>>>>  0.0     0 khubd
>>>> 
>>>> Have to say i am at a total loss as there is no way the usage figures are 
>>>> accurate.
>>>> If there is any other info i can supply that will be useful please let me 
>>>> know.
>>>>  
>>>> Regards
>>>>  
>>>> Wayne
>>>> 
>>>> 
>>>> On 6 December 2011 12:03, Igor Homyakov <[email protected]> 
>>>> wrote:
>>>> Hi Lawrence,
>>>> 
>>>> Could you be a little bit more specific? Please provide information
>>>> about your operating system, the monit version on which the problem
>>>> occurred, and so on.
>>>> 
>>>> Regards
>>>> Igor Homyakov
>>>> 
>>>> On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne
>>>> <[email protected]> wrote:
>>>> > Hi,
>>>> >
>>>> > I have a few CPU usage checks in my monitrc but it seems monit is
>>>> > misreporting the usage.
>>>> >
>>>> > I have run several tests and it seems that monit is multiplying the
>>>> > actual usage by 10.
>>>> >
>>>> > I ran a process with top running in another shell and CPU usage for the
>>>> > user was never above 10% yet monit informed me that there was 100% cpu usage.
>>>> >
>>>> > I have tried various configurations including the one that came with the
>>>> > default config for system cpu monitoring and all seem to demonstrate the
>>>> > same issue.
>>>> >
>>>> > Any advice welcomed on this
>>>> >
>>>> > Regards
>>>> >
>>>> > Wayne Lawrence
> 
> 
> --
> To unsubscribe:
> https://lists.nongnu.org/mailman/listinfo/monit-general
> 
> 

