Hi,

thanks for the update. I have prepared a debug version which logs the values
computed from /proc/stat twice: once right when they are ready, and once again
just before they are checked. This lets us see whether the values were read
and computed correctly, and whether any memory corruption occurred before the
validation engine compared them => there are two "CPUDEBUG" log entries per
cycle.

You can get it here:

  http://www.mmonit.com/tmp/monit-5.3.1p2.tar.gz

To compile:

  tar -xzf monit-5.3.1p2.tar.gz
  cd monit-5.3.1p2
  ./configure
  make

Then stop the existing monit instance and run the new monit binary:

  ./monit -vI 2>&1 | grep CPUDEBUG

After you have reproduced the problem, terminate monit with ^C and send the
whole CPUDEBUG output since monit start.

Regards,
Martin


On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote:

> Hi Martin,
>
> just as a side note: I disabled the cpu system test and tried again, and it
> seems that the issue is present with all the cpu monitoring.
>
> I used the restarting of httpd as I knew it would trigger an alert, and
> these were the results.
>
> Date:        Thu, 08 Dec 2011 10:27:59
> Action:      alert
> Host:        <hostname removed>
> Description: cpu user usage of 100.0% matches resource limit [cpu user usage>70.0%]
>
> I ran vmstat 1 10 at the same time; as you can see, it's the 4th line.
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  0  0      0 739220 142536 973532    0    0     4     7   10    6  0  0 99  0  0
>  0  0      0 739088 142536 973532    0    0     0     0  114  160  0  1 99  0  0
>  3  0      0 739088 142536 973536    0    0     0     0  126  169  1  2 97  0  0
>  0  0      0 737336 142536 973544    0    0     0   168  721  796 35 14 50  1  0
>  1  0      0 736964 142536 973544    0    0     0     0  109  160  1  1 98  0  0
>
> And just to make it a little simpler, I ran sar 1 10 as well, as it is more
> human readable.
>
> 10:27:55   CPU   %user   %nice   %system   %iowait   %steal   %idle
> 10:27:56   all    1.01    0.00      1.01      0.00     0.00   97.98
> 10:27:57   all    0.00    0.00      1.00      0.00     0.00   99.00
> 10:27:58   all    3.96    0.00      3.96      0.00     0.00   92.08
> 10:27:59   all   32.00    0.00     12.00      1.00     0.00   55.00
>
> Something struck me as odd while testing this: yesterday's result of 50%
> reported system usage against 15.84% actual means the reported usage is
> about 3.2 times the actual. Today's reported user usage of 100% is about
> 3.1 times the actual 32%, so it seems we just need to work out why it is
> multiplying the results.
>
> Regards
>
> Wayne
>
> On 7 December 2011 11:43, Lawrence, Wayne <[email protected]> wrote:
> Hi Martin,
>
> I downloaded the source from the Monit website and compiled it on the
> server. I have started monit in verbose mode and this is the relevant
> information it outputs when the event occurs.
>
> cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
> -------------------------------------------------------------------------------
> ../tools/bin/monit() [0x41a533]
> ../tools/bin/monit(LogError+0x9f) [0x41ad2f]
> ../tools/bin/monit(Event_post+0x328) [0x417ba8]
> ..t/tools/bin/monit() [0x428071]
> ../tools/bin/monit(check_system+0x2b) [0x4285bb]
> ../tools/bin/monit(validate+0x226) [0x42ad16]
> ../tools/bin/monit() [0x41422d]
> ../tools/bin/monit(main+0x511) [0x4149e1]
> /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd]
> ../tools/bin/monit() [0x40b179]
> -------------------------------------------------------------------------------
>
> Unfortunately remote access is not an option, but I will happily run a
> debug version to try and track down this problem, as I really would like to
> use Monit for my current build.
>
> Regards
>
> Wayne
>
> On 7 December 2011 11:17, Martin Pala <[email protected]> wrote:
> Thanks for the data.
>
> The /proc/stat format is this:
>
> cpu <user> <nice> <system> <idle> <wait> <irq> <softirq>
>
> The values are cumulative counters of cpu time (in clock ticks), so if we
> subtract each sample from the next one in your output, we get this:
>
>            user  nice  system  idle  wait  irq  softirq | total
> 09:57:35      1     0       1    99     0    0        0 |   101
> 09:57:36      1     0       0    98     0    0        0 |    99
> 09:57:37     25     0      16    59     1    0        0 |   101
> 09:57:38      1     0       2    98     0    0        0 |   101
>
> => at 09:57:37 the cpu usage was:
>
> user   = 24.75%
> system = 15.84%
> wait   =  0.99%
>
> This corresponds to the previous vmstat output.
> Monit counts the cpu usage the same way as above and doesn't modify these
> values => your monit really reports strange cpu usage (reported 50% vs.
> real ~16%).
>
> What's the origin of your monit binary? Did you compile it from the original
> source code or from some 3rd party source code distribution (such as the
> RHEL or Fedora repository)? Or do you use the pre-compiled binaries from
> www.mmonit.com? Or some 3rd party binary, patches or source code from
> another site?
>
> Please can you try to run monit in verbose mode and provide the full output?
>
> 1.) stop monit
> 2.) run monit in foreground with verbose mode enabled:
>       ./monit -vI
> 3.) after the problem happens, stop monit with "^C" and send the output
>
> I can also prepare a debug version which will dump the cpu usage related
> information, or if you can provide remote access to the system, I can
> troubleshoot the problem remotely.
>
>
> Regards,
> Martin
>
>
>
> On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote:
>
>> Hi Martin,
>>
>> this is the output of the commands you requested.
>>
>> 1.) uname -m
>>
>> x86_64
>>
>> 2.) file `which monit`
>>
>> ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically
>> linked (uses shared libs), for GNU/Linux 2.6.18, not stripped
>>
>> I ran the command you supplied to get the cpu usage directly as well, while
>> restarting the httpd service, as I know this will generate an alert.
>>
>>
>> Date:        Wed, 07 Dec 2011 09:57:37
>> Action:      exec
>> Host:        <hostname removed>
>> Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>
>> Wed Dec 7 09:57:34 GMT 2011
>> cpu 207060 501 103542 49452254 25303 83 1569 0 0
>> Wed Dec 7 09:57:35 GMT 2011
>> cpu 207061 501 103543 49452353 25303 83 1569 0 0
>> Wed Dec 7 09:57:36 GMT 2011
>> cpu 207062 501 103543 49452451 25303 83 1569 0 0
>> Wed Dec 7 09:57:37 GMT 2011
>> cpu 207087 501 103559 49452510 25304 83 1569 0 0
>> Wed Dec 7 09:57:38 GMT 2011
>> cpu 207088 501 103561 49452608 25304 83 1569 0 0
>> Wed Dec 7 09:57:40 GMT 2011
>>
>> If my understanding of /proc/stat is correct this still doesn't make any
>> sense, but I may be wrong.
>>
>> Regards
>>
>> Wayne
>>
>>
>>
>> On 7 December 2011 09:37, Martin Pala <[email protected]> wrote:
>> Please can you check that your monit binary matches the system
>> architecture? (i.e. a 64-bit monit binary on a 64-bit system, not a 32-bit
>> monit on a 64-bit system)
>>
>> To verify, please provide the output of the following commands:
>> 1.) uname -m
>> 2.) file `which monit`
>>
>> Monit takes the statistics from the /proc/stat kernel interface. You can
>> collect the statistics manually like this, for example to fetch the state
>> in 1 second intervals (30 samples):
>>
>> $ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1; done
>>
>> Note: monit takes the first /proc/stat line ("cpu"), which contains the
>> overall cpu usage in the system (summary of all cpus). /proc/stat also
>> contains per-cpu statistics; if you want to collect all the statistics,
>> simply replace the grep "cpu " with cat.
>>
>> Regards,
>> Martin
>>
>>
>> On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote:
>>
>>> Hi Martin,
>>>
>>> I have tried various methods to identify the cause of this and took your
>>> advice and used vmstat.
>>> I simply restarted the httpd process from the monit web interface while
>>> the command was running and got the following warning.
>>>
>>> Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>
>>> But vmstat doesn't show that level of usage at the point of the alert. As
>>> you can see, there is some usage in the 3rd line of the output, when I
>>> restarted the httpd service, but it doesn't seem enough to trigger an
>>> alert.
>>>
>>> vmstat 1 10
>>> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>>>  0  0      0 859596 114684 856908    0    0     4     6   81   77  0  0 99  0  0
>>>  0  0      0 859448 114684 856916    0    0     0     0  100   94  1  0 99  0  0
>>>  0  0      0 898352 114692 815600    0    0     0   168  555  605 23 15 61  1  0
>>>
>>> Not sure if there are any other tests I can run to narrow this down a bit
>>> further, as it still isn't making sense.
>>>
>>> Regards
>>>
>>> Wayne
>>>
>>>
>>>
>>>
>>>
>>> On 7 December 2011 08:27, Martin Pala <[email protected]> wrote:
>>> Hi Lawrence,
>>>
>>> the test which triggers the alert is "system" cpu => it's the time the
>>> system spends in kernel mode. The cpu usage could be triggered by some
>>> background kernel task. To verify that the monit report matches the
>>> system cpu usage, you should use either "vmstat" or "top" instead of "ps".
>>>
>>> Best regards,
>>> Martin
>>>
>>>
>>> On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote:
>>>
>>>> Hi Igor,
>>>>
>>>> the operating system is RHEL6 and the monit version is 5.3.1.
>>>>
>>>> This is what I have in my config:
>>>>
>>>> if cpu usage (user) > 70% then alert
>>>> if cpu usage (system) > 30% then alert
>>>> if cpu usage (wait) > 20% then alert
>>>>
>>>> This is one of the errors:
>>>> Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>>
>>>> This is what I get in /var/log/messages:
>>>> Dec 6 12:01:29 <hostname-removed> monit[864]: <hostname-removed> cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
>>>> Dec 6 12:02:29 <hostname-removed> monit[864]: <hostname-removed><hostname-removed>' cpu system usage check succeeded [current cpu system usage=0.9%]
>>>>
>>>> This is the output of
>>>> ps --no-headers -A -o "%cpu sz ucomm" | sort -k1nr | head -20
>>>>
>>>>  12:01:29 up 4 days, 20:24,  2 users,  load average: 0.04, 0.01, 0.00
>>>>              total       used       free     shared    buffers     cached
>>>> Mem:       2055108    1092176     962932          0      53156     811864
>>>> -/+ buffers/cache:     227156    1827952
>>>> Swap:      4128760          0    4128760
>>>>  1.2 44308 perl
>>>>  0.0     0 aio/0
>>>>  0.0     0 async/mgr
>>>>  0.0     0 ata/0
>>>>  0.0     0 ata_aux
>>>>  0.0     0 bdi-default
>>>>  0.0     0 cpuset
>>>>  0.0     0 crypto/0
>>>>  0.0     0 events/0
>>>>  0.0     0 ext4-dio-unwrit
>>>>  0.0     0 flush-253:0
>>>>  0.0     0 jbd2/dm-0-8
>>>>  0.0     0 kacpi_hotplug
>>>>  0.0     0 kacpi_notify
>>>>  0.0     0 kacpid
>>>>  0.0     0 kauditd
>>>>  0.0     0 kblockd/0
>>>>  0.0     0 kdmflush
>>>>  0.0     0 khelper
>>>>  0.0     0 khubd
>>>>
>>>> Have to say I am at a total loss, as there is no way the usage figures
>>>> are accurate.
>>>> If there is any other info I can supply that will be useful, please let
>>>> me know.
>>>>
>>>> Regards
>>>>
>>>> Wayne
>>>>
>>>>
>>>> On 6 December 2011 12:03, Igor Homyakov <[email protected]> wrote:
>>>> Hi Lawrence,
>>>>
>>>> Could you be a little bit more specific?
>>>> Please provide information about your operating system, the monit
>>>> version on which the problem occurred, and so on.
>>>>
>>>> Regards
>>>> Igor Homyakov
>>>>
>>>> On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne <[email protected]> wrote:
>>>> > Hi,
>>>> >
>>>> > I have a few CPU usage checks in my monitrc but it seems monit is
>>>> > misreporting the usage.
>>>> >
>>>> > I have run several tests and it seems that monit is multiplying the
>>>> > actual usage by 10.
>>>> >
>>>> > I ran a process with top running in another shell, and CPU usage for
>>>> > the user was never above 10%, yet monit informed me that there was
>>>> > 100% cpu usage.
>>>> >
>>>> > I have tried various configurations, including the one that came with
>>>> > the default config for system cpu monitoring, and all seem to
>>>> > demonstrate the same issue.
>>>> >
>>>> > Any advice welcomed on this
>>>> >
>>>> > Regards
>>>> >
>>>> > Wayne Lawrence

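For anyone who wants to repeat the /proc/stat delta arithmetic from this thread offline, here is a minimal Python sketch. It is not monit's code; it only assumes the column order quoted in the thread (user, nice, system, idle, wait, irq, softirq) and ignores any further columns. The names parse_cpu_line and cpu_usage are made up for illustration, and the two input lines are the 09:57:36 and 09:57:37 samples posted above.

```python
# Illustrative only: recompute CPU percentages from two "cpu" summary lines
# of /proc/stat, using samples posted in this thread. Not monit's own code.

# Column order as quoted in the thread; extra trailing columns are ignored.
FIELDS = ("user", "nice", "system", "idle", "wait", "irq", "softirq")


def parse_cpu_line(line):
    """Return the first seven counters of a /proc/stat "cpu" line as ints."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the summary 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_usage(prev_line, curr_line):
    """Percentage of elapsed ticks spent in each state between two samples."""
    prev, curr = parse_cpu_line(prev_line), parse_cpu_line(curr_line)
    delta = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * ticks / total for name, ticks in delta.items()}


# Samples taken at 09:57:36 and 09:57:37, copied from the thread.
before = "cpu 207062 501 103543 49452451 25303 83 1569 0 0"
after = "cpu 207087 501 103559 49452510 25304 83 1569 0 0"

usage = cpu_usage(before, after)
print({name: round(pct, 2) for name, pct in usage.items()})

# Ratio between the 50.0% that monit reported and the recomputed system share.
ratio = 50.0 / usage["system"]
print(round(ratio, 2))
```

With these two samples the result agrees with the figures in Martin's reply (user 24.75%, system 15.84%, wait 0.99%), and the ratio comes out close to the factor of roughly 3.2 that Wayne noticed.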
-- To unsubscribe: https://lists.nongnu.org/mailman/listinfo/monit-general
