Hello,
At my company, we are testing both gearmand and shinken for our next monitoring
infrastructure.
We are facing problems with the shinken pollers: many checks end up as defunct
processes (run via nagios perl plugins, both the official ones and our own).
Sometimes we have up to 800 zombies at a time.
It seems the zombies are reported as snmp_timeout in the log.
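
In case it helps to reproduce the numbers, here is a minimal Python sketch of
one way to count the zombies per parent process on the poller. It is only an
illustration, not the exact command we ran: the function names are made up for
this example, and it assumes a Linux host with the procps ps command.

```python
import subprocess
from collections import Counter


def count_zombies(ps_lines):
    """Count defunct processes per parent PID.

    Each line is expected to look like "STAT PPID COMMAND"; a state
    starting with "Z" marks a zombie (defunct) process.
    """
    per_parent = Counter()
    for line in ps_lines:
        fields = line.split(None, 2)
        if len(fields) >= 2 and fields[0].startswith("Z"):
            per_parent[int(fields[1])] += 1
    return per_parent


def current_zombies():
    """Run ps on the local host and count its zombies (assumes procps ps)."""
    out = subprocess.run(
        # One -o per column, each with "=" to suppress the header line.
        ["ps", "-e", "-o", "stat=", "-o", "ppid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_zombies(out.splitlines())


if __name__ == "__main__":
    zombies = current_zombies()
    print("total zombies:", sum(zombies.values()))
    # The parents holding the most zombies are the ones not reaping children.
    for ppid, n in zombies.most_common(5):
        print("parent", ppid, "->", n)
```

Grouping by parent PID should show whether the zombies all belong to the poller
worker processes or to something else.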
I tried different values for service_check_timeout and host_check_timeout,
without success.
Currently, both values are set to 60.
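
For reference, a sketch of what these two settings look like, assuming they
sit in the main nagios.cfg-style configuration file (the file location is an
assumption, only the two values come from our setup):

```cfg
# Both timeouts are expressed in seconds
service_check_timeout=60
host_check_timeout=60
```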
The same plugins are used by nagios and gearmand, and show no problem.
We are checking 16000 services with one poller. The same poller is used for
shinken or nagios / gearmand (not at the same time of course ;)
16000 checks show no problem with nagios / gearmand.
The poller is an 8-core / 16-thread machine with 12 GB of RAM.
We have another physical server as arbiter, broker (ndo and NPCD), receiver and
reactionner, and some VMs (from 1 up to 3) for the schedulers.
Any idea about the zombies?
Regards
_______________________________________________
Shinken-devel mailing list
Shinken-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/shinken-devel