This one time, at band camp, Craig Dibble wrote:
>...right up until I deployed our new LDAP servers to production. Now I
>find that I get intermittent failures from the keepalive script whereby
>it reports that some or all of the processes it is monitoring have died,
>tries to restart them, and fails.

Immediately I am thinking that the problem is somewhere in NSS. Timeouts
due to LDAP connection overheads, fd leaks in nss_ldap, nscd's very
existence, all could be causing something to fail. Unlike Solaris, POSIX
and Linux don't cater to temporary failure, so anything that explodes in
the pipeline is going to return a failed lookup (and if you're using nscd,
it'll cache that negative if you're really unlucky).

>[1] As a temporary fix I have put a simple hook in the keepalive script
>to die if the returned process list is empty. This works ok for the
>moment but without knowing why it is behaving like this I can't help but
>feel all I might be doing is putting a bandaid on a bigger problem.

Do you get an empty process list when you run it by hand? Is there a
timeout on the process list command in the keepalive script?

The first thing to try is to replicate the conditions in the script to get
a repeatable failure of ps. Once you've done that, you'll have some idea
as to where to look next.

-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
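
If it helps, here is a rough sketch of one way to go hunting for that repeatable failure: run the process-list command in a loop and log every run that times out, exits non-zero, or comes back empty. I don't know what your keepalive script actually runs, so the ps invocation, the function names (probe, watch) and the timeout values below are all assumptions; substitute the exact command line and environment from the script.

```python
import subprocess
import sys
import time


def probe(cmd, timeout_s=10.0):
    """Run cmd once and classify the outcome as ok, empty, error or timeout."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The command hung, e.g. while waiting on a slow name lookup.
        return ("timeout", time.monotonic() - start, "")
    except OSError as exc:
        # The command could not be started at all.
        return ("error", time.monotonic() - start, str(exc))
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return ("error", elapsed, result.stderr.strip())
    if not result.stdout.strip():
        # Exit status 0 but no output: the "empty process list" case.
        return ("empty", elapsed, result.stderr.strip())
    return ("ok", elapsed, "")


def watch(cmd, runs=100, interval_s=1.0, timeout_s=10.0):
    """Probe cmd repeatedly and return a list of every run that was not ok."""
    failures = []
    for run in range(runs):
        status, elapsed, detail = probe(cmd, timeout_s)
        if status != "ok":
            failures.append((run, status, round(elapsed, 3), detail))
        time.sleep(interval_s)
    return failures


if __name__ == "__main__":
    # Assumed stand-in for the keepalive script's command. Asking ps for
    # the user column makes it translate uids to names, which is where
    # NSS gets involved.
    command = sys.argv[1:] or ["ps", "-e", "-o", "user,pid,comm"]
    for failure in watch(command, runs=5, interval_s=0.5):
        print(failure)
```

Run it from the same place the keepalive script runs (cron, init, whatever) rather than from a login shell, since the environment may be part of the problem. The elapsed times are worth a look even on the runs that succeed: lookups that take several seconds would point at LDAP latency rather than ps itself.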