This one time, at band camp, Craig Dibble wrote:
>...right up until I deployed our new LDAP servers to production. Now I
>find that I get intermittent failures from the keepalive script whereby
>it reports that some or all of the processes it is monitoring have died,
>tries to restart them, and fails.

My first thought is that the problem is somewhere in NSS.  Timeouts due
to LDAP connection overhead, fd leaks in nss_ldap, or nscd's very
existence could all be causing something to fail.

Unlike Solaris, POSIX and Linux don't cater for temporary failures, so
anything that explodes in the pipeline is going to come back as a failed
lookup (and if you're using nscd and you're really unlucky, it'll cache
that negative result).
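
If you want to test the NSS theory on its own, a rough sketch like the
following might help. It just repeats a passwd lookup and records whether
each attempt succeeded and how long it took. The function name
probe_lookup and its parameters are made up for illustration, and it
assumes Python 3 is available on the box.

```python
import pwd
import time


def probe_lookup(name, attempts=5, delay=0.0, lookup=pwd.getpwnam):
    """Call lookup(name) repeatedly; return a list of (ok, seconds) tuples."""
    results = []
    for _attempt in range(attempts):
        start = time.monotonic()
        try:
            lookup(name)
            ok = True
        except (KeyError, OSError):
            # No entry came back. From here a genuinely missing user and a
            # failure somewhere behind NSS look much the same.
            ok = False
        results.append((ok, time.monotonic() - start))
        time.sleep(delay)
    return results
```

Point it at an account that only exists in LDAP, e.g.
probe_lookup("someuser", attempts=100, delay=0.5), and look for failed or
unusually slow attempts. Running it once with nscd and once without may
show whether the cache is part of the problem.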

>[1] As a temporary fix I have put a simple hook in the keepalive script
>to die if the returned process list is empty. This works ok for the
>moment but without knowing why it is behaving like this I can't help but
>feel all I might be doing is putting a bandaid on a bigger problem.

Do you get an empty process list when you run it by hand?

Is there a timeout on the process list command in the keepalive script?

The first thing to try is to replicate the conditions in the script to get a
repeatable failure of ps.  Once you've done that, you'll have some idea as
to where to look next.
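
As a starting point for that, here is a minimal sketch that runs a process
list command over and over with a timeout and counts how often it comes
back empty, fails, or hangs. The ps arguments are only a guess at what the
keepalive script runs, so substitute the real command; run_ps and repeat
are hypothetical names.

```python
import subprocess
import time

# Assumed command line; replace it with whatever the keepalive script uses.
DEFAULT_CMD = ("ps", "-e", "-o", "pid=", "-o", "user=", "-o", "comm=")


def run_ps(cmd=DEFAULT_CMD, timeout=10):
    """Run the command once; return (status, text) for the outcome."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout", ""
    if proc.returncode != 0:
        return "error", proc.stderr
    if not proc.stdout.strip():
        # Exit status 0 but nothing printed: the case reported in the script.
        return "empty", proc.stderr
    return "ok", proc.stdout


def repeat(runs=100, delay=1.0, **kwargs):
    """Call run_ps `runs` times and count how often each status occurs."""
    counts = {}
    for _run in range(runs):
        status, _text = run_ps(**kwargs)
        counts[status] = counts.get(status, 0) + 1
        time.sleep(delay)
    return counts
```

Running repeat() from the same place the keepalive script runs (same user,
same cron environment) should give a closer match to the failing
conditions than an interactive shell would.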
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
