Hi all,

I have another head scratcher that I hope someone can help shed some
light on:

We have a home-grown Perl keepalive script that runs via cron every ten
minutes to monitor various processes on our production systems. It runs
ps -ef and compares the output to a list, restarting processes that have
stopped and killing and restarting any that have run away. This is a
legacy script that has been running quite happily for several years...

...right up until I deployed our new LDAP servers to production. Now I
find that I get intermittent failures from the keepalive script whereby
it reports that some or all of the processes it is monitoring have died,
tries to restart them, and fails.

We eventually determined that the reason it fails to restart the
processes is that they are in fact still running. What actually
happened is that the ps command returned no output, so the script
assumed they were all dead.

Now I know that this comes down to the fact that there is no error
handling in the script, and I could fix this quite easily[1], but what I
need to understand is why this is happening in the first place and how
LDAP could be having this effect on the script. Stop the LDAP servers
and the false warnings stop.

Initially I considered that the ps command might be timing out for some
reason when trying to look up the names of the owners of the processes,
but adding a flag to return only the UIDs made no difference.

I have read some reports of issues with nscd (which may be related to my
first supposition), but neither switching it off nor disabling its
caches makes any difference.

Other information that may be useful: the OS is Fedora Core 4 (FC4), and
I'm running nss_ldap, using nisNetgroups, and using ACLs for security via
PAM's access module.

Finally - the problem doesn't just occur on the LDAP master and slave,
it also occurs on client systems.

Does anyone have any idea what else might be happening here? It's got
me fairly flummoxed right now.

[1] As a temporary fix I have put a simple hook in the keepalive script
to die if the returned process list is empty. This works OK for the
moment, but without knowing why it is behaving like this I can't help
but feel that all I might be doing is putting a band-aid on a bigger
problem.
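
A minimal sketch of that kind of guard, in Python rather than the
script's Perl purely for illustration. The names run_ps, find_missing,
PsFailed and the EXPECTED watch list are hypothetical and not taken from
the real script. The idea is to treat an empty or failed ps listing as
"unknown" rather than "everything is dead", and to keep the exit status
and stderr from ps so there is something to go on when it happens.

```python
import subprocess


class PsFailed(RuntimeError):
    """Raised when ps gives no usable process list."""


def run_ps():
    # Run ps -ef and keep the exit status and stderr as well as stdout,
    # so that a failed or empty run can be logged for later diagnosis.
    result = subprocess.run(["ps", "-ef"], capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


def find_missing(ps_output, expected):
    # Drop the header line. If nothing is left, refuse to draw any
    # conclusion instead of reporting every process as dead.
    lines = ps_output.splitlines()[1:]
    if not lines:
        raise PsFailed("ps returned no process lines")
    # A process counts as present if its name appears in any ps line.
    return [name for name in expected
            if not any(name in line for line in lines)]


if __name__ == "__main__":
    EXPECTED = ["slapd", "crond"]  # hypothetical watch list
    status, out, err = run_ps()
    try:
        if status != 0:
            raise PsFailed(f"ps exited with status {status}")
        print("missing:", find_missing(out, EXPECTED))
    except PsFailed as exc:
        # Skip this run and record what ps said, rather than restarting.
        print(f"skipping this run: {exc}; stderr={err!r}")
```

Logging the exit status and stderr on the empty runs may also help
narrow down whether ps itself is failing or being interrupted.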

Thanks in advance for any pointers,
Craig


-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
