Hi,

we are currently chasing a strange issue at a customer's site where the LDAP
slaves become unresponsive when network connectivity to the master LDAP and
DNS servers is lost.

They have a setup of two masters and two slaves at separate sites.  There is a
load balancer sitting in front of the slaves that performs regular health
checks, each consisting of a bind followed by a search for the binddn.
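
For reference, the health check can be replicated from the command line
roughly like this (password is a placeholder, the slave IP is taken from the
logs below, and scope=1 in the logs corresponds to -s one):

  ldapsearch -x -H ldap://192.0.2.129:389 \
      -D "cn=keepalive-check-lb,ou=system,dc=example,dc=org" -w secret \
      -b "ou=system,dc=example,dc=org" -s one "(cn=keepalive-check-lb)"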

During regular operations the load balancers' health checks look as follows [1]:

  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 fd=36 ACCEPT from 
IP=192.0.2.189:33852 (IP=192.0.2.129:389)
  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 BIND 
dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" method=128
  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 BIND 
dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" mech=SIMPLE ssf=0
  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=0 RESULT tag=97 err=0 text=
  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 SRCH 
base="ou=system,dc=example,dc=org" scope=1 deref=0 
filter="(cn=keepalive-check-lb)"
  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 ENTRY 
dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org"
  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=1 SEARCH RESULT tag=101 
err=0 nentries=1 text=
  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 op=2 UNBIND
  Dec  2 14:38:05 ldap slapd[57585]: connection_closing: readying conn=3924716 
sd=36 for close
  Dec  2 14:38:05 ldap slapd[57585]: connection_resched: attempting closing 
conn=3924716 sd=36
  Dec  2 14:38:05 ldap slapd[57585]: conn=3924716 fd=36 closed


When they experience a network outage separating the slaves from the masters
and the DNS servers, the load balancers are not able to bind to the slaves:

  Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 fd=44 ACCEPT from 
IP=192.0.2.188:35761 (IP=192.0.2.129:389)
  Dec  2 14:38:50 ldap slapd[57585]: connection_closing: readying conn=3924725 
sd=44 for close
  Dec  2 14:38:50 ldap slapd[57585]: connection_close: deferring conn=3924725 
sd=44
  Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 op=0 BIND 
dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" method=128
  Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 op=0 BIND 
dn="cn=keepalive-check-lb,ou=system,dc=example,dc=org" mech=SIMPLE ssf=0
  Dec  2 14:38:50 ldap slapd[57585]: connection_resched: attempting closing 
conn=3924725 sd=44
  Dec  2 14:38:50 ldap slapd[57585]: conn=3924725 fd=44 closed (connection lost)

We have not been able to reproduce this problem in a lab setup which is
supposed to be identical to the production setup.  It does not seem to be
related to the servers not being able to perform reverse mapping on the
client IPs.  We run a mixture of OpenLDAP 2.4.35 and 2.4.38 on CentOS 6.4.
In the lab the slaves are able to perform queries just fine without
connectivity to the masters or to their DNS servers.
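
One way to cut that connectivity on a lab slave for such tests is along these
lines (the address range is a placeholder for the obfuscated production
networks):

  iptables -A OUTPUT -d 192.0.2.0/24 -j DROP    # master site
  iptables -A OUTPUT -p udp --dport 53 -j DROP  # dns
  iptables -A OUTPUT -p tcp --dport 53 -j DROP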

The servers are currently running with the following loglevel:

  dn: cn=config
  olcLogLevel: Conns
  olcLogLevel: Stats
  olcLogLevel: Stats2
  olcLogLevel: Sync

It seems we only get to the point where the bind credentials are parsed,
after which the connection is closed.
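
To see more of what happens between the BIND and the close we are considering
temporarily raising the loglevel, e.g. adding trace output (assuming cn=config
is reachable via ldapi:/// with SASL EXTERNAL):

  ldapmodify -Y EXTERNAL -H ldapi:/// <<EOF
  dn: cn=config
  changetype: modify
  replace: olcLogLevel
  olcLogLevel: Conns
  olcLogLevel: Stats
  olcLogLevel: Stats2
  olcLogLevel: Sync
  olcLogLevel: Trace
  EOF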

This could of course be a problem with the load balancer prematurely closing 
the connection.
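
To settle that question, a packet capture on a slave during an outage should
show which side tears the connection down first (the interface name is a
placeholder, the host is the load balancer seen in the failing log above):

  tcpdump -i eth0 -s 0 -w lb-check.pcap host 192.0.2.188 and port 389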

I am trying to eliminate any causes on the LDAP servers.

Any ideas on how to debug this, or where else we could look?

Greetings
Christian

[1] DNS names and IPs obfuscated to protect the customer

--
Christian Kratzer                      CK Software GmbH
Email:   [email protected]                  Wildberger Weg 24/2
Phone:   +49 7032 893 997 - 0          D-71126 Gaeufelden
Fax:     +49 7032 893 997 - 9          HRB 245288, Amtsgericht Stuttgart
Web:     http://www.cksoft.de/         Geschaeftsfuehrer: Christian Kratzer
