Hi,

I've encountered issues with a MMR setup, which looks like the following:

     A ------- B
       \           /
         \       /
           \   /
             C

The replication works for approximately 24 hours. There are not many changes to 
the content anyway. After about 1 day, the attribute  value of the type 
"nsds5replicaLastUpdateStatus"  changes to "1 Can't acquire busy replica " of 
the replication agreement object from type "nsDS5ReplicationAgreement".  I see 
this message on C for the agreement "C-to-B".  The start-time of the last 
update is 01:08:33.  When I check the status on B, it looks fine for "B-to-C" 
and "B-to-A", however, the start-time of the last update is stuck at 01:08:36 
for "B-to-C", whereas A gets updated afterwards as well. I don't have the 
values for A!

When, I check errors and access on the boxes, I see the following:

Errors on A:
[10/Nov/2012:01:19:31 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Warning: unable to receive endReplication extended operation response (Timed 
out)
[10/Nov/2012:01:25:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Can't contact LDAP server). Will retry later.
[10/Nov/2012:01:25:05 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth resumed
[10/Nov/2012:02:26:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Timed out). Will retry later.
[10/Nov/2012:02:31:55 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Can't contact LDAP server). Will retry later.
[10/Nov/2012:02:31:59 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth resumed
[10/Nov/2012:02:43:36 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Timed out). Will retry later.
[10/Nov/2012:03:03:00 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Timed out). Will retry later.
[10/Nov/2012:03:08:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Can't contact LDAP server). Will retry later.
[10/Nov/2012:03:11:35 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:03:11:35 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the 
LDAP server) ((null))
[10/Nov/2012:03:14:45 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:03:14:52 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth resumed
[10/Nov/2012:03:33:29 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:03:33:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the 
LDAP server) ((null))
[10/Nov/2012:03:43:29 -0300] slapi_ldap_bind - Error: timeout after [0.0] 
seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]
[10/Nov/2012:03:43:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))
[10/Nov/2012:03:46:39 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:03:46:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the 
LDAP server) ((null))
[10/Nov/2012:03:46:42 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth resumed
[10/Nov/2012:05:12:02 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Timed out). Will retry later.
[10/Nov/2012:06:16:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Timed out). Will retry later.
[10/Nov/2012:06:21:27 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Can't contact LDAP server). Will retry later.
[10/Nov/2012:06:21:46 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth resumed
[10/Nov/2012:06:39:23 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Timed out). Will retry later.
[10/Nov/2012:07:43:45 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Timed out). Will retry later.
[10/Nov/2012:19:38:22 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Timed out). Will retry later.
[10/Nov/2012:19:43:51 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Unable to receive the response for a startReplication extended operation to 
consumer (Can't contact LDAP server). Will retry later.
[10/Nov/2012:19:43:54 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth resumed
[10/Nov/2012:20:18:21 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:20:18:21 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the 
LDAP server) ((null))
[10/Nov/2012:20:23:21 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:20:28:21 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:20:33:20 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:20:38:22 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:20:43:22 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:20:48:20 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:21:00:15 -0300] slapi_ldap_bind - Error: timeout after [0.0] 
seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]
[10/Nov/2012:21:00:15 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))
[10/Nov/2012:21:03:24 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:21:03:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the 
LDAP server) ((null))
[10/Nov/2012:21:06:36 -0300] slapi_ldap_bind - Error: could not send bind 
request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't 
connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 
(Operation now in progress)
[10/Nov/2012:21:16:39 -0300] slapi_ldap_bind - Error: timeout after [0.0] 
seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]
[10/Nov/2012:21:16:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))
[10/Nov/2012:21:24:14 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): 
Replication bind with SIMPLE auth resumed

Errors on B:
[10/Nov/2012:01:27:56 -0300] NSMMReplicationPlugin - agmt="cn=B-to-A" (A:389): 
Replication bind with SIMPLE auth resumed

Errors on C:
No entry for the time

I also analyzed the access files and see the following:
>From access on A:
I see replication sessions from B and C, established at 01:00:10 and closed at 
01:01:14.
Next replication session from C, established at 01:05:12 and closed at 
01:06:12. There was NO session from B.
B and C establish the next replication session at 01:07:52. During the session, 
a MOD is performed at 01:08:33. C closes this session at 01:09:35. The session 
from B stays open, and I see a ABANDOM request from A at 01:17:53. Eventually, 
the session gets closed (no unbind) at 01:23:25.
[10/Nov/2012:01:17:53 -0300] conn=1439 op=3 ABANDON targetop=NOTFOUND msgid=3
Replication sessions from C continue to work fine.
There is one more replication session from B, established at 01:27:56 and 
closed at 01:28:38. After this session, B does not establish any replication 
session anymore.

>From access on B:
I see replication sessions from A and C, established at 01:00:09 and closed at 
01:01:14.
Next replication session from C, established at 01:05:12 and closed at 
01:06:12. There was NO session from A.
A and C establish the next replication session at 01:07:52. During the session, 
a MOD is performed at 01:08:33. C closes this session at 01:09:35. The session 
from A stays open forever!!!
Replication sessions from C continue to work fine.

>From access on C:
I see replication sessions from A and B, established at 01:00:09 and closed at 
01:01:14.
I don't see any replication session from the 01:05 time-slot
A and B establish the next replication session at 01:07:53. During the session, 
a MOD is performed at 01:08:33. A closes this session at 01:09:33 and B closes 
session at 01:09:38
>From now on, I only see replication sessions from A!

Has anybody seen this kind of behavior? Any feedback is highly appreciated.

Thanks,
-Reinhard
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Reply via email to