Howard Chu wrote:
Howard Chu wrote:
In at least one case I'm seeing a valid update being rejected because the
incoming cookie seems to have been confused with another one. This happens
when a NEW_COOKIE message is received. I'll note that sending NEW_COOKIE
messages is a recent change (ITS#5972), and there is no valid case for them
to be occurring in test050. I.e., NEW_COOKIE should be sent in a partial
replication situation, where an entry changed in the naming context but it's
not within the consumer's scope of interest. In test050, the consumer's scope
of interest is the entire naming context. So this at least gives me one area
to look for a fix.
I agree, in a MMR configuration NEW_COOKIE messages should not have been
sent, except possibly when the entire csn set is updated at the end of a
refresh phase. But it looks more and more to me as if the fact that
test050 does show these messages is a symptom of some entry updates being
ignored by syncprov, or not passed to syncprov by syncrepl.
This piece of the ITS#5972 patch is part of the problem:
--- syncprov.c 5 Mar 2009 16:53:01 -0000 1.266
+++ syncprov.c 12 Mar 2009 08:42:54 -0000
@@ -1245,7 +1245,7 @@
} else if ( !saveit && found ) {
/* send DELETE */
syncprov_qresp( opc, ss, LDAP_SYNC_DELETE );
- } else if ( !saveit ) {
+ } else if ( !saveit && !fc.fscope ) {
syncprov_qresp( opc, ss, LDAP_SYNC_NEW_COOKIE );
}
if ( !saveit && found ) {
My diff above is also not the correct fix, which is why I haven't
committed it yet.
The current operation may not have been caught by the previous if
conditions for 3 reasons:
1) the change is out of the consumer's scope
2) the change doesn't match the consumer's filter
3) the change is older than the consumer's cookie
The NEW_COOKIE message must only be sent for conditions 1 and 2, but
it's currently also being sent for 3. Since the cookie comparison is
tacked onto the consumer's filter, an additional comparison is needed to
weed this out.
(Normally 3 can't be true, but this is MMR where the consumer might have
already received this change from some other provider.)
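
To make the rule in the preceding paragraph concrete, here is a minimal sketch in C of the decision it describes. It is not the syncprov.c code and not a proposed fix; the names `sync_msg`, `change_check`, and `unmatched_change_response` are hypothetical, and the sketch assumes the three conditions can be evaluated separately, which is exactly what the current code does not do since the cookie comparison is folded into the filter.

```c
#include <stdbool.h>

/* Hypothetical types for illustration only; these are not the
 * structures used in syncprov.c. */
enum sync_msg { MSG_NONE, MSG_NEW_COOKIE };

struct change_check {
    bool in_scope;          /* change lies within the consumer's scope */
    bool matches_filter;    /* change matches the consumer's own filter */
    bool newer_than_cookie; /* change CSN is newer than the consumer's cookie */
};

/* Response for a change that produced no ADD/MODIFY/DELETE for this
 * consumer, following the rule "NEW_COOKIE for reasons 1 and 2 only". */
static enum sync_msg unmatched_change_response(const struct change_check *c)
{
    /* Reason 3: the consumer's cookie already covers this change,
     * so there is nothing new to tell it. */
    if (!c->newer_than_cookie)
        return MSG_NONE;

    /* Reasons 1 and 2: the change is new but uninteresting to this
     * consumer, so only the cookie needs to advance. */
    if (!c->in_scope || !c->matches_filter)
        return MSG_NEW_COOKIE;

    /* New, in scope, and matching: an earlier branch should have
     * handled it, so nothing is sent from here. */
    return MSG_NONE;
}
```

The point of the separate `newer_than_cookie` test is that a filter mismatch alone cannot tell "not interesting" apart from "already seen".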
Syncprov generally doesn't know the exact state of its consumers in MMR
configurations, since the consumers' CSNs could have been updated by one
of the other providers. So, the NEW_COOKIE messages should be sent in
all three cases, leaving the job of filtering out the too old CSNs to
the one that has enough information to do so, namely the consumer.
I haven't looked yet, but I suspect there is a corresponding bug in the
consumer where it acts on a NEW_COOKIE message whether it's valid or not.
No, the consumer silently ignores updates to CSN values older than (or
equal to) the values it already knows about.
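
As a rough illustration of that consumer-side check (a sketch only, not the syncrepl code): assuming both CSNs come from the same server ID and use the usual fixed-width textual form, a plain string comparison orders them. The function name `csn_is_new` is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true when `incoming` should replace the
 * CSN the consumer already holds for the same server ID.
 * Assumes fixed-width CSN strings, so byte-wise order equals age order. */
static bool csn_is_new(const char *known, const char *incoming)
{
    if (known == NULL)
        return true;                      /* nothing recorded for this SID yet */
    return strcmp(incoming, known) > 0;   /* older or equal: ignore silently */
}
```

With this rule an equal CSN is treated the same as an older one, matching the "older than (or equal to)" wording above.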
I'm also inclined to back out #5972 and its related patches (#5973, #6001)
for this release. We were looking for bug fixes and stability, and they've
been quite destabilizing.
To me it looks more as if the extended test050 has triggered race
conditions that were already there, and that especially the syncprov half
of ITS#5973 has made them more likely to show up.
I have run the current test050 script with the 2.4.15 source (which
didn't include these patches), and with RE24 (as of two days ago)
without ITS#5973, and have seen the same type of failures there. Also,
had the problems been triggered by the consumers receiving NEW_COOKIE
messages then I would have expected to see "too old" messages on the
consumers when they ignore entries. Instead, I find no trace of the
missing entries ever being passed on from the provider. But where the
update is lost I haven't found out yet. The problem seems to occur when
the server where entries are missing receives its updates from one of
the other consumers (i.e., not directly from server1). But whether it is
syncrepl on this intermediate server that fails to pass them on to
syncprov, or syncprov that loses them, I don't know.
Also, I now have around 30 core files similar to the one in ITS#5999,
and I have also had a number of cases where I had to kill -9 a slapd
running in a tight unlock, yield, lock loop at the same place in
syncprov_op_mod(). These loops have all happened when slapd should be
stopping, and the mt structure looks just as invalid as in the segfault
cases. I have no idea as to whether this has anything to do with
the test050 failures or not.
Btw, all of the test050 failures I have seen due to missing replications
have taken place immediately after the initial loading of the consumers
from server1. This could be a coincidence, but I have had enough of them
to start wondering...
Rein