Re: Please test RE24

Rein Tollevik Fri, 13 Mar 2009 08:08:53 -0700

Howard Chu wrote:

Howard Chu wrote:
In at least one case I'm seeing a valid update being rejected because the
incoming cookie seems to have been confused with another one. This happens
when a NEW_COOKIE message is received. I'll note that sending NEW_COOKIE
messages is a recent change (ITS#5972), and there is no valid case for them
to be occurring in test050. I.e., NEW_COOKIE should be sent in a partial
replication situation, where an entry changed in the naming context but it's
not within the consumer's scope of interest. In test050, the consumer's scope
of interest is the entire naming context. So this at least gives me one area
to look for a fix.

I agree, in an MMR configuration NEW_COOKIE messages should not have been sent, except possibly when the entire CSN set is updated at the end of a refresh phase. But it looks more and more to me as if the fact that test050 does show these messages is a symptom of some entry updates being ignored by syncprov, or not passed to syncprov by syncrepl.

This piece of the ITS#5972 patch is part of the problem
--- syncprov.c    5 Mar 2009 16:53:01 -0000    1.266
+++ syncprov.c    12 Mar 2009 08:42:54 -0000
@@ -1245,7 +1245,7 @@
         } else if ( !saveit && found ) {
             /* send DELETE */
             syncprov_qresp( opc, ss, LDAP_SYNC_DELETE );
-        } else if ( !saveit ) {
+        } else if ( !saveit && !fc.fscope ) {
             syncprov_qresp( opc, ss, LDAP_SYNC_NEW_COOKIE );
         }
         if ( !saveit && found ) {
My diff above is also not the correct fix, which is why I haven't committed it yet.
The current operation may not have been caught by the previous if conditions for 3 reasons:
    1) the change is out of the consumer's scope
    2) the change doesn't match the consumer's filter
    3) the change is older than the consumer's cookie
The NEW_COOKIE message must only be sent for conditions 1 and 2, but it's currently also being sent for 3. Since the cookie comparison is tacked onto the consumer's filter, an additional comparison is needed to weed this out. (Normally 3 can't be true, but this is MMR where the consumer might have already received this change from some other provider.)

Syncprov generally doesn't know the exact state of its consumers in MMR configurations, since the consumers' CSNs could have been updated by one of the other providers. So, the NEW_COOKIE messages should be sent in all three cases, leaving the job of filtering out the too old CSNs to the one that has enough information to do so, namely the consumer.
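
To make the two positions concrete, here is a minimal sketch of the decision being argued about. The enum and function names are hypothetical and are not taken from syncprov.c; the sketch only encodes the three reasons listed above and the two proposed policies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three reasons an operation may fall
 * through the earlier if conditions. */
typedef enum {
    MISS_OUT_OF_SCOPE,  /* 1) change is out of the consumer's scope */
    MISS_FILTER,        /* 2) change doesn't match the consumer's filter */
    MISS_OLD_CSN        /* 3) change is older than the consumer's cookie */
} miss_reason;

/* Policy proposed in the quoted text: NEW_COOKIE only for reasons 1 and 2. */
static bool new_cookie_strict(miss_reason r)
{
    return r == MISS_OUT_OF_SCOPE || r == MISS_FILTER;
}

/* Policy proposed in the reply: send NEW_COOKIE in all three cases and
 * let the consumer discard CSNs it already has. */
static bool new_cookie_always(miss_reason r)
{
    assert((int)r >= 0 && (int)r <= (int)MISS_OLD_CSN);
    return true;
}
```

The two policies differ only for reason 3, which is exactly the case where the provider cannot be sure what the consumer already holds.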

I haven't looked yet, but I suspect there is a corresponding bug in the consumer where it acts on a NEW_COOKIE message whether it's valid or not.

No, the consumer silently ignores updates to CSN values older than (or equal to) the values it already knows about.
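
As an illustration of that consumer-side check, here is a rough sketch. It is not syncrepl's actual code, and the function names are made up. It assumes the usual CSN layout of timestamp#count#SID#mod with fixed-width fields, so that a plain string comparison orders two CSNs carrying the same SID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the 3-character SID field of a CSN,
 * or NULL if the string does not have the assumed layout. */
static const char *csn_sid(const char *csn)
{
    const char *p = strchr(csn, '#');       /* end of the timestamp */
    if (p == NULL)
        return NULL;
    p = strchr(p + 1, '#');                 /* end of the change count */
    if (p == NULL || strlen(p + 1) < 3)
        return NULL;
    return p + 1;
}

/* Return true if `incoming` is newer than the CSN already known for the
 * same SID, or if nothing is known for that SID yet. A false result
 * means the CSN is older or equal and the update should be ignored. */
static bool csn_is_news(const char *const known[], size_t nknown,
                        const char *incoming)
{
    const char *sid = csn_sid(incoming);
    assert(sid != NULL);
    for (size_t i = 0; i < nknown; i++) {
        const char *ksid = csn_sid(known[i]);
        if (ksid != NULL && strncmp(ksid, sid, 3) == 0)
            return strcmp(incoming, known[i]) > 0;
    }
    return true;
}
```

With a check of this shape on the consumer, a NEW_COOKIE carrying an old or already known CSN is harmless, which is why the provider can afford to send it in all three cases.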

I'm also inclined to back out #5972 and its related patches (#5973, #6001) for this release. We were looking for bug fixes and stability, and they've been quite destabilizing.

To me it looks more as if the extended test050 has triggered race conditions that were already there, and that especially the syncprov half of ITS#5973 has made them more likely to show up.

I have run the current test050 script with the 2.4.15 source (which didn't include these patches), and with RE24 (as of two days ago) without ITS#5973, and have seen the same type of failures there. Also, had the problems been triggered by the consumers receiving NEW_COOKIE messages, then I would have expected to see "too old" messages on the consumers when they ignore entries. Instead, I find no trace of the missing entries ever being passed on from the provider. But I haven't yet found out where the update is lost. The problem seems to occur when the server where entries are missing receives its updates from one of the other consumers (i.e., not directly from server1). But whether it is syncrepl on this intermediate server that fails to pass it on to syncprov, or syncprov that loses them, I don't know.

Also, I now have around 30 core files similar to the one in ITS#5999, and I have also had a number of cases where I had to kill -9 a slapd running in a tight unlock, yield, lock loop at the same place in syncprov_op_mod(). These loops have all happened when slapd should be stopping, and the mt structure looks just as invalid as in the segfault cases. I have no idea whether this has anything to do with the test050 failures or not.

Btw, all of the test050 failures I have seen due to missing replications have taken place immediately after the initial loading of the consumers from server1. This could be a coincidence, but I have had enough of them to start wondering...


Rein
