On 4/22/2010 7:33 PM, Jaime Casanova wrote:
> On Thu, Apr 22, 2010 at 12:58 AM, Jan Wieck <[email protected]> wrote:
>
>> You may be able to fix things by reinserting that sl_subscribe row with
>> sub_active = false, then restart the slon for node 2 and see how far that
>> gets you.
>>
>
> yes, that makes receiver start accepting events again... it's trying
> to get upto date now...
> thanx for your help...

Jaime was kind enough to provide me with a dump of the slony schema of
node 2, and we were able to figure out completely what happened.

The whole mess was started by using direct DDL against a subscriber
under Slony 1.2.x. The attempted fix for this was to drop the table from
the replication set via SET DROP TABLE, fix the table definitions and
resubscribe it via a temp set. The subscription failed because of an
inconsistency between the system catalog and the slony catalog on the
subscriber.

The exact steps after that are not 100% clear to me yet, but I think I
understand them well enough to be able to reproduce them later down the
road.

The SUBSCRIBE SET is actually a two-step operation. In the first step,
the SUBSCRIBE_SET event causes the new subscriber and everyone in the
path to create the sl_subscribe row, which causes all data forwarders to
keep replication data until the new subscriber has confirmed it. The
second step is an internal event, ENABLE_SUBSCRIPTION, that is generated
automatically by the origin of the set and that kicks off the actual
copy_set() call. That copy_set() failed due to the catalog inconsistency.

What Jaime tried then was an UNSUBSCRIBE SET, which slonik issued
against the half-subscribed node 2, deleting the sl_subscribe row.

The code in copy_set() doesn't use the parameters from the event, but
expects the in-memory runtime configuration data to know the data
provider for the set. Since the sl_subscribe row is gone now, that
information is missing, and the -1 is simply the default value for a set
that the node isn't subscribed to.

I don't know exactly what the right fix for this bug is. My first gut
feeling is to ignore the ENABLE_SUBSCRIPTION and generate another
UNSUBSCRIBE_SET event just to clear out any sl_subscribe row existing in
the cluster. Since I am in Toronto right now, I can discuss this with
Steve Singer tomorrow morning.

Thank you Jaime.
Your patience on this matter helped to track down a very nasty bug
that apparently had been lingering in the system for a long time.


Jan

-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
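
For readers following the thread, the failure mode described above can be sketched as a toy model. This is Python, not Slony's actual C code, and every name in it (ToyNode, provider_for, NO_PROVIDER and so on) is hypothetical. It only assumes what the message states: the copy step looks up the data provider in runtime state rather than in the event, and -1 stands for "not subscribed".

```python
from dataclasses import dataclass, field

# Assumed sentinel: the default provider for a set the node is not subscribed to.
NO_PROVIDER = -1


@dataclass
class ToyNode:
    """Toy stand-in for a subscriber node; not Slony's real data structures."""
    node_id: int
    # set_id -> provider node id, standing in for what sl_subscribe rows supply
    providers: dict = field(default_factory=dict)

    def subscribe_set(self, set_id, provider):
        # Step 1: SUBSCRIBE_SET records the subscription and its provider.
        self.providers[set_id] = provider

    def unsubscribe_set(self, set_id):
        # UNSUBSCRIBE SET removes the row, so the provider is forgotten.
        self.providers.pop(set_id, None)

    def provider_for(self, set_id):
        # Runtime lookup; falls back to the "not subscribed" default.
        return self.providers.get(set_id, NO_PROVIDER)

    def enable_subscription(self, set_id):
        # Step 2: ENABLE_SUBSCRIPTION starts the copy. The provider is taken
        # from runtime state, not from the event itself.
        provider = self.provider_for(set_id)
        if provider == NO_PROVIDER:
            raise RuntimeError(
                f"set {set_id}: no data provider known (got {provider})"
            )
        return f"copying set {set_id} from node {provider}"
```

Calling subscribe_set() and then enable_subscription() succeeds in this model. Calling unsubscribe_set() in between makes the second step raise, because the lookup falls back to -1. That mirrors the sequence described above, where the UNSUBSCRIBE SET removed the row that a later copy attempt still depended on.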
