On Fri, Aug 31, 2012 at 08:59:56AM -0400, Steve Singer wrote:
> On 12-08-31 04:16 AM, Knut Ingvald Dietzel wrote:

[cut]

> > From what I have been able to find out so far, slonik should wait for
> > the slon engine to restart, and then call failedNode2() on the node with
> > the highest SYNC. Though, from the log above failedNode2() appears to
> > be called twice, the second instance fails in getting lock, and the
> > process of failing over node 1 to 4 fails.
> >
> > Firstly, is my interpretation in the vicinity of being correct?
>
> When Slonik (<=2.1.x) does a fail over it generates a 'fake'
> FAILOVER event using a ev_origin=$failed_node with the highest
> sequence number it can see of that failed node. It pushes this
> event into sl_event on one of the remaining nodes. In the test case
> you describe it sounds like that slon is still running on the failed
> node. Slony <=2.1.x have numerous race conditions with failover one
> of the ones I've seen is where a 'real' SYNC event ie 1,1234 that
> escaped from the failed node can conflict with the faked FAILOVER
> event 1,1234.

Hi, Steve.

Thanks for the insight, and your explanation sounds very reasonable.

> I rewrote a lot of the failover logic in 2.2 to try to address many
> of these issues. It should do a much better job at waiting for
> slons to restart etc... 2.2 is still beta and I wouldn't recommend
> it for production use yet but I encourage you to look at it to see
> if it addresses your issues.

That's very good to hear. We'll look into possibilities of testing the
2.2b version.

Again, thanks!

-- 
Best regards,
Knut Ingvald Dietzel
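
The race Steve describes, where a 'real' SYNC and the faked FAILOVER event end up with the same origin and sequence number, can be illustrated with a toy model. This is only a sketch and not Slony code: the EventLog class, its methods, and EventConflict are hypothetical names made up for illustration, and the assumption that an event is identified by the pair (origin, sequence number) is inferred from the "1,1234" notation in the thread.

```python
# Toy model of the collision described in the thread. NOT Slony code:
# EventLog, EventConflict and their methods are hypothetical, and keying
# events by (origin, seqno) is an assumption based on the "1,1234" notation.


class EventConflict(Exception):
    """Raised when two events claim the same (origin, seqno) pair."""


class EventLog:
    """Stand-in for the event table on one surviving node."""

    def __init__(self):
        self._events = {}  # (origin, seqno) -> event type

    def insert(self, origin, seqno, ev_type):
        key = (origin, seqno)
        if key in self._events:
            raise EventConflict(
                f"event {origin},{seqno} already stored as "
                f"{self._events[key]}, cannot store {ev_type}"
            )
        self._events[key] = ev_type

    def get(self, origin, seqno):
        return self._events.get((origin, seqno))


def simulate_race():
    """Return True if the real SYNC collides with the faked FAILOVER."""
    log = EventLog()
    # slonik pushes the faked FAILOVER event for failed node 1.
    log.insert(1, 1234, "FAILOVER")
    try:
        # A real SYNC with the same id escapes from the failed node,
        # whose slon is still running.
        log.insert(1, 1234, "SYNC")
    except EventConflict:
        return True
    return False


if __name__ == "__main__":
    print("conflict detected:", simulate_race())
```

In this model the second insert is rejected outright; how a real 2.1.x cluster reacts to such a pair of events is not something the sketch tries to reproduce.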
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general
