On Fri, Aug 31, 2012 at 08:59:56AM -0400, Steve Singer wrote:
> On 12-08-31 04:16 AM, Knut Ingvald Dietzel wrote:
[cut]
> >  From what I have been able to find out so far, slonik should wait for
> > the slon engine to restart, and then call failedNode2() on the node with
> > the highest SYNC.  From the log above, though, failedNode2() appears
> > to be called twice; the second call fails to get the lock, and the
> > failover from node 1 to node 4 fails.
> >
> > Firstly, is my interpretation in the vicinity of being correct?
> 
> When Slonik (<=2.1.x) does a fail over it generates a 'fake'
> FAILOVER event using ev_origin=$failed_node with the highest
> sequence number it can see of that failed node.  It pushes this
> event into sl_event on one of the remaining nodes.  In the test case
> you describe it sounds like that slon is still running on the failed
> node.  Slony <=2.1.x has numerous race conditions with failover.  One
> of the ones I've seen is where a 'real' SYNC event, e.g. 1,1234, that
> escaped from the failed node conflicts with the faked FAILOVER
> event 1,1234.

Hi, Steve.

Thanks for the insight, and your explanation sounds very reasonable.
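
For anyone following along, the race described above can be sketched
roughly as below.  This is only a toy model, not Slony code: the
EventLog class, the simulate() function and the "highest seen + 1"
choice of sequence number are all assumptions made for illustration,
and the only thing taken from the explanation is that a real SYNC and
the faked FAILOVER can end up with the same (ev_origin, ev_seqno) pair.

```python
class DuplicateEvent(Exception):
    """Raised when an (origin, seqno) pair is already present."""


class EventLog:
    """Toy stand-in for sl_event on one surviving node.

    Assumption: events are unique per (ev_origin, ev_seqno).
    """

    def __init__(self):
        self.events = {}

    def insert(self, origin, seqno, ev_type):
        key = (origin, seqno)
        if key in self.events:
            raise DuplicateEvent(f"{key} already holds {self.events[key]}")
        self.events[key] = ev_type

    def highest_seqno(self, origin):
        return max((s for (o, s) in self.events if o == origin), default=0)


def simulate(real_sync_escapes_first):
    """Return "ok" or "conflict" for one hypothetical failover attempt."""
    log = EventLog()
    failed_node = 1
    log.insert(failed_node, 1233, "SYNC")

    # The failover tool looks at the highest event it can see from the
    # failed node and picks the next number (an assumption of this sketch).
    fake_seqno = log.highest_seqno(failed_node) + 1

    if real_sync_escapes_first:
        # slon is still running on the failed node, so a real SYNC with
        # the same number arrives before the faked event is inserted.
        log.insert(failed_node, 1234, "SYNC")

    try:
        log.insert(failed_node, fake_seqno, "FAILOVER")
        return "ok"
    except DuplicateEvent:
        return "conflict"
```

In this model simulate(False) goes through, while simulate(True) hits
the duplicate (1, 1234) pair, which is the kind of collision a still
running slon on the failed node could cause.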

> I rewrote a lot of the failover logic in 2.2 to try to address many
> of these issues.  It should do a much better job at waiting for
> slons to restart etc...  2.2 is still beta and I wouldn't recommend
> it for production use yet but I encourage you to look at it to see
> if it addresses your issues.

That's very good to hear. We'll look into the possibility of testing
the 2.2 beta.

Again, thanks!


-- 
Best regards,
Knut Ingvald Dietzel
