On Wed, Jul 01, 2009 at 06:21:14PM +0200, Jan Friesse wrote:
> The included patch should fix
> https://bugzilla.redhat.com/show_bug.cgi?id=506255 .
>
> David, I hope it will fix the problem for you.
>
> It's based on the simple idea of adding the node startup timestamp at the
> end of cpg_join (and joinlist) calls. If the timestamp is larger than the
> old timestamp we know, the node was restarted and we didn't notice ->
> deliver a leave event and then a join event. If the timestamp is the same
> (or in special cases lower) -> a new cpg app joined -> send only a join
> event.
>
> Of course, the patch isn't so simple. Cpg_join messages are always sent as
> larger messages with a timestamp (btw. the timestamp is a 64-bit value,
> because I expect a l(o^64)ng life of corosync ;) ). On delivery, we test
> whether the message is larger than a standard message. If it is -> we have
> a ts -> use it.
>
> A bigger problem was joinlist, because it's an array, ... you will see in
> the source. The solution is to send a special entry with pid 0 (a process
> should never have pid 0), and the timestamp encoded in the name
> (ugly, but it looks like it works).
>
> Please comment, if you can.
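
For reference, the timestamp comparison described above could be sketched roughly as follows. This is only an illustration of the decision rule, not code from the patch: the names classify_join, join_action, JOIN_ONLY and LEAVE_THEN_JOIN are hypothetical, and the handling of a node with no previously recorded timestamp (deliver only a join) is an assumption, since the description does not cover that case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the corosync source. */
enum join_action {
    JOIN_ONLY,        /* deliver only a join event */
    LEAVE_THEN_JOIN   /* deliver a leave event, then a join event */
};

/*
 * Decide which events to deliver when a join message arrives.
 *
 * have_known: whether a startup timestamp was already recorded for the node
 *             (assumption: with none recorded, we treat it as a plain join).
 * known_ts:   the previously recorded 64-bit startup timestamp.
 * msg_ts:     the startup timestamp carried in the join message.
 */
static enum join_action classify_join(bool have_known,
                                      uint64_t known_ts,
                                      uint64_t msg_ts)
{
    /* Larger timestamp: the node restarted without us noticing. */
    if (have_known && msg_ts > known_ts) {
        return LEAVE_THEN_JOIN;
    }

    /* Same (or, in special cases, lower) timestamp: a new cpg app joined. */
    return JOIN_ONLY;
}
```

A caller would store msg_ts as the node's known timestamp after handling the message, so that later joins from the same boot compare as equal.
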
This isn't specifically a cpg bug/problem; it's a problem with
corosync/openais in general. When a node joins the cluster before the
others have recognized that it failed, the other nodes should immediately
recognize that it had previously failed and process a complete failure for it.

Dave