Hi Steve,

Please comment on the same.
Regards,
Ranjith

On Sat, Sep 25, 2010 at 9:47 AM, Ranjith <ranjith.nath...@gmail.com> wrote:
> Hi Steve,
>
> Just to make it clear: do you mean that in the above case, if N3 is part
> of the network, it should have connectivity to both N2 and N1, and that
> if it happens that N3 has connectivity to N2 only, corosync does not
> handle this?
>
> Regards,
> Ranjith
>
> On Sat, Sep 25, 2010 at 9:39 AM, Steven Dake <sd...@redhat.com> wrote:
>> On 09/24/2010 08:20 PM, Ranjith wrote:
>>> Hi,
>>>> It is hard to tell what is happening without logs from all 3 nodes.
>>>> Does this only happen at system start, or can you duplicate it 5
>>>> minutes after the systems have started?
>>>
>>> The cluster is never stabilizing. It keeps switching between the
>>> membership and operational states.
>>> Below is the test network which I am using:
>>>
>>> [attachment: Untitled.png]
>>>
>>> N1 and N3 do not receive any packets from each other. What I expected
>>> was that either (N1,N2) or (N2,N3) forms a two-node cluster and
>>> stabilizes. But the cluster never stabilizes: even though 2-node
>>> clusters are forming, it keeps going back to membership. [I checked
>>> the logs, and this seems to be happening because of the steps I
>>> mentioned in the previous mail.]
>>
>> ...... Where did you say you were testing a byzantine fault in your
>> original bug report? Please be more forthcoming in the future.
>> Corosync does not protect against byzantine faults. Allowing one-way
>> connectivity in a network connection = this fault scenario. You can
>> try coro-netctl (the attached script), which will atomically block a
>> network IP in the network to test split-brain scenarios without
>> actually pulling network cables.
>>
>> Regards
>> -steve
>>
>>> Regards,
>>> Ranjith
>>>
>>> On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake <sd...@redhat.com> wrote:
>>>
>>>     It is hard to tell what is happening without logs from all 3 nodes.
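[Editor's note: coro-netctl itself is only attached to the original mail and is not reproduced in this thread. As a rough sketch of the approach Steve describes, the same effect can be had with plain iptables; the function name and the peer address below are illustrative only, must be run as root, and the real script may differ.]

```shell
#!/bin/sh
# Sketch (not the actual coro-netctl): block/unblock all traffic to a
# peer IP with iptables, simulating a split brain without pulling cables.
# PEER defaults to one of the example node addresses from this thread.
PEER=${PEER:-10.102.33.180}

netctl() {
  case "$1" in
    block)
      # Drop packets in BOTH directions so the fault stays symmetric.
      iptables -I INPUT  -s "$PEER" -j DROP
      iptables -I OUTPUT -d "$PEER" -j DROP
      ;;
    unblock)
      iptables -D INPUT  -s "$PEER" -j DROP
      iptables -D OUTPUT -d "$PEER" -j DROP
      ;;
    *)
      echo "usage: netctl {block|unblock}" >&2
      return 1
      ;;
  esac
}
```

Blocking both directions keeps the failure symmetric; a one-way block (e.g. only the INPUT rule) instead reproduces the byzantine scenario that corosync, per Steve's reply, does not protect against.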
>>>     Does this only happen at system start, or can you duplicate it 5
>>>     minutes after systems have started?
>>>
>>>     If it is at system start, you may need to enable "fast STP" on
>>>     your switch. It looks to me like node 3 gets some messages through
>>>     but then is blocked. STP will do this in its default state on most
>>>     switches.
>>>
>>>     Another option, if you can't enable STP, is to use broadcast mode
>>>     (man openais.conf for details).
>>>
>>>     Also verify firewalls are properly configured on all nodes. You
>>>     can join us on the irc server freenode on #linux-cluster for
>>>     real-time assistance.
>>>
>>>     Regards
>>>     -steve
>>>
>>>     On 09/22/2010 11:33 PM, Ranjith wrote:
>>>
>>>         Hi Steve,
>>>         I am running corosync 1.2.8.
>>>         I didn't get what you meant by blackbox. I suppose it is
>>>         logs/debugs. I just checked logs/debugs and I am able to
>>>         understand the below:
>>>
>>>         1--------------2--------------3
>>>
>>>         1) Node1 and Node2 are already in a 2-node cluster.
>>>         2) Now Node3 sends a join with ({1}, {})
>>>            (proc_list/fail_list).
>>>         3) Node2 sends join ({1,2,3}, {}) and Nodes 1/3 update to
>>>            ({1,2,3}, {}).
>>>         4) Now Node2 gets consensus after some messages [but 1 is the
>>>            rep].
>>>         5) Consensus timeout fires at Node1 for Node3; Node1 sends
>>>            join as ({1,2}, {3}).
>>>         6) Node2 updates to ({1,2}, {3}) because of the above message
>>>            and sends out a join. This join, received by Node3, causes
>>>            it to update to ({1,3}, {2}).
>>>         7) Node1 and Node2 enter operational (fail list cleared by
>>>            Node2), but Node3's join timeout fires and it is back in
>>>            the membership state.
>>>         8) This continues to happen until consensus fires at Node3
>>>            for Node1 and it moves to ({3}, {1,2}).
>>>         9) Now Node1 and Node2 form a 2-node cluster and Node3 forms
>>>            a single-node cluster.
>>>         10) Now Node2 broadcasts a normal message.
>>>         11) This message is received by Node3 as a foreign message,
>>>             which forces it to go to the gather state.
>>>         12) Again the above steps ....
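[Editor's note: the broadcast mode Steve suggests is selected per interface in the totem section of openais.conf. A hedged sketch of the relevant fragment follows; the subnet address is an example taken from the node IPs in this thread, and man openais.conf / corosync.conf remains the authoritative reference for directive names and defaults.]

```
totem {
    version: 2
    interface {
        ringnumber: 0
        # Bind to the test subnet used in this thread (example address).
        bindnetaddr: 10.102.33.0
        # Use broadcast instead of multicast; when broadcast is enabled,
        # mcastaddr should not be set.
        broadcast: yes
        mcastport: 5405
    }
}
```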
>>>         The cluster is never stabilizing.
>>>         I have attached the debugs for Node2:
>>>         (1 - 10.102.33.115, 2 - 10.102.33.150, 3 - 10.102.33.180)
>>>
>>>         Regards,
>>>         Ranjith
>>>
>>>         On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake
>>>         <sd...@redhat.com> wrote:
>>>
>>>             On 09/21/2010 11:15 PM, Ranjith wrote:
>>>
>>>                 Hi all,
>>>                 Kindly comment on the above behaviour.
>>>                 Regards,
>>>                 Ranjith
>>>
>>>                 On Tue, Sep 21, 2010 at 9:52 PM, Ranjith
>>>                 <ranjith.nath...@gmail.com> wrote:
>>>
>>>                     Hi all,
>>>                     I was testing the corosync cluster engine by
>>>                     using the testcpg exec provided along with the
>>>                     release. I am getting the below behaviour while
>>>                     testing some specific scenarios. Kindly comment
>>>                     on the expected behaviour.
>>>                     1) 3-node cluster
>>>                        1---------2---------3
>>>                        a) Suppose I bring nodes 1 & 2 up; they form
>>>                           a ring (1,2).
>>>                        b) Now bring up 3.
>>>                        c) 3 sends a join, which restarts the
>>>                           membership process.
>>>                        d) (1,2) again forms the ring; 3 forms a
>>>                           cluster by itself.
>>>                        e) Now 3 sends a join (due to join or another
>>>                           timeout).
>>>                        f) Again the membership protocol is started,
>>>                           as 2 responds to this by going to the
>>>                           gather state (I believe 2 should not accept
>>>                           this, as 2 would have earlier decided that
>>>                           3 had failed).
>>>                        I am seeing a continuous loop of the above
>>>                        behaviour (operational -> membership ->
>>>                        operational -> ...) due to which the cluster
>>>                        is not becoming stabilized.
>>>                     2) 3-node cluster
>>>                        1---------2-----------3
>>>                        a) Bring up all three nodes at the same time
>>>                           (none of the nodes have seen each other
>>>                           before this).
>>>                        b) Now each node forms a cluster by itself.
>>>                           (Here I think it should form either a
>>>                           (1,2) or a (2,3) ring.)
>>>                     Regards,
>>>                     Ranjith
>>>
>>>             Ranjith,
>>>
>>>             Which version of corosync are you running?
>>>
>>>             Can you run corosync-blackbox and attach the output?
>>>
>>>             Thanks
>>>             -steve
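[Editor's note: both scenarios above reduce to the same observation: totem's membership protocol needs full pairwise connectivity within a candidate ring before consensus can complete. A small Python sketch of that invariant, with node numbers and links taken from the diagrams in this thread; this models only the connectivity requirement, not the actual totem state machine.]

```python
# Model the thread's topology: N1--N2--N3, with no N1<->N3 link.
links = {(1, 2), (2, 3)}  # symmetric reachability; (1, 3) is absent

def connected(a, b):
    """True if nodes a and b can exchange packets (or are the same node)."""
    return a == b or (a, b) in links or (b, a) in links

def can_form_ring(nodes):
    # Totem consensus requires every prospective member to see join
    # messages from every other member, i.e. full pairwise connectivity.
    return all(connected(a, b) for a in nodes for b in nodes)

print(can_form_ring({1, 2}))     # True: a stable 2-node ring is possible
print(can_form_ring({2, 3}))     # True
print(can_form_ring({1, 2, 3}))  # False: N1 and N3 never hear each other,
                                 # so the 3-node attempt keeps falling
                                 # back to the gather/membership state
```

This is why the reporter's expectation of a stable (1,2) or (2,3) ring is reasonable, while any attempt to include all three nodes restarts the membership protocol.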
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais