Hi Steve,

Please comment on the same.

Regards,
Ranjith

On Sat, Sep 25, 2010 at 9:47 AM, Ranjith <ranjith.nath...@gmail.com> wrote:

> Hi Steve,
>
> Just to make it clear: do you mean that, in the above case, if N3 is part
> of the network it should have connectivity to both N1 and N2, and that if
> N3 happens to have connectivity to N2 only, corosync does not handle that
> case?
>
> Regards,
> Ranjith
>   On Sat, Sep 25, 2010 at 9:39 AM, Steven Dake <sd...@redhat.com> wrote:
>
>> On 09/24/2010 08:20 PM, Ranjith wrote:
>>
>>> Hi,
>>>
>>>> It is hard to tell what is happening without logs from all 3 nodes.
>>>> Does this only happen at system start, or can you duplicate 5 minutes
>>>> after systems have started?
>>>
>>> The cluster is never stabilizing. It keeps switching between the
>>> membership and operational states.
>>> Below is the test network I am using:
>>>
>>> [attached network diagram: Untitled.png]
>>>
>>> N1 and N3 do not receive any packets from each other. What I expected
>>> was that either (N1, N2) or (N2, N3) would form a two-node cluster and
>>> stabilize. But the cluster never stabilizes: even though two-node
>>> clusters are forming, it keeps going back to membership [I checked the
>>> logs, and this seems to be happening because of the steps I mentioned in
>>> the previous mail].
>>>
>>>
>>>
>> Where did you say you were testing a byzantine fault in your original bug
>> report?  Please be more forthcoming in the future.  Corosync does not
>> protect against byzantine faults, and allowing one-way connectivity on a
>> network link is exactly this fault scenario.  You can try coro-netctl (the
>> attached script), which will atomically block a network IP so you can test
>> split-brain scenarios without actually pulling network cables.
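>>
>> For illustration only (this is NOT the attached coro-netctl script, just a
>> rough sketch assuming root and a working iptables; the peer address is an
>> example taken from your node list), one-way blocking versus a clean split
>> could look something like this:
>>
>>     #!/usr/bin/env python
>>     # Illustrative sketch only -- not the attached coro-netctl script.
>>     # Assumes root privileges and iptables; the peer IP is an example.
>>     import subprocess
>>     import sys
>>
>>     PEER = "10.102.33.180"  # example peer address from this thread
>>
>>     def iptables(*args):
>>         # Run one iptables command, failing loudly on error.
>>         subprocess.check_call(["iptables"] + list(args))
>>
>>     def block_one_way(peer):
>>         # Byzantine-style fault: we keep sending to the peer but silently
>>         # drop everything the peer sends to us.
>>         iptables("-A", "INPUT", "-s", peer, "-j", "DROP")
>>
>>     def block_both_ways(peer):
>>         # Clean split-brain: no traffic flows in either direction.
>>         iptables("-A", "INPUT", "-s", peer, "-j", "DROP")
>>         iptables("-A", "OUTPUT", "-d", peer, "-j", "DROP")
>>
>>     def unblock(peer):
>>         # Remove whichever of the above rules exist (ignore missing ones).
>>         subprocess.call(["iptables", "-D", "INPUT", "-s", peer, "-j", "DROP"])
>>         subprocess.call(["iptables", "-D", "OUTPUT", "-d", peer, "-j", "DROP"])
>>
>>     if __name__ == "__main__":
>>         action = sys.argv[1] if len(sys.argv) > 1 else "oneway"
>>         {"oneway": block_one_way,
>>          "split": block_both_ways,
>>          "unblock": unblock}[action](PEER)
>>
>> The "split" variant is the supported split-brain test; the "oneway"
>> variant reproduces the byzantine case described above, which corosync
>> does not protect against.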
>>
>> Regards
>> -steve
>>
>>
>>> Regards,
>>> Ranjith
>>> On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake <sd...@redhat.com> wrote:
>>>
>>>    It is hard to tell what is happening without logs from all 3 nodes.
>>>    Does this only happen at system start, or can you duplicate 5
>>>    minutes after systems have started?
>>>
>>>    If it is at system start, you may need to enable "fast STP" on your
>>>    switch.  It looks to me like node 3 gets some messages through but
>>>    then is blocked.  STP will do this in its default state on most
>>>    switches.
>>>
>>>    Another option if you can't enable STP is to use broadcast mode (man
>>>    openais.conf for details).
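>>>
>>>    Roughly, and only as a sketch (the exact directive names can differ by
>>>    version, so check man openais.conf; the bind network below is just a
>>>    placeholder for your subnet), the interface stanza would look
>>>    something like:
>>>
>>>        totem {
>>>            version: 2
>>>            interface {
>>>                ringnumber: 0
>>>                # placeholder: the network your cluster interfaces live on
>>>                bindnetaddr: 10.102.33.0
>>>                # use broadcast instead of multicast; mcastaddr is omitted
>>>                broadcast: yes
>>>                mcastport: 5405
>>>            }
>>>        }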
>>>
>>>    Also verify firewalls are properly configured on all nodes.  You can
>>>    join us on the irc server freenode on #linux-cluster for real-time
>>>    assistance.
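>>>
>>>    As a quick sketch of the firewall check (assuming iptables; totem
>>>    traffic is UDP and, with the usual mcastport of 5405, uses ports 5405
>>>    and 5404, so both need to be open between the nodes):
>>>
>>>        import subprocess
>>>
>>>        # Insert ACCEPT rules ahead of any existing REJECT/DROP rules so
>>>        # corosync's UDP traffic is allowed on every node.
>>>        for port in ("5404", "5405"):
>>>            subprocess.check_call(["iptables", "-I", "INPUT", "-p", "udp",
>>>                                   "--dport", port, "-j", "ACCEPT"])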
>>>
>>>    Regards
>>>    -steve
>>>
>>>
>>>    On 09/22/2010 11:33 PM, Ranjith wrote:
>>>
>>>        Hi Steve,
>>>          I am running corosync 1.2.8.
>>>          I didn't get what you meant by blackbox; I suppose it is the
>>>          logs/debugs.
>>>          I just went through the logs/debugs and was able to work out the
>>>          sequence below:
>>>                                           1--------------2--------------3
>>>        1) Node 1 and Node 2 are already in a 2-node cluster
>>>        2) Now Node 3 sends a join with ({1}, {}) (proc_list/fail_list)
>>>        3) Node 2 sends join ({1,2,3}, {}) and Nodes 1/3 update to
>>>           ({1,2,3}, {})
>>>        4) Now Node 2 gets consensus after some messages [but 1 is the rep]
>>>        5) The consensus timeout fires at Node 1 for Node 3, and Node 1
>>>           sends a join as ({1,2}, {3})
>>>        6) Node 2 updates because of the above message to ({1,2}, {3}) and
>>>           sends out a join. This join, received by Node 3, causes it to
>>>           update to ({1,3}, {2})
>>>        7) Node 1 and Node 2 enter operational (fail list cleared by
>>>           Node 2), but Node 3's join timeout fires and it is back in the
>>>           membership state
>>>        8) This continues to happen until consensus fires at Node 3 for
>>>           Node 1 and it moves to ({3}, {1,2})
>>>        9) Now Node 1 and Node 2 form a 2-node cluster and Node 3 forms a
>>>           single-node cluster
>>>        10) Now Node 2 broadcasts a normal message
>>>        11) This message is received by Node 3 as a foreign message, which
>>>            forces it to go to the gather state
>>>        12) The above steps repeat again ....
>>>        The cluster is never stabilizing.
>>>        I have attached the debugs for Node2:
>>>        (1 - 10.102.33.115, 2 - 10.102.33.150, 3 -10.102.33.180)
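>>>        For reference, the consensus and join timeouts mentioned in steps
>>>        5-8 are totem timers. As a rough sketch (these are the stock
>>>        default values, not anything from my config; see the man page for
>>>        your version), they live in the totem section of
>>>        openais.conf/corosync.conf:
>>>
>>>            totem {
>>>                version: 2
>>>                # token loss timeout, in ms
>>>                token: 1000
>>>                # how long to wait for consensus before starting a new
>>>                # membership round, in ms
>>>                consensus: 1200
>>>                # how long to wait for join messages in the gather
>>>                # state, in ms
>>>                join: 50
>>>            }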
>>>        Regards,
>>>        Ranjith
>>>
>>>        On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake <sd...@redhat.com> wrote:
>>>
>>>            On 09/21/2010 11:15 PM, Ranjith wrote:
>>>
>>>                Hi all,
>>>                Kindly comment on the above behaviour
>>>                Regards,
>>>                Ranjith
>>>
>>>                On Tue, Sep 21, 2010 at 9:52 PM, Ranjith <ranjith.nath...@gmail.com> wrote:
>>>
>>>                    Hi all,
>>>                    I was testing the corosync cluster engine using the
>>>                    testcpg executable provided along with the release.
>>>                    I am seeing the behaviour below while testing some
>>>                    specific scenarios. Kindly comment on the expected
>>>                    behaviour.
>>>                    1)   3 Node cluster
>>>                                       1---------2---------3
>>>                         a) Suppose I bring nodes 1 & 2 up; they form a
>>>                            ring (1,2).
>>>                         b) Now bring up 3.
>>>                         c) 3 sends a join, which restarts the membership
>>>                            process.
>>>                         d) (1,2) again forms the ring; 3 forms a cluster
>>>                            by itself.
>>>                         e) Now 3 sends a join (due to the join or another
>>>                            timeout).
>>>                         f) The membership protocol is started again, as 2
>>>                            responds to this by going to the gather state
>>>                            (I believe 2 should not accept this, as 2
>>>                            would have earlier decided that 3 had failed).
>>>                         I am seeing a continuous loop of the above
>>>                         behaviour (operational -> membership ->
>>>                         operational -> ...) due to which the cluster is
>>>                         never stabilizing.
>>>                    2)   3 Node Cluster
>>>                                       1---------2-----------3
>>>                          a) Bring up all three nodes at the same time
>>>                             (none of the nodes have seen each other
>>>                             before this).
>>>                          b) Now each node forms a cluster by itself.
>>>                             (Here I think it should form either a (1,2)
>>>                             or a (2,3) ring.)
>>>                    Regards,
>>>                    Ranjith
>>>
>>>
>>>
>>>
>>>            Ranjith,
>>>
>>>            Which version of corosync are you running?
>>>
>>>            can you run corosync-blackbox and attach the output?
>>>
>>>            Thanks
>>>            -steve
>>>
>>>
>>
>
_______________________________________________
Openais mailing list
Openais@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/openais
