On 09/24/2010 08:20 PM, Ranjith wrote:
Hi,
It is hard to tell what is happening without logs from all 3 nodes. Does
this only happen at system start, or can you duplicate 5 minutes after
systems have started?
>> The cluster never stabilizes. It keeps switching between the
membership and operational states.
Below is the test network I am using:

[attached network diagram: Untitled.png]
>> N1 and N3 do not receive any packets from each other. What I
expected was that either (N1,N2) or (N2,N3) would form a two-node
cluster and stabilize. But the cluster never stabilizes: even though
two-node clusters form, it keeps going back to membership [I checked
the logs, and this seems to happen because of the steps I mentioned in
the previous mail].



...... Where did you say you were testing a byzantine fault in your original bug report? Please be more forthcoming in the future. Corosync does not protect against byzantine faults, and allowing one-way connectivity on a network link is exactly this fault scenario. You can try coro-netctl (the attached script), which atomically blocks corosync's UDP ports on a node so you can test split-brain scenarios without actually pulling network cables.

Regards
-steve


Regards,
Ranjith
On Fri, Sep 24, 2010 at 11:36 PM, Steven Dake <sd...@redhat.com> wrote:

    It is hard to tell what is happening without logs from all 3 nodes.
    Does this only happen at system start, or can you duplicate 5
    minutes after systems have started?

    If it is at system start, you may need to enable "fast STP" on your
    switch.  It looks to me like node 3 gets some messages through but
    then is blocked.  STP will do this in its default state on most
    switches.

    Another option if you can't enable STP is to use broadcast mode (man
    openais.conf for details).
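    For example, a minimal totem interface stanza for broadcast mode (a
    sketch only; the bind address below is a placeholder taken from your
    subnet, and exact option names can differ between versions, so verify
    against the man page):

```
totem {
        version: 2
        interface {
                ringnumber: 0
                bindnetaddr: 10.102.33.0
                broadcast: yes
                mcastport: 5405
        }
}
```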

    Also verify firewalls are properly configured on all nodes.  You can
    join us on the freenode IRC server in #linux-cluster for real-time
    assistance.
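    One quick way to check is to scan the live ruleset for anything that
    would drop totem traffic. A small sketch (check_totem_rules is a
    hypothetical helper, not a corosync tool):

```shell
# Hypothetical helper: read iptables-save output on stdin and report any
# rules that would DROP totem's UDP ports (5404/5405).
check_totem_rules() {
        matches=$(grep -E -e '--dports? (5404|5405|5404,5405)' | grep DROP)
        if [ -n "$matches" ]; then
                printf '%s\n' "$matches"
        else
                echo "no totem DROP rules"
        fi
}

# Usage (as root):  iptables-save | check_totem_rules
```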

    Regards
    -steve


    On 09/22/2010 11:33 PM, Ranjith wrote:

        Hi Steve,
          I am running corosync 1.2.8.
          I didn't get what you meant by blackbox; I assume you mean the
        logs/debug output.
          I checked the logs/debug output and I understand the following:
                                           1--------------2--------------3
        1) Node1 and Node2 are already in a 2-node cluster.
        2) Node3 sends a join with ({1}, {}) (proc_list/fail_list).
        3) Node2 sends a join ({1,2,3}, {}) and Nodes 1/3 update to
        ({1,2,3}, {}).
        4) Node2 reaches consensus after some messages [but Node1 is the
        representative].
        5) The consensus timeout fires at Node1 for Node3; Node1 sends a
        join as ({1,2}, {3}).
        6) Node2 updates to ({1,2}, {3}) because of that message and sends
        out a join. This join, received by Node3, causes it to update to
        ({1,3}, {2}).
        7) Node1 and Node2 enter operational (fail list cleared by Node2),
        but Node3's join timeout fires and it re-enters the membership
        state.
        8) This keeps happening until the consensus timeout fires at Node3
        for Node1 and it moves to ({3}, {1,2}).
        9) Now Node1 and Node2 form a 2-node cluster and Node3 forms a
        single-node cluster.
        10) Node2 broadcasts a normal (operational) message.
        11) Node3 receives it as a foreign message, which forces it back
        to the gather state.
        12) The above steps repeat ...
        The cluster never stabilizes.
        I have attached the debugs for Node2:
        (1 - 10.102.33.115, 2 - 10.102.33.150, 3 -10.102.33.180)
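        The oscillation in steps 2-9 can be replayed as a toy trace
        (illustration only; the per-step views are copied from the list
        above, not derived from corosync's totemsrp code):

```shell
# Toy replay of the (proc_list, fail_list) transitions from steps 2-9.
# step NODE PROC FAIL prints one node's membership view at that moment.
step() { printf 'node%s: proc={%s} fail={%s}\n' "$1" "$2" "$3"; }

step 3 "1"     ""      # step 2: node3 sends join knowing only node1
step 2 "1,2,3" ""      # step 3: node2 merges node3 into its proc_list
step 1 "1,2"   "3"     # step 5: consensus timeout for node3 at node1
step 2 "1,2"   "3"     # step 6: node2 adopts node1's view
step 3 "1,3"   "2"     # step 6: node3 sees node2's join and blames node2
step 3 "3"     "1,2"   # step 8: consensus timeout for node1 at node3
```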
        Regards,
        Ranjith

        On Wed, Sep 22, 2010 at 10:53 PM, Steven Dake <sd...@redhat.com>
        wrote:

            On 09/21/2010 11:15 PM, Ranjith wrote:

                Hi all,
                Kindly comment on the above behaviour
                Regards,
                Ranjith

                On Tue, Sep 21, 2010 at 9:52 PM, Ranjith
                <ranjith.nath...@gmail.com> wrote:

                    Hi all,
                    I was testing the corosync cluster engine using the
                    testcpg executable provided with the release. I am
                    seeing the behaviour below while testing some specific
                    scenarios. Kindly comment on the expected behaviour.
                    1)   3-node cluster
                                       1---------2---------3
                         a) Suppose I bring nodes 1 & 2 up; they form a
                    ring (1,2).
                         b) Now bring up 3.
                         c) 3 sends a join, which restarts the membership
                    process.
                         d) (1,2) again forms the ring; 3 forms a cluster
                    by itself.
                         e) Now 3 sends a join (due to the join or another
                    timeout).
                         f) The membership protocol starts again because 2
                    responds to this by going to the gather state (I
                    believe 2 should not accept this, as 2 would earlier
                    have decided that 3 had failed).
                         I am seeing a continuous loop of the above
                    behaviour (operational -> membership -> operational ->
                    ...) due to which the cluster never stabilizes.
                    2)   3-node cluster
                                       1---------2-----------3
                          a) Bring up all three nodes at the same time
                    (none of the nodes have seen each other before this).
                          b) Now each node forms a cluster by itself.
                    (Here I think it should form either a (1,2) or (2,3)
                    ring.)
                    Regards,
                    Ranjith




            Ranjith,

            Which version of corosync are you running?

            can you run corosync-blackbox and attach the output?

            Thanks
            -steve


                _______________________________________________
                Openais mailing list
                Openais@lists.linux-foundation.org
                https://lists.linux-foundation.org/mailman/listinfo/openais






#!/bin/sh
# coro-netctl: simulate a network split for corosync testing by blocking
# the totem ports (UDP 5404/5405) with iptables instead of pulling cables.
#
# Usage (as root):
#   coro-netctl drop    # block totem traffic on this node
#   coro-netctl         # flush the rules and restore connectivity

drop() {
        tmpf=$(mktemp) || exit 1

        # Load the rules atomically via iptables-restore so the node never
        # sees a half-applied ruleset.  The commented lines are a
        # multicast-wide variant of the same block.
        cat > "$tmpf" <<'EOF'
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state NEW -p udp --dport 5404 -j DROP
-A INPUT -m state --state NEW -p udp --dport 5405 -j DROP
#-A INPUT -m pkttype --pkt-type multicast -j DROP
-A OUTPUT -m state --state NEW -p udp --dport 5404 -j DROP
-A OUTPUT -m state --state NEW -p udp --dport 5405 -j DROP
#-A OUTPUT -m pkttype --pkt-type multicast -j DROP
COMMIT
EOF

        iptables-restore < "$tmpf"
        rm -f "$tmpf"
}

if [ "$1" = "drop" ]; then
        drop
else
        iptables -F
fi
