Hello,

I have a two-node cluster with qdisk and rgmanager. When I kill aisexec on node1 (where the BROKER service is running), I get a split-brain situation: the BROKER service is running on both nodes. After I upgraded rgmanager to the version from RH 5.5 BETA (rgmanager-2.0.52-3.el5), the split brain no longer occurs, because the IP address is still held by node1 and node2 fails to start the service on the address collision (rhbz#526647). I think rgmanager on node1 should handle this situation and stop the BROKER service when aisexec is down. The problem arises because I use fence_scsi, and it would be the same with any SAN fencing, e.g. fence_brocade.

---node1---
Mar 5 10:48:52 node1 clurgmgrd: [10813]: <info> Executing /opt/webmeth/71_prodBroker/Broker/aw_broker71 status
Mar  5 10:49:14 node1 fenced[10361]: cluster is down, exiting
Mar  5 10:49:14 node1 gfs_controld[10373]: cluster is down, exiting
Mar  5 10:49:14 node1 dlm_controld[10367]: cluster is down, exiting
Mar  5 10:49:14 node1 kernel: dlm: closing connection to node 2
Mar  5 10:49:14 node1 kernel: dlm: closing connection to node 1
Mar  5 10:49:19 node1 qdiskd[10340]: <err> cman_dispatch: Host is down
Mar  5 10:49:19 node1 qdiskd[10340]: <err> Halting qdisk operations
Mar  5 10:49:25 node1 kernel: dlm: connect from non cluster node
Mar  5 10:49:42 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 30 seconds.
Mar  5 10:50:13 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 60 seconds.
Mar  5 10:50:43 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 90 seconds.
Mar  5 10:51:13 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 120 seconds.
Mar  5 10:51:43 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 150 seconds.
Mar  5 10:52:13 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 180 seconds.
Mar  5 10:52:43 node1 ccsd[10298]: Unable to connect to cluster infrastructure after 210 seconds.
---node1---

---node2---
Mar  5 10:50:47 node1 clurgmgrd[20822]: <info> Waiting for node #1 to be fenced
Mar  5 10:51:11 node1 fenced[8540]: node1 not a cluster member after 30 sec post_fail_delay
Mar  5 10:51:11 node1 fenced[8540]: fencing node "node1"
Mar  5 10:51:11 node1 fenced[8540]: fence "node1" success
Mar  5 10:51:13 node1 clurgmgrd[20822]: <info> Node #1 fenced; continuing
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> Taking over service service:BROKER from down member node1
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <info> mounting /dev/mapper/storage0-broker on /opt/webmeth/71_prodBroker/Broker/data
Mar  5 10:51:13 node1 kernel: kjournald starting.  Commit interval 5 seconds
Mar  5 10:51:13 node1 kernel: EXT3 FS on dm-7, internal journal
Mar  5 10:51:13 node1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <info> Adding IPv4 address 192.168.33.18/24 to bond0
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <err> IPv4 address collision 192.168.33.18
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> start on ip "192.168.33.18/24" returned 1 (generic error)
Mar  5 10:51:13 node1 clurgmgrd[20822]: <warning> #68: Failed to start service:BROKER; return value: 1
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> Stopping service service:BROKER
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <info> Executing /opt/webmeth/71_prodBroker/Broker/aw_broker71 stop
Mar  5 10:51:13 node1 clurgmgrd: [20822]: <info> unmounting /opt/webmeth/71_prodBroker/Broker/data
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> Service service:BROKER is recovering
Mar  5 10:51:13 node1 clurgmgrd[20822]: <warning> #71: Relocating failed service service:BROKER
Mar  5 10:51:13 node1 clurgmgrd[20822]: <notice> Service service:BROKER is stopped
---node2---

I have tested this with cman from RH 5.5 (cman-2.0.115-29.el5) and cman for RH 5.4 BETA (cman-2.0.115-1.el5_4.9).

Here is my config.

---cut---

<cluster alias="PROD-RH-CLUSTER-BROKER" config_version="5" name="PROD-BROKER">
        <quorumd device="/dev/emcpowerb" interval="5" status_file="/root/qdiskstat" tko="8" votes="2">
                <heuristic interval="5" program="ping 192.168.33.254 -c1 -t1" score="1" tko="6"/>
                <heuristic interval="5" program="/usr/local/bin/smartTouch.sh /opt/webmeth/71_prodBroker/Broker/data" score="1" tko="6"/>
        </quorumd>
        <fence_daemon post_fail_delay="30" post_join_delay="120"/>
        <cman expected_votes="6" two_node="0" broadcast="yes" quorum_dev_poll="35000"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="2">
                        <fence>
                                <method name="1">
                                        <device name="scsi3-pr" node="node1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="2">
                        <fence>
                                <method name="1">
                                        <device name="scsi3-pr" node="node2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <fencedevices>
                <fencedevice agent="fence_scsi" name="scsi3-pr"/>
        </fencedevices>
        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="BROKER" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" priority="1"/>
                                <failoverdomainnode name="node2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.33.18/24" monitor_link="1"/>
                        <script file="/opt/webmeth/71_prodBroker/Broker/aw_broker71" name="broker"/>
                        <fs device="/dev/mapper/storage0-broker" force_fsck="1" force_unmount="1" fsid="29845" fstype="ext3" mountpoint="/opt/webmeth/71_prodBroker/Broker/data" name="BROKER-FS" options="" self_fence="1"/>
                </resources>
                <service autostart="1" domain="BROKER" name="BROKER">
                        <fs ref="BROKER-FS"/>
                        <ip ref="192.168.33.18/24"/>
                        <script ref="broker"/>
                </service>
        </rm>
        <totem consensus="4500" token="85000" token_retransmits_before_loss_const="20"/>
</cluster>

---cut---
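
For reference, here is a rough sketch of the vote arithmetic implied by the config above. It assumes that cman derives the quorum threshold as expected_votes / 2 + 1 (integer division); that formula and the helper names quorum_threshold and is_quorate are illustrative assumptions, not taken from the cman sources. The vote values are copied from the config.

```python
# Rough sketch of the vote arithmetic for the cluster.conf above.
# ASSUMPTION: quorum threshold = expected_votes // 2 + 1 (not verified against cman).
# The function names are hypothetical and exist only for this illustration.

EXPECTED_VOTES = 6                       # <cman expected_votes="6">
NODE_VOTES = {"node1": 2, "node2": 2}    # <clusternode ... votes="2">
QDISK_VOTES = 2                          # <quorumd ... votes="2">


def quorum_threshold(expected_votes: int) -> int:
    """Votes a partition needs to be quorate, under the assumed formula."""
    return expected_votes // 2 + 1


def is_quorate(live_nodes, has_qdisk: bool) -> bool:
    """Sum the votes visible to a partition and compare with the threshold."""
    votes = sum(NODE_VOTES[name] for name in live_nodes)
    if has_qdisk:
        votes += QDISK_VOTES
    return votes >= quorum_threshold(EXPECTED_VOTES)


if __name__ == "__main__":
    print("threshold:", quorum_threshold(EXPECTED_VOTES))
    print("node2 + qdisk quorate:", is_quorate(["node2"], has_qdisk=True))
    print("node2 alone quorate:", is_quorate(["node2"], has_qdisk=False))
```

Under that assumption, node2 together with the quorum disk holds 4 of the 6 expected votes, which would explain why node2 stays quorate, fences node1 and tries to take over BROKER once aisexec on node1 is gone.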

Best Regards
Maciej Bogucki
