On Tue, Feb 28, 2012 at 11:49 AM, William Seligman
<selig...@nevis.columbia.edu> wrote:
> I'm trying to set up an active/active HA cluster as explained in Clusters From
> Scratch (which I just re-read after my last problem).
>
> I'll give versions and config files below, but I'll start with what happens. I
> start with an active/active cman+pacemaker+drbd+gfs2 cluster, with fencing
> enabled. My fencing mechanism cuts power to a node by turning the load off in
> its UPS. The two nodes are hypatia-tb and orestes-tb.
>
> I want to test fencing and recovery. I start with both nodes running, and
> resources properly running on both nodes. Then I simulate failure on one node,
> e.g., orestes-tb. I've done this with "crm node standby", "service pacemaker
> off", or by pulling the plug. As expected, all the resources move to
> hypatia-tb, with the drbd resource as Primary.
>
> When I try to bring orestes-tb back into the cluster with "crm node online" or
> "service pacemaker on" (the inverse of how I removed it), orestes-tb is
> fenced. OK, that makes sense, I guess; there's a potential split-brain
> situation.

Not really. That should only happen if the two nodes can't see each
other, which should not be the case here.
Only when you pull the plug should orestes-tb be fenced.

Or, if you're using a fencing device that requires the node to have
power, then I can imagine that turning it on again might result in
fencing, but not in the other cases.


>
> I bring orestes-tb back up, with the intent of adding it back into the
> cluster. I make sure the cman, pacemaker, and drbd services are off at system
> start. On orestes-tb, I type "service drbd start".
>
> What I expect to happen is that the drbd resource on orestes-tb is marked
> "Outdated" or something like that. Then I'd fix it with "drbdadm
> --discard-my-data connect admin" or whatever is appropriate.
>
> What actually happens is that hypatia-tb is fenced. Since this is the node
> running all the resources, this is bad behavior. It's even more puzzling when
> I consider that, at the time, there isn't any fencing resource actually
> running on orestes-tb; my guess is that DRBD on hypatia-tb is fencing itself.
>
> Eventually hypatia-tb reboots, and the cluster goes back to normal. But as a
> fencing/stability/HA test, this is a failure.
>
> I've repeated this with a number of variations. In the end, both systems have
> to be fenced/rebooted before the cluster is working again.
>
> Any ideas?
>
> Versions:
>
> Scientific Linux 6.2
> kernel 2.6.32
> cman-3.0.12
> corosync-1.4.1
> pacemaker-1.1.6
> drbd-8.4.1
>
> /etc/drbd.d/global-common.conf:
>
> global {
>        usage-count yes;
> }
>
> common {
>        startup {
>                wfc-timeout             60;
>                degr-wfc-timeout        60;
>                outdated-wfc-timeout    60;
>        }
> }
>
> /etc/drbd.d/admin.res:
>
> resource admin {
>
>        protocol C;
>
>        on hypatia-tb.nevis.columbia.edu {
>                volume 0 {
>                        device          /dev/drbd0;
>                        disk            /dev/md2;
>                        flexible-meta-disk      internal;
>                }
>                address         192.168.100.7:7788;
>        }
>        on orestes-tb.nevis.columbia.edu {
>                volume 0 {
>                        device          /dev/drbd0;
>                        disk            /dev/md2;
>                        flexible-meta-disk      internal;
>                }
>                address         192.168.100.6:7788;
>        }
>
>        startup {
>        }
>
>        net {
>                allow-two-primaries     yes;
>                after-sb-0pri      discard-zero-changes;
>                after-sb-1pri      discard-secondary;
>                after-sb-2pri      disconnect;
>                sndbuf-size 0;
>        }
>
>        disk {
>                resync-rate     100M;
>                c-max-rate      100M;
>                al-extents      3389;
>                fencing resource-only;
>        }
> }
>
> An edited output of "crm configure show":
>
> node hypatia-tb.nevis.columbia.edu
> node orestes-tb.nevis.columbia.edu
> primitive StonithHypatia stonith:fence_nut \
>   params pcmk_host_check="static-list" \
>   pcmk_host_list="hypatia-tb.nevis.columbia.edu" \
>   ups="sofia-ups" username="admin" password="XXX"
> primitive StonithOrestes stonith:fence_nut \
>   params pcmk_host_check="static-list" \
>   pcmk_host_list="orestes-tb.nevis.columbia.edu" \
>   ups="dc-test-stand-ups" username="admin" password="XXX"
> location StonithHypatiaLocation StonithHypatia \
>   -inf: hypatia-tb.nevis.columbia.edu
> location StonithOrestesLocation StonithOrestes \
>   -inf: orestes-tb.nevis.columbia.edu
>
> /etc/cluster/cluster.conf:
>
> <?xml version="1.0"?>
> <cluster config_version="17" name="Nevis_HA">
>  <logging debug="off"/>
>  <cman expected_votes="1" two_node="1" />
>  <clusternodes>
>    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
>      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
>      <fence>
>        <method name="pcmk-redirect">
>          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
>        </method>
>      </fence>
>    </clusternode>
>    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
>      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
>      <fence>
>        <method name="pcmk-redirect">
>          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
>        </method>
>      </fence>
>    </clusternode>
>  </clusternodes>
>  <fencedevices>
>    <fencedevice name="pcmk" agent="fence_pcmk"/>
>  </fencedevices>
>  <fence_daemon post_join_delay="30" />
>  <rm disabled="1" />
> </cluster>
>
>
> The log messages on orestes-tb, just before hypatia-tb is fenced (there are no
> messages in the hypatia-tb log for this time):
>
> Feb 15 16:52:27 orestes-tb kernel: drbd: initialized. Version: 8.4.1 (api:1/proto:86-100)
> Feb 15 16:52:27 orestes-tb kernel: drbd: GIT-hash: 91b4c048c1a0e06777b5f65d312b38d47abaea80 build by r...@orestes-tb.nevis.columbia.edu, 2012-02-14 17:05:32
> Feb 15 16:52:27 orestes-tb kernel: drbd: registered as block device major 147
> Feb 15 16:52:27 orestes-tb kernel: d-con admin: Starting worker thread (from drbdsetup [2570])
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: disk( Diskless -> Attaching )
> Feb 15 16:52:27 orestes-tb kernel: d-con admin: Method to ensure write ordering: barrier
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: max BIO size = 130560
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: Adjusting my ra_pages to backing device's (32 -> 768)
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: drbd_bm_resize called with capacity == 5611549368
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: resync bitmap: bits=701443671 words=10960058 pages=21407
> Feb 15 16:52:27 orestes-tb kernel: block drbd0: size = 2676 GB (2805774684 KB)
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: bitmap READ of 21407 pages took 634 jiffies
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: recounting of set bits took additional 92 jiffies
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: disk( Attaching -> Outdated )
> Feb 15 16:52:28 orestes-tb kernel: block drbd0: attached to UUIDs F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6
> Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
> Feb 15 16:52:28 orestes-tb kernel: d-con admin: Starting receiver thread (from drbd_w_admin [2572])
> Feb 15 16:52:28 orestes-tb kernel: d-con admin: receiver (re)started
> Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
> Feb 15 16:52:29 orestes-tb kernel: d-con admin: Handshake successful: Agreed network protocol version 100
> Feb 15 16:52:29 orestes-tb kernel: d-con admin: conn( WFConnection -> WFReportParams )
> Feb 15 16:52:29 orestes-tb kernel: d-con admin: Starting asender thread (from drbd_r_admin [2579])
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: drbd_sync_handshake:
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: self F5355FCF6114F218:0000000000000000:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:0 flags:0
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer 06B93A6C54D6D631:F5355FCF6114F219:8A5519C7090D6BD6:8A5419C7090D6BD6 bits:615 flags:0
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: uuid_compare()=-1 by rule 50
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 39(1), total 39; compression: 100.0%
> Feb 15 16:52:29 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Feb 15 16:52:50 orestes-tb kernel: d-con admin: PingAck did not arrive in time.
> Feb 15 16:52:50 orestes-tb kernel: d-con admin: peer( Primary -> Unknown ) conn( WFSyncUUID -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Feb 15 16:52:50 orestes-tb kernel: d-con admin: asender terminated
> Feb 15 16:52:50 orestes-tb kernel: d-con admin: Terminating asender thread
> Feb 15 16:52:51 orestes-tb kernel: block drbd0: bitmap WRITE of 3 pages took 247 jiffies
> Feb 15 16:52:51 orestes-tb kernel: block drbd0: 2460 KB (615 bits) marked out-of-sync by on disk bit-map.
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: Connection closed
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( NetworkFailure -> Unconnected )
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver terminated
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: Restarting receiver thread
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: receiver (re)started
> Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
>
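
One way to read a log like this is to pull out just the conn() state
changes and look at the timing. The following is only a minimal Python
sketch, not part of any DRBD or Pacemaker tooling: the sample lines are
copied from the orestes-tb log above, and the names LOG, CONN_RE,
conn_transitions and seconds are made up for illustration.

```python
import re

# Sample lines copied from the orestes-tb log quoted above (unwrapped).
LOG = """\
Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( StandAlone -> Unconnected )
Feb 15 16:52:28 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
Feb 15 16:52:29 orestes-tb kernel: d-con admin: conn( WFConnection -> WFReportParams )
Feb 15 16:52:29 orestes-tb kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Feb 15 16:52:29 orestes-tb kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Feb 15 16:52:50 orestes-tb kernel: d-con admin: PingAck did not arrive in time.
Feb 15 16:52:50 orestes-tb kernel: d-con admin: peer( Primary -> Unknown ) conn( WFSyncUUID -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( NetworkFailure -> Unconnected )
Feb 15 16:52:51 orestes-tb kernel: d-con admin: conn( Unconnected -> WFConnection )
"""

# Matches "conn( Old -> New )" anywhere in a log line.
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def conn_transitions(log_text):
    """Return a list of (time, old_state, new_state) for each conn() change."""
    result = []
    for line in log_text.splitlines():
        match = CONN_RE.search(line)
        if match:
            time = line.split()[2]  # third syslog field is HH:MM:SS
            result.append((time, match.group(1), match.group(2)))
    return result


def seconds(hms):
    """Convert an HH:MM:SS string to seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


transitions = conn_transitions(LOG)
for time, old, new in transitions:
    print(f"{time}  {old} -> {new}")
```

On this excerpt the connection reaches WFSyncUUID at 16:52:29 and stays
there until the PingAck timeout at 16:52:50, 21 seconds later. That would
fit with hypatia-tb going away in the middle of the resync handshake, but
the log alone doesn't say what fenced it.
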
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
>
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems