Hello guys,
I'm trying to run a two-node cluster using DRBD 8.x and GFS 1.x
on RHEL 5.2 x86_64.
The problem is: when one machine (node2) goes down, node1 stops working
(even a simple 'ls -l' on the shared mount point hangs) until the second
machine returns.
I'm setting up GFS this way:
# gfs_mkfs -t hotsite:gfs-00 -p lock_dlm -j 2 /dev/drbd0
# mount -v /dev/drbd0 /test
I'm causing a failure on the second node this way:
# echo 1 > /proc/sys/kernel/sysrq
# echo b > /proc/sysrq-trigger
==============================================================================
$ cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="hotsite" config_version="4">
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_join_delay="60">
  </fence_daemon>
  <clusternodes>
    <clusternode name="drdb_hotsite-1" nodeid="1">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="drdb_hotsite-2" nodeid="2">
      <fence>
        <method name="single">
          <device name="gnbd" ipaddr="192.168.0.3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
</cluster>
==============================================================================
Here are the logs from node1:
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: PingAck did not arrive in time.
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: peer( Primary -> Unknown )
conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: asender terminated
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Terminating asender thread
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: short read expecting header on
sock: r=-512
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Creating new current UUID
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block
now.
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Connection closed
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: helper command: /sbin/drbdadm
outdate-peer
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: outdate-peer helper broken,
returned 0
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Considering state change from
bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: old = { cs:NetworkFailure
st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: new = { cs:Unconnected
st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: conn( NetworkFailure ->
Unconnected )
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: receiver terminated
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: receiver (re)started
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Considering state change from
bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: old = { cs:Unconnected
st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: new = { cs:WFConnection
st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: conn( Unconnected ->
WFConnection )
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] The token was lost in
the OPERATIONAL state.
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] Receive multicast
socket recv buffer size (288000 bytes).
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] Transmit multicast
socket send buffer size (262142 bytes).
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state
from 2.
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: drdb_hotsite-2 not a cluster
member after 0 sec post_fail_delay
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state
from 0.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Creating commit token
because I am the rep.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Saving state aru 31
high seq received 31
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Storing new sequence id
for ring 168
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering COMMIT state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering RECOVERY
state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [0] member
192.168.0.3:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 356
rep 192.168.0.3
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 31 high delivered
31 received flag 1
Jun 11 19:59:12 hotsite-bsb-la-1 kernel: dlm: closing connection to node 2
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Did not need to
originate any messages in recovery.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Sending initial ORF
token
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION
CHANGE
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0)
ip(192.168.0.3)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0)
ip(192.168.0.4)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION
CHANGE
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0)
ip(192.168.0.3)
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [SYNC ] This node is within the
primary component and will provide service.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL
state.
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message
192.168.0.3
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CPG ] got joinlist message
from node 1
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:22 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:22 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:27 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:27 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
.....
Jun 11 20:01:32 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state
from 11.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Creating commit token
because I am the rep.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Saving state aru 14
high seq received 14
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Storing new sequence id
for ring 16c
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering COMMIT state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering RECOVERY
state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [0] member
192.168.0.3:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 360
rep 192.168.0.3
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 14 high delivered
14 received flag 1
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [1] member
192.168.0.4:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 360
rep 192.168.0.4
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 9 high delivered 9
received flag 1
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Did not need to
originate any messages in recovery.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Sending initial ORF
token
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION
CHANGE
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0)
ip(192.168.0.3)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] CLM CONFIGURATION
CHANGE
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] New Configuration:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0)
ip(192.168.0.3)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0)
ip(192.168.0.4)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Left:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] Members Joined:
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] r(0)
ip(192.168.0.4)
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [SYNC ] This node is within the
primary component and will provide service.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL
state.
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message
192.168.0.4
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM ] got nodejoin message
192.168.0.3
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CPG ] got joinlist message
from node 1
Jun 11 20:01:42 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
Trying to acquire journal lock...
Jun 11 20:01:42 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
Looking at journal...
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Handshake successful: Agreed
network protocol version 88
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Peer authenticated using 20
bytes of 'sha1' HMAC
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Considering state change from
bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: old = { cs:WFConnection
st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: new = { cs:WFReportParams
st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: conn( WFConnection ->
WFReportParams )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Starting asender thread (from
drbd0_receiver [526])
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: data-integrity-alg: <not-used>
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: peer( Unknown -> Secondary )
conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Outdated )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block
now.
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: tl_clear()
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: susp( 1 -> 0 )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: peer( Secondary -> Primary )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: conn( WFBitMapS -> SyncSource )
pdsk( Outdated -> Inconsistent )
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Began resync as SyncSource
(will sync 548864 KB [137216 bits set]).
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block
now.
Jun 11 20:05:05 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
Acquiring the transaction lock...
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
Replaying journal...
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
Replayed 0 of 1 blocks
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
replays = 0, skips = 0, sames = 1
Jun 11 20:05:10 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
Journal replayed in 5s
Jun 11 20:05:10 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Done
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: Resync done (total 15 sec;
paused 0 sec; 36588 K/sec)
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: conn( SyncSource -> Connected )
pdsk( Inconsistent -> UpToDate )
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block
now.
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: Trying to join cluster "lock_dlm",
"hotsite:gfs-00"
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: dlm: Using TCP for communications
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: Joined cluster. Now mounting FS...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0:
Trying to acquire journal lock...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0:
Looking at journal...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Done
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
Trying to acquire journal lock...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1:
Looking at journal...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Done
Jun 11 20:07:25 hotsite-bsb-la-1 kernel: dlm: connecting to 2
Thanks!
--
Tiago Cruz
http://everlinux.com
Linux User #282636