On 07/09/2012 12:58 PM, Nikola Ciprich wrote:
> Hello Andreas,
>
> yes, you're right. I should have sent those in the initial post. Sorry
> about that.
> I've created a very simple test configuration on which I can reproduce the
> problem. There's no STONITH etc., since it's just two virtual machines for
> the test.
>
> crm configuration:
>
> primitive drbd-sas0 ocf:linbit:drbd \
>   params drbd_resource="drbd-sas0" \
>   operations $id="drbd-sas0-operations" \
>   op start interval="0" timeout="240s" \
>   op stop interval="0" timeout="200s" \
>   op promote interval="0" timeout="200s" \
>   op demote interval="0" timeout="200s" \
>   op monitor interval="179s" role="Master" timeout="150s" \
>   op monitor interval="180s" role="Slave" timeout="150s"
>
> primitive lvm ocf:lbox:lvm.ocf \
Why not use the RA that comes with the resource-agents package?

>   op start interval="0" timeout="180" \
>   op stop interval="0" timeout="180"
>
> ms ms-drbd-sas0 drbd-sas0 \
>   meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" \
>   notify="true" globally-unique="false" interleave="true" target-role="Started"
>
> clone cl-lvm lvm \
>   meta globally-unique="false" ordered="false" interleave="true" \
>   notify="false" target-role="Started" \
>   params lvm-clone-max="2" lvm-clone-node-max="1"
>
> colocation col-lvm-drbd-sas0 inf: cl-lvm ms-drbd-sas0:Master
>
> order ord-drbd-sas0-lvm inf: ms-drbd-sas0:promote cl-lvm:start
>
> property $id="cib-bootstrap-options" \
>   dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
>   cluster-infrastructure="openais" \
>   expected-quorum-votes="2" \
>   no-quorum-policy="ignore" \
>   stonith-enabled="false"
>
> The lvm resource starts the vgshared volume group on top of DRBD (LVM
> filters are set to use /dev/drbd* devices only).
>
> drbd configuration:
>
> global {
>   usage-count no;
> }
>
> common {
>   protocol C;
>
>   handlers {
>     pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; ";
>     pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; ";
>     local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; ";
>
>     # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>     # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>     # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>     # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>     # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>     # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>     # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
>     # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
>   }
>
>   net {
>     allow-two-primaries;
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri call-pri-lost-after-sb;
>     # rr-conflict disconnect;
>     max-buffers 8000;
>     max-epoch-size 8000;
>     sndbuf-size 0;
>     ping-timeout 50;
>   }
>
>   syncer {
>     rate 100M;
>     al-extents 3833;
>     # al-extents 257;
>     # verify-alg sha1;
>   }
>
>   disk {
>     on-io-error detach;
>     no-disk-barrier;
>     no-disk-flushes;
>     no-md-flushes;
>   }
>
>   startup {
>     # wfc-timeout 0;
>     degr-wfc-timeout 120; # 2 minutes.
>     # become-primary-on both;

This "become-primary-on" was never activated?

>   }
> }
>
> Note that the pri-on-incon-degr etc. handlers are intentionally commented
> out so I can see what's going on; otherwise the machine always got an
> immediate reboot.
>
> Any idea?

Is the drbd init script deactivated on system boot? The cluster logs should
give more insight ...

Regards,
Andreas

-- 
Need help with Pacemaker? http://www.hastexo.com/now

> thanks a lot in advance
>
> nik
>
> On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>> hello,
>>>
>>> I'm trying to solve a quite mysterious problem here.
>>> I've got a new cluster with a bunch of SAS disks for testing purposes.
>>> I've configured the DRBDs (in primary/primary configuration).
>>>
>>> When I start drbd using drbdadm, it comes up nicely (both nodes
>>> are Primary, connected).
>>> However, when I start it using corosync, I always get a split-brain,
>>> although no data were written and there was no network disconnection or
>>> anything like that.
>>
>> your full drbd and Pacemaker configuration please ... some snippets from
>> something are very seldom helpful ...
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>> here's the drbd resource config:
>>>
>>> primitive drbd-sas0 ocf:linbit:drbd \
>>>   params drbd_resource="drbd-sas0" \
>>>   operations $id="drbd-sas0-operations" \
>>>   op start interval="0" timeout="240s" \
>>>   op stop interval="0" timeout="200s" \
>>>   op promote interval="0" timeout="200s" \
>>>   op demote interval="0" timeout="200s" \
>>>   op monitor interval="179s" role="Master" timeout="150s" \
>>>   op monitor interval="180s" role="Slave" timeout="150s"
>>>
>>> ms ms-drbd-sas0 drbd-sas0 \
>>>   meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" \
>>>   notify="true" globally-unique="false" interleave="true" target-role="Started"
>>>
>>> here's the dmesg output when pacemaker tries to promote drbd, causing the
>>> split-brain:
>>>
>>> [  157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
>>> [  157.646539] block drbd2: disk( Diskless -> Attaching )
>>> [  157.650364] block drbd2: Found 1 transactions (1 active extents) in activity log.
>>> [  157.650560] block drbd2: Method to ensure write ordering: drain
>>> [  157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
>>> [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 pages=2231
>>> [  157.653760] block drbd2: size = 279 GB (292333844 KB)
>>> [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
>>> [  157.673722] block drbd2: recounting of set bits took additional 2 jiffies
>>> [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>>> [  157.673972] block drbd2: disk( Attaching -> UpToDate )
>>> [  157.674100] block drbd2: attached to UUIDs 0150944D23F16BAE:0000000000000000:8C175205284E3262:8C165205284E3263
>>> [  157.685539] block drbd2: conn( StandAlone -> Unconnected )
>>> [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker [6893])
>>> [  157.685928] block drbd2: receiver (re)started
>>> [  157.686071] block drbd2: conn( Unconnected -> WFConnection )
>>> [  158.960577] block drbd2: role( Secondary -> Primary )
>>> [  158.960815] block drbd2: new current UUID 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
>>> [  162.686990] block drbd2: Handshake successful: Agreed network protocol version 96
>>> [  162.687183] block drbd2: conn( WFConnection -> WFReportParams )
>>> [  162.687404] block drbd2: Starting asender thread (from drbd2_receiver [6927])
>>> [  162.687741] block drbd2: data-integrity-alg: <not-used>
>>> [  162.687930] block drbd2: drbd_sync_handshake:
>>> [  162.688057] block drbd2: self 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 flags:0
>>> [  162.688244] block drbd2: peer 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 flags:0
>>> [  162.688428] block drbd2: uuid_compare()=100 by rule 90
>>> [  162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
>>> [  162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>>>
>>> To me it seems that it's promoting too early, and I also wonder why there
>>> is the "new current UUID" line.
>>>
>>> I'm using CentOS 6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6.
>>>
>>> Could anybody please advise me? I'm sure I'm doing something stupid,
>>> but I can't figure out what...
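For reference: the log above shows the promote at t=158.9 happening while the connection is still in WFConnection; the handshake only completes at t=162.6, so each node has already generated its own new current UUID as Primary, and uuid_compare() then sees two unrelated current UUIDs and declares split-brain. Once that has happened, manual recovery along these lines reconnects the resource (a sketch only, using drbd 8.3 syntax to match the versions in this thread; the resource name `drbd-sas0` is taken from the config above, and which node is the "victim" whose changes are discarded is your choice):

```shell
# On the node whose copy of the data you are willing to discard (the victim).
# drbd 8.3 option syntax; 8.4+ would use "drbdadm connect --discard-my-data <res>".
drbdadm secondary drbd-sas0
drbdadm -- --discard-my-data connect drbd-sas0

# On the surviving node, reconnect if it has dropped to StandAlone:
drbdadm connect drbd-sas0
```

After the reconnect, the victim resyncs from the survivor and can be promoted again once UpToDate.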
>>>
>>> thanks a lot in advance
>>>
>>> with best regards
>>>
>>> nik
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
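On the init-script question raised above: when Pacemaker manages DRBD through ocf:linbit:drbd, the distribution init script must not also bring the resources up at boot, otherwise the two race and a too-early Primary is exactly the kind of thing that produces this split-brain on promote. On CentOS 6 this can be checked and disabled roughly like so (a sketch; service and runlevel handling as on stock CentOS 6):

```shell
# Ensure DRBD is not started outside Pacemaker's control at boot:
chkconfig --list drbd   # every runlevel should show "off"
chkconfig drbd off

# If the init script already brought DRBD up, take it down before
# (re)starting the cluster stack, so Pacemaker starts from a clean state:
service drbd stop
```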