On Wed, Jul 11, 2012 at 8:06 AM, Andreas Kurz <andr...@hastexo.com> wrote:
> On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
> <nikola.cipr...@linuxbox.cz> wrote:
>> Hello Andreas,
>>> Why not using the RA that comes with the resource-agent package?
>> well, I've historically used my scripts, haven't even noticed when LVM
>> resource appeared.. I switched to it now.., thanks for the hint..
>>> this "become-primary-on" was never activated?
>> nope.
>>
>>
>>> Is the drbd init script deactivated on system boot? Cluster logs should
>>> give more insights ....
>> yes, it's deactivated. I tried resyncing drbd by hand, deleted logs,
>> rebooted both nodes, checked drbd ain't started and started corosync.
>> result is here:
>> http://nelide.cz/nik/logs.tar.gz
>
> It really really looks like Pacemaker is too fast when promoting to
> primary ... before the connection to the already up second node can be
> established.

Do you mean we're violating a constraint?
Or is it a problem of the RA returning too soon?

> I see in your logs you have DRBD 8.3.13 userland but
> 8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
> ... there have been fixes that look like addressing this problem.
>
> Another quick-fix, that should also do: add a start-delay of some
> seconds to the start operation of DRBD
>
> ... or fix your after-split-brain policies to automatically solve this
> special type of split-brain (with 0 blocks to sync).
>
> Best Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>>
>> thanks for Your time.
>> n.
>>
>>
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>> >
>>> > thanks a lot in advance
>>> >
>>> > nik
>>> >
>>> >
>>> > On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>>> >> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>> >>> hello,
>>> >>>
>>> >>> I'm trying to solve quite mysterious problem here..
>>> >>> I've got new cluster with bunch of SAS disks for testing purposes.
>>> >>> I've configured DRBDs (in primary/primary configuration)
>>> >>>
>>> >>> when I start drbd using drbdadm, it gets up nicely (both nodes
>>> >>> are Primary, connected).
>>> >>> however when I start it using corosync, I always get split-brain,
>>> >>> although there are no data written, no network disconnection,
>>> >>> anything..
>>> >>
>>> >> your full drbd and Pacemaker configuration please ... some snippets from
>>> >> something are very seldom helpful ...
>>> >>
>>> >> Regards,
>>> >> Andreas
>>> >>
>>> >> --
>>> >> Need help with Pacemaker?
>>> >> http://www.hastexo.com/now
>>> >>
>>> >>>
>>> >>> here's drbd resource config:
>>> >>> primitive drbd-sas0 ocf:linbit:drbd \
>>> >>>     params drbd_resource="drbd-sas0" \
>>> >>>     operations $id="drbd-sas0-operations" \
>>> >>>     op start interval="0" timeout="240s" \
>>> >>>     op stop interval="0" timeout="200s" \
>>> >>>     op promote interval="0" timeout="200s" \
>>> >>>     op demote interval="0" timeout="200s" \
>>> >>>     op monitor interval="179s" role="Master" timeout="150s" \
>>> >>>     op monitor interval="180s" role="Slave" timeout="150s"
>>> >>>
>>> >>> ms ms-drbd-sas0 drbd-sas0 \
>>> >>>     meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" notify="true" globally-unique="false" interleave="true" target-role="Started"
>>> >>>
>>> >>>
>>> >>> here's the dmesg output when pacemaker tries to promote drbd, causing
>>> >>> the splitbrain:
>>> >>> [ 157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
>>> >>> [ 157.646539] block drbd2: disk( Diskless -> Attaching )
>>> >>> [ 157.650364] block drbd2: Found 1 transactions (1 active extents) in activity log.
>>> >>> [ 157.650560] block drbd2: Method to ensure write ordering: drain
>>> >>> [ 157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
>>> >>> [ 157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 pages=2231
>>> >>> [ 157.653760] block drbd2: size = 279 GB (292333844 KB)
>>> >>> [ 157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
>>> >>> [ 157.673722] block drbd2: recounting of set bits took additional 2 jiffies
>>> >>> [ 157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>>> >>> [ 157.673972] block drbd2: disk( Attaching -> UpToDate )
>>> >>> [ 157.674100] block drbd2: attached to UUIDs 0150944D23F16BAE:0000000000000000:8C175205284E3262:8C165205284E3263
>>> >>> [ 157.685539] block drbd2: conn( StandAlone -> Unconnected )
>>> >>> [ 157.685704] block drbd2: Starting receiver thread (from drbd2_worker [6893])
>>> >>> [ 157.685928] block drbd2: receiver (re)started
>>> >>> [ 157.686071] block drbd2: conn( Unconnected -> WFConnection )
>>> >>> [ 158.960577] block drbd2: role( Secondary -> Primary )
>>> >>> [ 158.960815] block drbd2: new current UUID 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
>>> >>> [ 162.686990] block drbd2: Handshake successful: Agreed network protocol version 96
>>> >>> [ 162.687183] block drbd2: conn( WFConnection -> WFReportParams )
>>> >>> [ 162.687404] block drbd2: Starting asender thread (from drbd2_receiver [6927])
>>> >>> [ 162.687741] block drbd2: data-integrity-alg: <not-used>
>>> >>> [ 162.687930] block drbd2: drbd_sync_handshake:
>>> >>> [ 162.688057] block drbd2: self 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 flags:0
>>> >>> [ 162.688244] block drbd2: peer 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 flags:0
>>> >>> [ 162.688428] block drbd2: uuid_compare()=100 by rule 90
>>> >>> [ 162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
>>> >>> [ 162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>>> >>>
>>> >>> to me it seems that it's promoting it too early, and I also
>>> >>> wonder why there is the "new current UUID" stuff?
>>> >>>
>>> >>> I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
>>> >>>
>>> >>> could anybody please try to advise me? I'm sure I'm doing something
>>> >>> stupid, but can't figure out what...
>>> >>>
>>> >>> thanks a lot in advance
>>> >>>
>>> >>> with best regards
>>> >>>
>>> >>> nik
>>> >>>
>>> >>> _______________________________________________
>>> >>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> >>>
>>> >>> Project Home: http://www.clusterlabs.org
>>> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> >>> Bugs: http://bugs.clusterlabs.org
>>
>> --
>> -------------------------------------
>> Ing. Nikola CIPRICH
>> LinuxBox.cz, s.r.o.
>> 28.rijna 168, 709 00 Ostrava
>>
>> tel.: +420 591 166 214
>> fax: +420 596 621 273
>> mobil: +420 777 093 799
>> www.linuxbox.cz
>>
>> mobil servis: +420 737 238 656
>> email servis: ser...@linuxbox.cz
>> -------------------------------------
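
For reference, the two quick fixes Andreas mentions could look roughly like the fragment below. This is a minimal, untested sketch, not a confirmed fix for this cluster: the `start-delay="10s"` value is an arbitrary assumption (the thread only says "some seconds"), and the `after-sb-*` values are one commonly used combination of DRBD 8.3 policies. Which `after-sb-*` policy applies depends on how many nodes are Primary when the split-brain is detected, so if both nodes are promoted before they connect, it is `after-sb-2pri` that matters.

```
# crm shell: the primitive from the original post, with a delay added to
# the start operation (10s is an arbitrary illustrative value)
primitive drbd-sas0 ocf:linbit:drbd \
    params drbd_resource="drbd-sas0" \
    operations $id="drbd-sas0-operations" \
    op start interval="0" timeout="240s" start-delay="10s" \
    op stop interval="0" timeout="200s" \
    op promote interval="0" timeout="200s" \
    op demote interval="0" timeout="200s" \
    op monitor interval="179s" role="Master" timeout="150s" \
    op monitor interval="180s" role="Slave" timeout="150s"

# drbd.conf (8.3 syntax): automatic split-brain recovery policies,
# set in the net section of the resource
resource drbd-sas0 {
    net {
        allow-two-primaries;                 # dual-primary, as in the post
        after-sb-0pri discard-zero-changes;  # no node Primary at detection
        after-sb-1pri discard-secondary;     # exactly one node Primary at detection
        after-sb-2pri disconnect;            # both Primary: no automatic recovery
    }
}
```

The start delay only works around the race between start and promote; testing with the matching 8.3.13 kernel module remains the first suggestion in the thread.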