On 07/11/2012 09:23 AM, Nikola Ciprich wrote:
>>> It really really looks like Pacemaker is too fast when promoting to
>>> primary ... before the connection to the already up second node can be
>>> established.
>>
>> Do you mean we're violating a constraint?
>> Or is it a problem of the RA returning too soon?
> dunno, I tried older drbd userspaces to check if it's not problem
> of newer RA, to no avail...
>
>>> I see in your logs you have DRBD 8.3.13 userland but
>>> 8.3.11 DRBD module installed ... can you test with 8.3.13 kernel module
>>> ... there have been fixes that look like addressing this problem.
> tried 8.3.13 userspace + 8.3.13 module (on top of 3.0.36 kernel),
> unfortunately same result..
>
>>> Another quick-fix, that should also do: add a start-delay of some
>>> seconds to the start operation of DRBD
>>>
>>> ... or fix your after-split-brain policies to automatically solve this
>>> special type of split-brain (with 0 blocks to sync).
> I'll try that, although I'd not like to use this for production :)

Well, I'd expect that to be safer than your current configuration ...
discard-zero-changes will never overwrite data automatically ... have
you tried adding the start-delay to the DRBD start operation? I'm curious
whether that is already sufficient for your problem.

Regards,
Andreas

>
>>> Best Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>>> thanks for Your time.
>>>> n.
>>>>
>>>>> Regards,
>>>>> Andreas
>>>>>
>>>>>> thanks a lot in advance
>>>>>>
>>>>>> nik
>>>>>>
>>>>>> On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>>>>>>> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>>>>>>> hello,
>>>>>>>>
>>>>>>>> I'm trying to solve quite mysterious problem here..
>>>>>>>> I've got new cluster with bunch of SAS disks for testing purposes.
>>>>>>>> I've configured DRBDs (in primary/primary configuration)
>>>>>>>>
>>>>>>>> when I start drbd using drbdadm, it get's up nicely (both nodes
>>>>>>>> are Primary, connected).
>>>>>>>> however when I start it using corosync, I always get split-brain,
>>>>>>>> although
>>>>>>>> there are no data written, no network disconnection, anything..
>>>>>>>
>>>>>>> your full drbd and Pacemaker configuration please ... some snippets from
>>>>>>> something are very seldom helpful ...
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andreas
>>>>>>>
>>>>>>>> here's drbd resource config:
>>>>>>>> primitive drbd-sas0 ocf:linbit:drbd \
>>>>>>>>         params drbd_resource="drbd-sas0" \
>>>>>>>>         operations $id="drbd-sas0-operations" \
>>>>>>>>         op start interval="0" timeout="240s" \
>>>>>>>>         op stop interval="0" timeout="200s" \
>>>>>>>>         op promote interval="0" timeout="200s" \
>>>>>>>>         op demote interval="0" timeout="200s" \
>>>>>>>>         op monitor interval="179s" role="Master" timeout="150s" \
>>>>>>>>         op monitor interval="180s" role="Slave" timeout="150s"
>>>>>>>>
>>>>>>>> ms ms-drbd-sas0 drbd-sas0 \
>>>>>>>>         meta clone-max="2" clone-node-max="1" master-max="2"
>>>>>>>>         master-node-max="1" notify="true" globally-unique="false"
>>>>>>>>         interleave="true" target-role="Started"
>>>>>>>>
>>>>>>>> here's the dmesg output when pacemaker tries to promote drbd, causing
>>>>>>>> the splitbrain:
>>>>>>>> [ 157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
>>>>>>>> [ 157.646539] block drbd2: disk( Diskless -> Attaching )
>>>>>>>> [ 157.650364] block drbd2: Found 1 transactions (1 active extents) in activity log.
>>>>>>>> [ 157.650560] block drbd2: Method to ensure write ordering: drain
>>>>>>>> [ 157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
>>>>>>>> [ 157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 pages=2231
>>>>>>>> [ 157.653760] block drbd2: size = 279 GB (292333844 KB)
>>>>>>>> [ 157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
>>>>>>>> [ 157.673722] block drbd2: recounting of set bits took additional 2 jiffies
>>>>>>>> [ 157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>>>>>>>> [ 157.673972] block drbd2: disk( Attaching -> UpToDate )
>>>>>>>> [ 157.674100] block drbd2: attached to UUIDs 0150944D23F16BAE:0000000000000000:8C175205284E3262:8C165205284E3263
>>>>>>>> [ 157.685539] block drbd2: conn( StandAlone -> Unconnected )
>>>>>>>> [ 157.685704] block drbd2: Starting receiver thread (from drbd2_worker [6893])
>>>>>>>> [ 157.685928] block drbd2: receiver (re)started
>>>>>>>> [ 157.686071] block drbd2: conn( Unconnected -> WFConnection )
>>>>>>>> [ 158.960577] block drbd2: role( Secondary -> Primary )
>>>>>>>> [ 158.960815] block drbd2: new current UUID 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
>>>>>>>> [ 162.686990] block drbd2: Handshake successful: Agreed network protocol version 96
>>>>>>>> [ 162.687183] block drbd2: conn( WFConnection -> WFReportParams )
>>>>>>>> [ 162.687404] block drbd2: Starting asender thread (from drbd2_receiver [6927])
>>>>>>>> [ 162.687741] block drbd2: data-integrity-alg: <not-used>
>>>>>>>> [ 162.687930] block drbd2: drbd_sync_handshake:
>>>>>>>> [ 162.688057] block drbd2: self 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 flags:0
>>>>>>>> [ 162.688244] block drbd2: peer 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 flags:0
>>>>>>>> [ 162.688428] block drbd2: uuid_compare()=100 by rule 90
>>>>>>>> [ 162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
>>>>>>>> [ 162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>>>>>>>>
>>>>>>>> to me it seems to be that it's promoting it too early, and I also
>>>>>>>> wonder why there is the
>>>>>>>> "new current UUID" stuff?
>>>>>>>>
>>>>>>>> I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
>>>>>>>>
>>>>>>>> could anybody please try to advice me? I'm sure I'm doing something
>>>>>>>> stupid, but can't figure out what...
>>>>>>>>
>>>>>>>> thanks a lot in advance
>>>>>>>>
>>>>>>>> with best regards
>>>>>>>>
>>>>>>>> nik
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>
>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>> Getting started:
>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>
>>>> --
>>>> -------------------------------------
>>>> Ing. Nikola CIPRICH
>>>> LinuxBox.cz, s.r.o.
>>>> 28.rijna 168, 709 00 Ostrava
>>>>
>>>> tel.: +420 591 166 214
>>>> fax: +420 596 621 273
>>>> mobil: +420 777 093 799
>>>> www.linuxbox.cz
>>>>
>>>> mobil servis: +420 737 238 656
>>>> email servis: ser...@linuxbox.cz
>>>> -------------------------------------

--
Need help with Pacemaker?
http://www.hastexo.com/now
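
For reference, a minimal sketch of what the start-delay variant could look like, based on the primitive posted earlier in this thread. Only the start-delay attribute on the start operation is new; the 10s value is purely an assumption to be tuned by testing, not a recommended setting.

```
# Assumption: start-delay="10s" is an illustrative value only; everything
# else is copied unchanged from the configuration posted in this thread.
primitive drbd-sas0 ocf:linbit:drbd \
        params drbd_resource="drbd-sas0" \
        operations $id="drbd-sas0-operations" \
        op start interval="0" timeout="240s" start-delay="10s" \
        op stop interval="0" timeout="200s" \
        op promote interval="0" timeout="200s" \
        op demote interval="0" timeout="200s" \
        op monitor interval="179s" role="Master" timeout="150s" \
        op monitor interval="180s" role="Slave" timeout="150s"
```

The alternative would be to set after-sb-0pri to discard-zero-changes in the net section of the DRBD resource, so that a split-brain with 0 blocks to sync is resolved automatically.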
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org