Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
On 8 Oct 2014, at 9:20 am, Felix Zachlod wrote:

> Hello Andrew,
>
> Am 06.10.2014 04:30, schrieb Andrew Beekhof:
>>
>> On 3 Oct 2014, at 5:07 am, Felix Zachlod wrote:
>>
>>> Am 02.10.2014 18:02, schrieb Digimer:
>>>> On 02/10/14 02:44 AM, Felix Zachlod wrote:
>>>>> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
>>>>
>>>> Please upgrade to 1.1.10+!
>>>
>>> Are you referring to a specific bug or code change? I normally don't like
>>> building all this stuff from source instead of using the packages if there
>>> are not very good reasons for it. I have run some 1.1.7 Debian-based
>>> pacemaker clusters for a long time now without any issue, and this version
>>> seems to run very stable, so as long as I am not facing a specific problem
>>> with this version...
>>
>> According to git, there are 1143 specific problems with 1.1.7.
>> In total there have been 3815 commits and 5 releases in the last 2.5 years;
>> we don't do all that for fun :-)
>
> I know that there have been a lot of changes since this "ancient" version.
> But I was just curious whether there was something specific that might be
> related to my problem. I work closely with software development in our
> company, so I know that "newer" does not automatically mean "with fewer
> bugs", or especially "with fewer bugs concerning ME".

Particularly where the policy engine is concerned, it actually is true, thanks to the 500+ regression tests we have. Also, there have definitely been improvements to master/slave in the last few releases. Check out the release notes; that's where I try to highlight the more interesting/important fixes.

> That's why I suspect "install the recent version" to be trial and error,
> which might help in some cases but does not illuminate the underlying
> problem in any way.
>
>> On the other hand, if both sides think they have up-to-date data it might
>> not be anything to do with pacemaker at all.
>
> That is what I suspect too, and why I passed this question to the drbd
> mailing list. I am now nearly totally convinced that pacemaker isn't doing
> anything wrong here, because the drbd RA sets a master score of 1000 on
> either side, which according to my constraints was the signal for pacemaker
> to promote.
>
> regards, Felix

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
Hello Andrew,

Am 06.10.2014 04:30, schrieb Andrew Beekhof:
> On 3 Oct 2014, at 5:07 am, Felix Zachlod wrote:
>> Am 02.10.2014 18:02, schrieb Digimer:
>>> On 02/10/14 02:44 AM, Felix Zachlod wrote:
>>>> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
>>>
>>> Please upgrade to 1.1.10+!
>>
>> Are you referring to a specific bug or code change? I normally don't like
>> building all this stuff from source instead of using the packages if there
>> are not very good reasons for it. I have run some 1.1.7 Debian-based
>> pacemaker clusters for a long time now without any issue, and this version
>> seems to run very stable, so as long as I am not facing a specific problem
>> with this version...
>
> According to git, there are 1143 specific problems with 1.1.7.
> In total there have been 3815 commits and 5 releases in the last 2.5 years;
> we don't do all that for fun :-)

I know that there have been a lot of changes since this "ancient" version. But I was just curious whether there was something specific that might be related to my problem. I work closely with software development in our company, so I know that "newer" does not automatically mean "with fewer bugs", or especially "with fewer bugs concerning ME". That's why I suspect "install the recent version" to be trial and error, which might help in some cases but does not illuminate the underlying problem in any way.

> On the other hand, if both sides think they have up-to-date data it might
> not be anything to do with pacemaker at all.

That is what I suspect too, and why I passed this question to the drbd mailing list. I am now nearly totally convinced that pacemaker isn't doing anything wrong here, because the drbd RA sets a master score of 1000 on either side, which according to my constraints was the signal for pacemaker to promote.

regards, Felix
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
On 3 Oct 2014, at 5:07 am, Felix Zachlod wrote:

> Am 02.10.2014 18:02, schrieb Digimer:
>> On 02/10/14 02:44 AM, Felix Zachlod wrote:
>>> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
>>
>> Please upgrade to 1.1.10+!
>
> Are you referring to a specific bug or code change? I normally don't like
> building all this stuff from source instead of using the packages if there
> are not very good reasons for it. I have run some 1.1.7 Debian-based
> pacemaker clusters for a long time now without any issue, and this version
> seems to run very stable, so as long as I am not facing a specific problem
> with this version...

According to git, there are 1143 specific problems with 1.1.7. In total there have been 3815 commits and 5 releases in the last 2.5 years; we don't do all that for fun :-)

Also, since our resources are severely constrained, "get something recent" helps us focus our efforts on a limited number of recent releases (of which 1.1.7 isn't one). It's great when something older is working for people, but we generally leave "long term support" to vendors like Red Hat and SUSE.

On the other hand, if both sides think they have up-to-date data it might not be anything to do with pacemaker at all.

> I'd prefer sticking to it rather than putting brand-new stuff from source
> together, which might face other compatibility issues later on.
>
> I am nearly sure that I found a hint to the problem:
>
> adjust_master_score (string, [5 10 1000 1]): master score adjustments
>     Space separated list of four master score adjustments for different
>     scenarios:
>     - only access to 'consistent' data
>     - only remote access to 'uptodate' data
>     - currently Secondary, local access to 'uptodate' data, but remote is
>       unknown
>
> This is from the drbd resource agent's metadata.
>
> As you can see, the RA will report a master score of 1000 if it is Secondary
> and (thinks it) has up-to-date data. According to the logs it is indeed
> reporting 1000...
>
> I set a location rule with a score of -1001 for the Master role, and finally
> Pacemaker waits to promote the nodes to Master until a later monitor action
> notices that the nodes are connected and synced and report a higher master
> score. What is interesting to me is
>
> a) why do both drbd nodes think they have up-to-date data when coming back
> online? At least one should know that it has been disconnected while another
> node was still up, and consider that data might have been changed in the
> meantime. And in case I am rebooting a single node, it can almost be sure
> that it has only "consistent" data, because the other side was still Primary
> when this one shut down.
>
> b) why does obviously nobody else face this problem, as it should behave
> like this in any primary/primary cluster?
>
> but I think I will try passing this on to the drbd mailing list too.
>
> regards, Felix
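The workaround described above can be expressed as a location constraint whose rule outweighs the stale 1000-point master score. A hedged crmsh sketch (the resource name ms_drbd_testdata is made up, and the always-true "defined #uname" expression is just one way to make the rule apply on every node):

```
# Subtract 1001 from every node's promotion score, so a node whose
# DRBD RA reports only 1000 (Secondary, locally UpToDate, peer state
# unknown) ends up below zero and is not promoted.  Only a node whose
# master score exceeds 1001 (e.g. connected and in sync) can be
# promoted to Master.
location loc-no-stale-master ms_drbd_testdata \
    rule $role=Master -1001: defined #uname
```

Note that this only delays promotion until the peers have actually connected; it is a mitigation, not a substitute for fencing.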
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
Am 02.10.2014 18:02, schrieb Digimer:
> On 02/10/14 02:44 AM, Felix Zachlod wrote:
>> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
>
> Please upgrade to 1.1.10+!

Are you referring to a specific bug or code change? I normally don't like building all this stuff from source instead of using the packages if there are not very good reasons for it. I have run some 1.1.7 Debian-based pacemaker clusters for a long time now without any issue, and this version seems to run very stable, so as long as I am not facing a specific problem with this version, I'd prefer sticking to it rather than putting brand-new stuff from source together, which might face other compatibility issues later on.

I am nearly sure that I found a hint to the problem:

adjust_master_score (string, [5 10 1000 1]): master score adjustments
    Space separated list of four master score adjustments for different
    scenarios:
    - only access to 'consistent' data
    - only remote access to 'uptodate' data
    - currently Secondary, local access to 'uptodate' data, but remote is
      unknown

This is from the drbd resource agent's metadata.

As you can see, the RA will report a master score of 1000 if it is Secondary and (thinks it) has up-to-date data. According to the logs it is indeed reporting 1000...

I set a location rule with a score of -1001 for the Master role, and finally Pacemaker waits to promote the nodes to Master until a later monitor action notices that the nodes are connected and synced. What is interesting to me is

a) why do both drbd nodes think they have up-to-date data when coming back online? At least one should know that it has been disconnected while another node was still up, and consider that data might have been changed in the meantime. And in case I am rebooting a single node, it can almost be sure that it has only "consistent" data, because the other side was still Primary when this one shut down.

b) why does obviously nobody else face this problem, as it should behave like this in any primary/primary cluster?

But I think I will try passing this on to the drbd mailing list too.

regards, Felix
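The four adjust_master_score values quoted above are an instance parameter of the ocf:linbit:drbd agent, so they can be tuned where the primitive is defined. A hedged crmsh sketch (resource names, the chosen score values, and monitor intervals here are hypothetical, not taken from this thread):

```
# Hypothetical dual-primary DRBD definition.  Lowering the third value
# (the "Secondary with locally UpToDate data, peer unknown" case) keeps
# a freshly booted node's master score small until it has reconnected
# and resynced, at which point the fourth value applies.
primitive p_drbd_testdata ocf:linbit:drbd \
    params drbd_resource="testdata" \
        adjust_master_score="0 10 100 10000" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
ms ms_drbd_testdata p_drbd_testdata \
    meta master-max="2" clone-max="2" notify="true" interleave="true"
```

With a small third value, promotion effectively waits for a node that reports the connected-and-UpToDate score, without needing a compensating negative location rule.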
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
On 02/10/14 02:44 AM, Felix Zachlod wrote:
> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7

Please upgrade to 1.1.10+!

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
Am 02.10.2014 09:01, schrieb Felix Zachlod:
> Am 02.10.2014 08:44, schrieb Felix Zachlod:
>> Am 01.10.2014 20:46, schrieb Digimer:
>>> At some point along the way, both nodes were Primary while not connected,
>>> even if for just a moment. Your log snippet above shows the results of
>>> this break; it does not appear to speak to the break itself.
>>
>> Even easier to reproduce: the problem also occurs when I stop a drbd
>> resource and later restart it; this always leads to a split brain.
>
> And another thing to add which might be related: I just tried to configure
> the resource's target-role to "Started" or "Slave". But both sides stay in
> Master state... which is unexpected for me too.

Which is wrong again, sorry for that. If I configure "Started" they stay in Master, and if I configure "Slave" they stay in Slave. If you stop the resource with "crm resource stop", it reconfigures target-role to "Stopped", and "crm resource start" configures target-role "Started", which lets both come up in Primary.

regards again.
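The behaviour described above follows from how crmsh drives the target-role meta attribute. A sketch of the commands involved (the resource name ms_drbd_testdata is hypothetical):

```
# "stop" and "start" only flip target-role between Stopped and Started;
# for a master/slave resource, Started permits promotion, so after a
# start both nodes may be promoted to Primary again.
crm resource stop ms_drbd_testdata    # sets meta target-role="Stopped"
crm resource start ms_drbd_testdata   # sets meta target-role="Started"

# To bring the resource up demoted, set the role explicitly instead:
crm resource meta ms_drbd_testdata set target-role Slave
```

With target-role="Slave" the instances run but stay demoted, which matches the "if I configure Slave they stay in Slave" observation.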
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
Am 02.10.2014 08:44, schrieb Felix Zachlod:
> Am 01.10.2014 20:46, schrieb Digimer:
>> At some point along the way, both nodes were Primary while not connected,
>> even if for just a moment. Your log snippet above shows the results of
>> this break; it does not appear to speak to the break itself.
>
> Even easier to reproduce: the problem also occurs when I stop a drbd
> resource and later restart it; this always leads to a split brain.

And another thing to add which might be related: I just tried to configure the resource's target-role to "Started" or "Slave". But both sides stay in Master state... which is unexpected for me too.

regards, Felix
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
Am 01.10.2014 20:46, schrieb Digimer:
> At some point along the way, both nodes were Primary while not connected,
> even if for just a moment. Your log snippet above shows the results of this
> break; it does not appear to speak to the break itself.

Even easier to reproduce: the problem also occurs when I stop a drbd resource and later restart it; this always leads to a split brain.

This is the log from the one side:

Oct 2 08:11:46 storage-test-d kernel: [44936.343453] drbd testdata2: asender terminated
Oct 2 08:11:46 storage-test-d kernel: [44936.343457] drbd testdata2: Terminating drbd_a_testdata
Oct 2 08:11:46 storage-test-d kernel: [44936.362103] drbd testdata2: conn( TearDown -> Disconnecting )
Oct 2 08:11:47 storage-test-d kernel: [44936.450052] drbd testdata2: Connection closed
Oct 2 08:11:47 storage-test-d kernel: [44936.450070] drbd testdata2: conn( Disconnecting -> StandAlone )
Oct 2 08:11:47 storage-test-d kernel: [44936.450074] drbd testdata2: receiver terminated
Oct 2 08:11:47 storage-test-d kernel: [44936.450081] drbd testdata2: Terminating drbd_r_testdata
Oct 2 08:11:47 storage-test-d kernel: [44936.450104] block drbd11: disk( UpToDate -> Failed )
Oct 2 08:11:47 storage-test-d kernel: [44936.514071] block drbd11: bitmap WRITE of 0 pages took 0 jiffies
Oct 2 08:11:47 storage-test-d kernel: [44936.514078] block drbd11: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 2 08:11:47 storage-test-d kernel: [44936.514088] block drbd11: disk( Failed -> Diskless )
Oct 2 08:11:47 storage-test-d kernel: [44936.514793] block drbd11: drbd_bm_resize called with capacity == 0
Oct 2 08:11:47 storage-test-d kernel: [44936.515461] drbd testdata2: Terminating drbd_w_testdata
Oct 2 08:12:16 storage-test-d rsyslogd-2177: imuxsock lost 124 messages from pid 2748 due to rate-limiting
Oct 2 08:13:06 storage-test-d kernel: [45016.120378] drbd testdata2: Starting worker thread (from drbdsetup-84 [10353])
Oct 2 08:13:06 storage-test-d kernel: [45016.121012] block drbd11: disk( Diskless -> Attaching )
Oct 2 08:13:06 storage-test-d kernel: [45016.121812] drbd testdata2: Method to ensure write ordering: drain
Oct 2 08:13:06 storage-test-d kernel: [45016.121817] block drbd11: max BIO size = 1048576
Oct 2 08:13:06 storage-test-d kernel: [45016.121825] block drbd11: drbd_bm_resize called with capacity == 838835128
Oct 2 08:13:06 storage-test-d kernel: [45016.127192] block drbd11: resync bitmap: bits=104854391 words=1638350 pages=3200
Oct 2 08:13:06 storage-test-d kernel: [45016.127199] block drbd11: size = 400 GB (419417564 KB)
Oct 2 08:13:06 storage-test-d kernel: [45016.321361] block drbd11: recounting of set bits took additional 2 jiffies
Oct 2 08:13:06 storage-test-d kernel: [45016.321369] block drbd11: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 2 08:13:06 storage-test-d kernel: [45016.321382] block drbd11: disk( Attaching -> UpToDate )
Oct 2 08:13:06 storage-test-d kernel: [45016.321388] block drbd11: attached to UUIDs 28A688FAC06E2662::0EABC2724124755C:0EAAC2724124755C
Oct 2 08:13:06 storage-test-d kernel: [45016.376555] drbd testdata2: conn( StandAlone -> Unconnected )
Oct 2 08:13:06 storage-test-d kernel: [45016.376634] drbd testdata2: Starting receiver thread (from drbd_w_testdata [10355])
Oct 2 08:13:06 storage-test-d kernel: [45016.376876] drbd testdata2: receiver (re)started
Oct 2 08:13:06 storage-test-d kernel: [45016.376897] drbd testdata2: conn( Unconnected -> WFConnection )
Oct 2 08:13:07 storage-test-d rsyslogd-2177: imuxsock begins to drop messages from pid 2748 due to rate-limiting
Oct 2 08:13:07 storage-test-d kernel: [45016.707045] block drbd11: role( Secondary -> Primary )
Oct 2 08:13:07 storage-test-d kernel: [45016.729180] block drbd11: new current UUID C58090DF57933525:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C
Oct 2 08:13:07 storage-test-d kernel: [45016.876920] drbd testdata2: Handshake successful: Agreed network protocol version 101
Oct 2 08:13:07 storage-test-d kernel: [45016.876926] drbd testdata2: Agreed to support TRIM on protocol level
Oct 2 08:13:07 storage-test-d kernel: [45016.876999] drbd testdata2: conn( WFConnection -> WFReportParams )
Oct 2 08:13:07 storage-test-d kernel: [45016.877013] drbd testdata2: Starting asender thread (from drbd_r_testdata [10376])
Oct 2 08:13:07 storage-test-d kernel: [45017.015220] block drbd11: drbd_sync_handshake:
Oct 2 08:13:07 storage-test-d kernel: [45017.015228] block drbd11: self C58090DF57933525:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C bits:0 flags:0
Oct 2 08:13:07 storage-test-d kernel: [45017.015234] block drbd11: peer 7F282664519D49A1:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C bits:0 flags:0
Oct 2 08:13:07 storage-test-d kernel: [45017.015239] block drbd11: uuid_compare()=100 by rule 90
Oct 2 08:13:07 storage-test-d kernel: [45017.015247] block drbd11: helper command: /sbin/drbdadm initial-split-brai
Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
On 01/10/14 02:21 PM, Felix Zachlod wrote:
> Hello! I'm currently experimenting with how a good DRBD Dual Primary setup
> can be achieved with Pacemaker. I know all the "you have to have good
> fencing in place" things... that is just what I'm currently trying to test
> in my setup, besides other things.

Good fencing prevents split-brains. You really do need it, and now is the time to get it working.

> But even without a node crashing or the link dropping, I already have the
> problem that I always run into a split brain situation when a node comes up
> that was e.g. in Standby before.

Can you configure fencing, then reproduce? It will make debugging easier. Without fencing, things (drbd and pacemaker included) behave in somewhat unpredictable ways. With fencing, it should be easier to isolate the cause of the break.

> For example: I have both nodes running, connected, both Primary; everything
> is fine. I put one node into standby, and DRBD is stopped on this node. I do
> some work, reboot the server and so on; finally I try to rejoin the node in
> the cluster. Pacemaker starts all resources, and finally DRBD drops the
> connection, informing me about a split brain.
> In the log this looks like:
>
> Oct 1 19:44:42 storage-test-d kernel: [ 111.138512] block drbd10: disk( Diskless -> Attaching )
> Oct 1 19:44:42 storage-test-d kernel: [ 111.139283] drbd testdata1: Method to ensure write ordering: drain
> Oct 1 19:44:42 storage-test-d kernel: [ 111.139288] block drbd10: max BIO size = 1048576
> Oct 1 19:44:42 storage-test-d kernel: [ 111.139296] block drbd10: drbd_bm_resize called with capacity == 838835128
> Oct 1 19:44:42 storage-test-d kernel: [ 111.144488] block drbd10: resync bitmap: bits=104854391 words=1638350 pages=3200
> Oct 1 19:44:42 storage-test-d kernel: [ 111.144494] block drbd10: size = 400 GB (419417564 KB)
> Oct 1 19:44:42 storage-test-d kernel: [ 111.289327] block drbd10: recounting of set bits took additional 3 jiffies
> Oct 1 19:44:42 storage-test-d kernel: [ 111.289334] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Oct 1 19:44:42 storage-test-d kernel: [ 111.289346] block drbd10: disk( Attaching -> UpToDate )
> Oct 1 19:44:42 storage-test-d kernel: [ 111.289352] block drbd10: attached to UUIDs A41D74E79299A144::86B0140AA1A527C0:86AF140AA1A527C1
> Oct 1 19:44:42 storage-test-d kernel: [ 111.321564] drbd testdata2: conn( StandAlone -> Unconnected )
> Oct 1 19:44:42 storage-test-d kernel: [ 111.321628] drbd testdata2: Starting receiver thread (from drbd_w_testdata [3211])
> Oct 1 19:44:42 storage-test-d kernel: [ 111.321794] drbd testdata2: receiver (re)started
> Oct 1 19:44:42 storage-test-d kernel: [ 111.321822] drbd testdata2: conn( Unconnected -> WFConnection )
> Oct 1 19:44:42 storage-test-d kernel: [ 111.337708] drbd testdata1: conn( StandAlone -> Unconnected )
> Oct 1 19:44:42 storage-test-d kernel: [ 111.337764] drbd testdata1: Starting receiver thread (from drbd_w_testdata [3215])
> Oct 1 19:44:42 storage-test-d kernel: [ 111.337904] drbd testdata1: receiver (re)started
> Oct 1 19:44:42 storage-test-d kernel: [ 111.337927] drbd testdata1: conn( Unconnected -> WFConnection )
> Oct 1 19:44:43 storage-test-d kernel: [ 111.808897] block drbd10: role( Secondary -> Primary )
> Oct 1 19:44:43 storage-test-d kernel: [ 111.810883] block drbd11: role( Secondary -> Primary )
> Oct 1 19:44:43 storage-test-d kernel: [ 111.820040] drbd testdata2: Handshake successful: Agreed network protocol version 101
> Oct 1 19:44:43 storage-test-d kernel: [ 111.820046] drbd testdata2: Agreed to support TRIM on protocol level
> Oct 1 19:44:43 storage-test-d kernel: [ 111.823292] block drbd10: new current UUID 8369EB6F395C0D29:A41D74E79299A144:86B0140AA1A527C0:86AF140AA1A527C1
> Oct 1 19:44:43 storage-test-d kernel: [ 111.836096] drbd testdata1: Handshake successful: Agreed network protocol version 101
> Oct 1 19:44:43 storage-test-d kernel: [ 111.836108] drbd testdata1: Agreed to support TRIM on protocol level
> Oct 1 19:44:43 storage-test-d kernel: [ 111.848917] block drbd11: new current UUID 69A056C665A38F35:C8B4320C2FE11A0C:D13C0AA6DC58CC8C:D13B0AA6DC58CC8D
> Oct 1 19:44:43 storage-test-d kernel: [ 111.871100] drbd testdata2: conn( WFConnection -> WFReportParams )
> Oct 1 19:44:43 storage-test-d kernel: [ 111.871108] drbd testdata2: Starting asender thread (from drbd_r_testdata [3249])
> Oct 1 19:44:43 storage-test-d kernel: [ 111.909687] drbd testdata1: conn( WFConnection -> WFReportParams )
> Oct 1 19:44:43 storage-test-d kernel: [ 111.909695] drbd testdata1: Starting asender thread (from drbd_r_testdata [3270])
> Oct 1 19:44:43 storage-test-d kernel: [ 111.943986] drbd testdata2: meta connection shut down by peer.
> Oct 1 19:44:43 storage-test-d kernel: [ 111.944063] drbd testdata2: conn( WFReportParams -> NetworkFailure )
> Oct 1 19:44:43 storage-test-d kernel: [ 111.944067] drbd testdata2: asender terminated
> Oct 1 19:44:43 storage-test-d kernel: [ 111.944070] drbd testdat
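The fencing Digimer recommends is usually wired up at two levels: STONITH devices in Pacemaker, plus DRBD's fence-peer handler, which turns a DRBD fencing event into a Pacemaker constraint that blocks promotion on the outdated peer. A hedged drbd 8.4 configuration sketch (the resource name "testdata" is hypothetical; the handler scripts ship with drbd 8.4):

```
# /etc/drbd.d/testdata.res (fragment): escalate DRBD fencing to Pacemaker
resource testdata {
    disk {
        # On replication-link loss, freeze I/O and call the fence-peer
        # handler before resuming.
        fencing resource-and-stonith;
    }
    handlers {
        # Place / remove a constraint forbidding promotion of the peer:
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
```

This assumes the cluster itself has stonith-enabled=true and working STONITH devices configured; without them crm-fence-peer.sh cannot do its job.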
[Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains
Hello!

I'm currently experimenting with how a good DRBD Dual Primary setup can be achieved with Pacemaker. I know all the "you have to have good fencing in place" things... that is just what I'm currently trying to test in my setup, besides other things. But even without a node crashing or the link dropping, I already have the problem that I always run into a split brain situation when a node comes up that was e.g. in Standby before.

For example: I have both nodes running, connected, both Primary; everything is fine. I put one node into standby, and DRBD is stopped on this node. I do some work, reboot the server and so on; finally I try to rejoin the node in the cluster. Pacemaker starts all resources, and finally DRBD drops the connection, informing me about a split brain.

In the log this looks like:

Oct 1 19:44:42 storage-test-d kernel: [ 111.138512] block drbd10: disk( Diskless -> Attaching )
Oct 1 19:44:42 storage-test-d kernel: [ 111.139283] drbd testdata1: Method to ensure write ordering: drain
Oct 1 19:44:42 storage-test-d kernel: [ 111.139288] block drbd10: max BIO size = 1048576
Oct 1 19:44:42 storage-test-d kernel: [ 111.139296] block drbd10: drbd_bm_resize called with capacity == 838835128
Oct 1 19:44:42 storage-test-d kernel: [ 111.144488] block drbd10: resync bitmap: bits=104854391 words=1638350 pages=3200
Oct 1 19:44:42 storage-test-d kernel: [ 111.144494] block drbd10: size = 400 GB (419417564 KB)
Oct 1 19:44:42 storage-test-d kernel: [ 111.289327] block drbd10: recounting of set bits took additional 3 jiffies
Oct 1 19:44:42 storage-test-d kernel: [ 111.289334] block drbd10: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Oct 1 19:44:42 storage-test-d kernel: [ 111.289346] block drbd10: disk( Attaching -> UpToDate )
Oct 1 19:44:42 storage-test-d kernel: [ 111.289352] block drbd10: attached to UUIDs A41D74E79299A144::86B0140AA1A527C0:86AF140AA1A527C1
Oct 1 19:44:42 storage-test-d kernel: [ 111.321564] drbd testdata2: conn( StandAlone -> Unconnected )
Oct 1 19:44:42 storage-test-d kernel: [ 111.321628] drbd testdata2: Starting receiver thread (from drbd_w_testdata [3211])
Oct 1 19:44:42 storage-test-d kernel: [ 111.321794] drbd testdata2: receiver (re)started
Oct 1 19:44:42 storage-test-d kernel: [ 111.321822] drbd testdata2: conn( Unconnected -> WFConnection )
Oct 1 19:44:42 storage-test-d kernel: [ 111.337708] drbd testdata1: conn( StandAlone -> Unconnected )
Oct 1 19:44:42 storage-test-d kernel: [ 111.337764] drbd testdata1: Starting receiver thread (from drbd_w_testdata [3215])
Oct 1 19:44:42 storage-test-d kernel: [ 111.337904] drbd testdata1: receiver (re)started
Oct 1 19:44:42 storage-test-d kernel: [ 111.337927] drbd testdata1: conn( Unconnected -> WFConnection )
Oct 1 19:44:43 storage-test-d kernel: [ 111.808897] block drbd10: role( Secondary -> Primary )
Oct 1 19:44:43 storage-test-d kernel: [ 111.810883] block drbd11: role( Secondary -> Primary )
Oct 1 19:44:43 storage-test-d kernel: [ 111.820040] drbd testdata2: Handshake successful: Agreed network protocol version 101
Oct 1 19:44:43 storage-test-d kernel: [ 111.820046] drbd testdata2: Agreed to support TRIM on protocol level
Oct 1 19:44:43 storage-test-d kernel: [ 111.823292] block drbd10: new current UUID 8369EB6F395C0D29:A41D74E79299A144:86B0140AA1A527C0:86AF140AA1A527C1
Oct 1 19:44:43 storage-test-d kernel: [ 111.836096] drbd testdata1: Handshake successful: Agreed network protocol version 101
Oct 1 19:44:43 storage-test-d kernel: [ 111.836108] drbd testdata1: Agreed to support TRIM on protocol level
Oct 1 19:44:43 storage-test-d kernel: [ 111.848917] block drbd11: new current UUID 69A056C665A38F35:C8B4320C2FE11A0C:D13C0AA6DC58CC8C:D13B0AA6DC58CC8D
Oct 1 19:44:43 storage-test-d kernel: [ 111.871100] drbd testdata2: conn( WFConnection -> WFReportParams )
Oct 1 19:44:43 storage-test-d kernel: [ 111.871108] drbd testdata2: Starting asender thread (from drbd_r_testdata [3249])
Oct 1 19:44:43 storage-test-d kernel: [ 111.909687] drbd testdata1: conn( WFConnection -> WFReportParams )
Oct 1 19:44:43 storage-test-d kernel: [ 111.909695] drbd testdata1: Starting asender thread (from drbd_r_testdata [3270])
Oct 1 19:44:43 storage-test-d kernel: [ 111.943986] drbd testdata2: meta connection shut down by peer.
Oct 1 19:44:43 storage-test-d kernel: [ 111.944063] drbd testdata2: conn( WFReportParams -> NetworkFailure )
Oct 1 19:44:43 storage-test-d kernel: [ 111.944067] drbd testdata2: asender terminated
Oct 1 19:44:43 storage-test-d kernel: [ 111.944070] drbd testdata2: Terminating drbd_a_testdata
Oct 1 19:44:43 storage-test-d kernel: [ 111.988005] drbd testdata1: meta connection shut down by peer.
Oct 1 19:44:43 storage-test-d kernel: [ 111.988089] drbd testdata1: conn( WFReportParams -> NetworkFailure )
Oct 1 19:44:43 storage-test-d kernel: [ 111.988094] drbd testdata1: asender terminated
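For reference, a dual-primary setup like the one described needs a few drbd 8.4 net options, and DRBD can optionally be told how to resolve (not prevent) split brains automatically. A hedged sketch (resource name hypothetical; the after-sb policies can discard data, and fencing remains the real fix):

```
# /etc/drbd.d/testdata.res (fragment), hypothetical names
resource testdata {
    net {
        protocol C;                # synchronous replication, required for dual-primary
        allow-two-primaries yes;   # permit Primary/Primary

        # Automatic split-brain recovery policies (use with care):
        after-sb-0pri discard-zero-changes;  # neither Primary: sync from the side that changed data
        after-sb-1pri discard-secondary;     # one Primary: discard the Secondary's changes
        after-sb-2pri disconnect;            # both Primary: give up, manual recovery
    }
}
```

With both nodes Primary at split-brain detection, as in the logs above, after-sb-2pri applies, and "disconnect" leaves exactly the StandAlone state the thread is debugging.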