Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-07 Thread Andrew Beekhof

On 8 Oct 2014, at 9:20 am, Felix Zachlod  wrote:

> Hello Andrew,
> 
> Am 06.10.2014 04:30, schrieb Andrew Beekhof:
>> 
>> On 3 Oct 2014, at 5:07 am, Felix Zachlod  wrote:
>> 
>>> Am 02.10.2014 18:02, schrieb Digimer:
 On 02/10/14 02:44 AM, Felix Zachlod wrote:
> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
 
 Please upgrade to 1.1.10+!
 
>>> 
>>> Are you referring to a specific bug or code change? I normally don't like 
>>> building all this stuff from source instead of using the packages if there are 
>>> not very good reasons for it. I have been running some 1.1.7 Debian-based pacemaker 
>>> clusters for a long time now without any issue, and this 
>>> version seems to run very stably, so as long as I am not facing a specific 
>>> problem with this version
>> 
>> According to git, there are 1143 specific problems with 1.1.7.
>> In total there have been 3815 commits and 5 releases in the last 2.5 years; 
>> we don't do all that for fun :-)
> 
> I know that there have been a lot of changes since this "ancient" version. But I 
> was just curious if there was something specific that might be related to 
> my problem. I work closely with software development in our company, 
> so I know that "newer" does not automatically mean "with fewer bugs" or 
> especially "with fewer bugs concerning ME".

Particularly where the policy engine is concerned, it is actually true thanks 
to the 500+ regression tests we have.
Also, there have definitely been improvements to master/slave in the last few 
releases.

Check out the release notes; that's where I try to highlight the more 
interesting/important fixes.

> That's why I suspect "install the latest version" to be trial and error - which 
> might well help in some cases but does not shed any light on the underlying 
> problem.
> 
>> On the other hand, if both sides think they have up-to-date data it might 
>> not be anything to do with pacemaker at all.
> 
> That is what I suspect too, and why I passed this question on to the drbd 
> mailing list. I am now almost completely convinced that pacemaker isn't doing 
> anything wrong here, because the drbd RA sets a master score of 1000 on either 
> side, which according to my constraints was the signal for pacemaker to promote.
> 
> regards, Felix
> 


Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-07 Thread Felix Zachlod

Hello Andrew,

Am 06.10.2014 04:30, schrieb Andrew Beekhof:


On 3 Oct 2014, at 5:07 am, Felix Zachlod  wrote:


Am 02.10.2014 18:02, schrieb Digimer:

On 02/10/14 02:44 AM, Felix Zachlod wrote:

I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7


Please upgrade to 1.1.10+!



Are you referring to a specific bug or code change? I normally don't like building 
all this stuff from source instead of using the packages if there are not very 
good reasons for it. I have been running some 1.1.7 Debian-based pacemaker clusters for a 
long time now without any issue, and this version seems to run very 
stably, so as long as I am not facing a specific problem with this version


According to git, there are 1143 specific problems with 1.1.7.
In total there have been 3815 commits and 5 releases in the last 2.5 years; we 
don't do all that for fun :-)


I know that there have been a lot of changes since this "ancient" version. 
But I was just curious if there was something specific that might be 
related to my problem. I work closely with software development 
in our company, so I know that "newer" does not automatically mean 
"with fewer bugs" or especially "with fewer bugs concerning ME". That's why 
I suspect "install the latest version" to be trial and error - which 
might well help in some cases but does not shed any light on the underlying 
problem.



On the other hand, if both sides think they have up-to-date data it might not 
be anything to do with pacemaker at all.


That is what I suspect too, and why I passed this question on to the drbd 
mailing list. I am now almost completely convinced that pacemaker isn't 
doing anything wrong here, because the drbd RA sets a master score of 1000 
on either side, which according to my constraints was the signal for 
pacemaker to promote.


regards, Felix



Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-05 Thread Andrew Beekhof

On 3 Oct 2014, at 5:07 am, Felix Zachlod  wrote:

> Am 02.10.2014 18:02, schrieb Digimer:
>> On 02/10/14 02:44 AM, Felix Zachlod wrote:
>>> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
>> 
>> Please upgrade to 1.1.10+!
>> 
> 
> Are you referring to a specific bug or code change? I normally don't like 
> building all this stuff from source instead of using the packages if there are 
> not very good reasons for it. I have been running some 1.1.7 Debian-based pacemaker clusters 
> for a long time now without any issue, and this version seems 
> to run very stably, so as long as I am not facing a specific problem with this 
> version

According to git, there are 1143 specific problems with 1.1.7.
In total there have been 3815 commits and 5 releases in the last 2.5 years; we 
don't do all that for fun :-)

Also, since our resources are severely constrained, "get something recent" 
helps us focus our efforts on a limited number of recent releases (of which 
1.1.7 isn't one).
It's great when something older is working for people, but we generally leave 
"long term support" to vendors like Red Hat and SUSE.

On the other hand, if both sides think they have up-to-date data it might not 
be anything to do with pacemaker at all.

> I'd prefer sticking to it rather than putting brand-new stuff together from 
> source, which might run into other compatibility issues later on.
> 
> 
> I am nearly sure that I found a hint to the problem:
> 
> adjust_master_score (string, [5 10 1000 1]): master score adjustments
>Space separated list of four master score adjustments for different 
> scenarios:
> - only access to 'consistent' data
> - only remote access to 'uptodate' data
> - currently Secondary, local access to 'uptodate' data, but remote is 
> unknown
> 
> This is from the drbd resource agent's meta data.
> 
> As you can see the RA will report a master score of 1000 if it is secondary 
> and (thinks) it has up-to-date data. According to the logs it is reporting 
> 1000 though... I set a location rule with a score of -1001 for the Master 
> role, and now Pacemaker waits to promote the nodes to Master until a later 
> monitor action notices that the nodes are connected and synced 
> and report a MS of 1. What is interesting to me is
> 
> a) why do both drbd nodes think they have up-to-date data when coming back 
> online - at least one should know that it was disconnected while another 
> node was still up, and consider that the data might have changed in the 
> meantime. And in case I am rebooting a single node, it can be almost certain 
> that it only has "consistent" data, because the other side was still primary 
> when this one shut down
> 
> b) why does apparently nobody else face this problem, since it should behave 
> like this in any primary/primary cluster
> 
> but I think I will try passing this on to the drbd mailing list too.
> 
> regards, Felix
> 
> 
> 


Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Felix Zachlod

Am 02.10.2014 18:02, schrieb Digimer:

On 02/10/14 02:44 AM, Felix Zachlod wrote:

I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7


Please upgrade to 1.1.10+!



Are you referring to a specific bug or code change? I normally don't like 
building all this stuff from source instead of using the packages if there 
are not very good reasons for it. I have been running some 1.1.7 Debian-based pacemaker 
clusters for a long time now without any issue, and this 
version seems to run very stably, so as long as I am not facing a 
specific problem with this version I'd prefer sticking to it rather than 
putting brand-new stuff together from source, which might run into other 
compatibility issues later on.



I am fairly sure that I have found a clue to the problem:

adjust_master_score (string, [5 10 1000 1]): master score adjustments
Space separated list of four master score adjustments for different 
scenarios:

 - only access to 'consistent' data
 - only remote access to 'uptodate' data
 - currently Secondary, local access to 'uptodate' data, but remote 
is unknown


This is from the drbd resource agent's meta data.

As you can see the RA will report a master score of 1000 if it is 
secondary and (thinks) it has up-to-date data. According to the logs it 
is reporting 1000 though... I set a location rule with a score of -1001 
for the Master role (see the sketch below), and now Pacemaker waits to 
promote the nodes to Master until a later monitor action notices that the 
nodes are connected and synced and report a MS of 1. What is 
interesting to me is


a) why do both drbd nodes think they have up-to-date data when coming back 
online - at least one should know that it was disconnected while 
another node was still up, and consider that the data might have changed 
in the meantime. And in case I am rebooting a single node, it can almost 
be certain that it only has "consistent" data, because the other side was 
still primary when this one shut down


b) why does apparently nobody else face this problem, since it should behave 
like this in any primary/primary cluster


but I think I will try passing this on to the drbd mailing list too.
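
For illustration, a location rule of the kind described above might look 
roughly like this in crm syntax (the resource names are hypothetical, and the 
adjust_master_score values are only an example of overriding the RA parameter, 
not its defaults):

  # keep the Master role below the promotion threshold on every node until
  # the RA reports its higher "connected and up to date" master score
  location l_drbd_no_early_master ms_drbd_testdata2 \
          rule $role=Master -1001: defined #uname

  # the RA parameter from the metadata above could likewise be overridden
  # on the primitive itself
  primitive p_drbd_testdata2 ocf:linbit:drbd \
          params drbd_resource="testdata2" adjust_master_score="0 10 1000 10000" \
          op monitor role="Master" interval="29s" \
          op monitor role="Slave" interval="31s"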

regards, Felix





Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Digimer

On 02/10/14 02:44 AM, Felix Zachlod wrote:

I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7


Please upgrade to 1.1.10+!

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?




Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Felix Zachlod

Am 02.10.2014 09:01, schrieb Felix Zachlod:

Am 02.10.2014 08:44, schrieb Felix Zachlod:

Am 01.10.2014 20:46, schrieb Digimer:


At some point along the way, both nodes were Primary while not
connected, even if for just a moment. Your log snippet above shows the
results of this break, they do not appear to speak to the break itself.


Even easier to reproduce is the problem when I stop a drbd
resource and later restart it; this always leads to a split brain.


And another thing to add which might be related: I just tried to
configure the resource's target-role to "Started" or "Slave"

But both sides stay in Master state... which is unexpected for me too.


Which is wrong again... sorry for that. If I configure "Started" they 
stay in Master, and if I configure "Slave" they stay in Slave. If you stop 
the resource with

crm resource stop

it reconfigures target-role to "Stopped", and if you run

crm resource start

it configures target-role to "Started", which lets both come up in primary.
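
For reference, the same thing can be done explicitly through the meta 
attribute, roughly like this (the resource name is hypothetical):

  crm resource meta ms_drbd_testdata2 show target-role       # inspect the current value
  crm resource meta ms_drbd_testdata2 set target-role Slave  # keep both sides Secondary
  crm resource meta ms_drbd_testdata2 delete target-role     # drop the override again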


regards again.



Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Felix Zachlod

Am 02.10.2014 08:44, schrieb Felix Zachlod:

Am 01.10.2014 20:46, schrieb Digimer:


At some point along the way, both nodes were Primary while not
connected, even if for just a moment. Your log snippet above shows the
results of this break, they do not appear to speak to the break itself.


Even easier to reproduce is the problem when I stop a drbd
resource and later restart it; this always leads to a split brain.


And another thing to add which might be related: I just tried to 
configure the resource's target-role to "Started" or "Slave"


But both sides stay in Master state... which is unexpected for me too.

regards, Felix



Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-01 Thread Felix Zachlod

Am 01.10.2014 20:46, schrieb Digimer:


At some point along the way, both nodes were Primary while not
connected, even if for just a moment. Your log snippet above shows the
results of this break, they do not appear to speak to the break itself.


Even easier to reproduce is the problem when I stop a drbd 
resource and later restart it; this always leads to a split brain.


This is the log from one side:


Oct  2 08:11:46 storage-test-d kernel: [44936.343453] drbd testdata2: 
asender terminated
Oct  2 08:11:46 storage-test-d kernel: [44936.343457] drbd testdata2: 
Terminating drbd_a_testdata
Oct  2 08:11:46 storage-test-d kernel: [44936.362103] drbd testdata2: 
conn( TearDown -> Disconnecting )
Oct  2 08:11:47 storage-test-d kernel: [44936.450052] drbd testdata2: 
Connection closed
Oct  2 08:11:47 storage-test-d kernel: [44936.450070] drbd testdata2: 
conn( Disconnecting -> StandAlone )
Oct  2 08:11:47 storage-test-d kernel: [44936.450074] drbd testdata2: 
receiver terminated
Oct  2 08:11:47 storage-test-d kernel: [44936.450081] drbd testdata2: 
Terminating drbd_r_testdata
Oct  2 08:11:47 storage-test-d kernel: [44936.450104] block drbd11: 
disk( UpToDate -> Failed )
Oct  2 08:11:47 storage-test-d kernel: [44936.514071] block drbd11: 
bitmap WRITE of 0 pages took 0 jiffies
Oct  2 08:11:47 storage-test-d kernel: [44936.514078] block drbd11: 0 KB 
(0 bits) marked out-of-sync by on disk bit-map.
Oct  2 08:11:47 storage-test-d kernel: [44936.514088] block drbd11: 
disk( Failed -> Diskless )
Oct  2 08:11:47 storage-test-d kernel: [44936.514793] block drbd11: 
drbd_bm_resize called with capacity == 0
Oct  2 08:11:47 storage-test-d kernel: [44936.515461] drbd testdata2: 
Terminating drbd_w_testdata
Oct  2 08:12:16 storage-test-d rsyslogd-2177: imuxsock lost 124 messages 
from pid 2748 due to rate-limiting
Oct  2 08:13:06 storage-test-d kernel: [45016.120378] drbd testdata2: 
Starting worker thread (from drbdsetup-84 [10353])
Oct  2 08:13:06 storage-test-d kernel: [45016.121012] block drbd11: 
disk( Diskless -> Attaching )
Oct  2 08:13:06 storage-test-d kernel: [45016.121812] drbd testdata2: 
Method to ensure write ordering: drain
Oct  2 08:13:06 storage-test-d kernel: [45016.121817] block drbd11: max 
BIO size = 1048576
Oct  2 08:13:06 storage-test-d kernel: [45016.121825] block drbd11: 
drbd_bm_resize called with capacity == 838835128
Oct  2 08:13:06 storage-test-d kernel: [45016.127192] block drbd11: 
resync bitmap: bits=104854391 words=1638350 pages=3200
Oct  2 08:13:06 storage-test-d kernel: [45016.127199] block drbd11: size 
= 400 GB (419417564 KB)
Oct  2 08:13:06 storage-test-d kernel: [45016.321361] block drbd11: 
recounting of set bits took additional 2 jiffies
Oct  2 08:13:06 storage-test-d kernel: [45016.321369] block drbd11: 0 KB 
(0 bits) marked out-of-sync by on disk bit-map.
Oct  2 08:13:06 storage-test-d kernel: [45016.321382] block drbd11: 
disk( Attaching -> UpToDate )
Oct  2 08:13:06 storage-test-d kernel: [45016.321388] block drbd11: 
attached to UUIDs 
28A688FAC06E2662::0EABC2724124755C:0EAAC2724124755C
Oct  2 08:13:06 storage-test-d kernel: [45016.376555] drbd testdata2: 
conn( StandAlone -> Unconnected )
Oct  2 08:13:06 storage-test-d kernel: [45016.376634] drbd testdata2: 
Starting receiver thread (from drbd_w_testdata [10355])
Oct  2 08:13:06 storage-test-d kernel: [45016.376876] drbd testdata2: 
receiver (re)started
Oct  2 08:13:06 storage-test-d kernel: [45016.376897] drbd testdata2: 
conn( Unconnected -> WFConnection )
Oct  2 08:13:07 storage-test-d rsyslogd-2177: imuxsock begins to drop 
messages from pid 2748 due to rate-limiting
Oct  2 08:13:07 storage-test-d kernel: [45016.707045] block drbd11: 
role( Secondary -> Primary )
Oct  2 08:13:07 storage-test-d kernel: [45016.729180] block drbd11: new 
current UUID 
C58090DF57933525:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C
Oct  2 08:13:07 storage-test-d kernel: [45016.876920] drbd testdata2: 
Handshake successful: Agreed network protocol version 101
Oct  2 08:13:07 storage-test-d kernel: [45016.876926] drbd testdata2: 
Agreed to support TRIM on protocol level
Oct  2 08:13:07 storage-test-d kernel: [45016.876999] drbd testdata2: 
conn( WFConnection -> WFReportParams )
Oct  2 08:13:07 storage-test-d kernel: [45016.877013] drbd testdata2: 
Starting asender thread (from drbd_r_testdata [10376])
Oct  2 08:13:07 storage-test-d kernel: [45017.015220] block drbd11: 
drbd_sync_handshake:
Oct  2 08:13:07 storage-test-d kernel: [45017.015228] block drbd11: self 
C58090DF57933525:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C 
bits:0 flags:0
Oct  2 08:13:07 storage-test-d kernel: [45017.015234] block drbd11: peer 
7F282664519D49A1:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C 
bits:0 flags:0
Oct  2 08:13:07 storage-test-d kernel: [45017.015239] block drbd11: 
uuid_compare()=100 by rule 90
Oct  2 08:13:07 storage-test-d kernel: [45017.015247] block drbd11: 
helper command: /sbin/drbdadm initial-split-brai

Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-01 Thread Digimer

On 01/10/14 02:21 PM, Felix Zachlod wrote:

Hello!

I'm currently experimenting with how a good DRBD Dual Primary setup can be
achieved with Pacemaker. I know all of the "you have to have good
fencing in place" things ... that is just what I'm currently trying to
test in my setup, besides other things.


Good fencing prevents split-brains. You really do need it, and now is 
the time to get it working.



But even without a node crashing or the link dropping, I already have the
problem that I always run into a split-brain situation when a node comes
up that was e.g. in standby before.


Can you configure fencing, then reproduce? It will make debugging 
easier. Without fencing, things (drbd and pacemaker included) behave in 
somewhat unpredictable ways. With fencing, it should be easier to 
isolate the cause of the break.
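
For what it's worth, the usual way to tie DRBD's resource-level fencing into 
Pacemaker (in addition to configuring a STONITH device in the cluster itself) 
is via the handler scripts shipped with drbd-utils. A rough sketch for one 
resource, assuming the stock LINBIT scripts and the DRBD 8.4 section layout:

  resource testdata1 {
    disk {
      fencing resource-and-stonith;  # suspend I/O and fence the peer on connection loss
    }
    handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";            # bans promotion of the outdated peer via a constraint
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; # removes that constraint once resync completes
    }
    ...
  }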



For example: I have both nodes running, connected, both primary;
everything is fine. I put one node into standby and DRBD is stopped on
this node.

I do some work, reboot the server and so on; finally I try to rejoin the
node to the cluster. Pacemaker starts all resources and finally
DRBD drops the connection, informing me about a split brain.

In the log this looks like:

Oct  1 19:44:42 storage-test-d kernel: [  111.138512] block drbd10:
disk( Diskless -> Attaching )
Oct  1 19:44:42 storage-test-d kernel: [  111.139283] drbd testdata1:
Method to ensure write ordering: drain
Oct  1 19:44:42 storage-test-d kernel: [  111.139288] block drbd10: max
BIO size = 1048576
Oct  1 19:44:42 storage-test-d kernel: [  111.139296] block drbd10:
drbd_bm_resize called with capacity == 838835128
Oct  1 19:44:42 storage-test-d kernel: [  111.144488] block drbd10:
resync bitmap: bits=104854391 words=1638350 pages=3200
Oct  1 19:44:42 storage-test-d kernel: [  111.144494] block drbd10: size
= 400 GB (419417564 KB)
Oct  1 19:44:42 storage-test-d kernel: [  111.289327] block drbd10:
recounting of set bits took additional 3 jiffies
Oct  1 19:44:42 storage-test-d kernel: [  111.289334] block drbd10: 0 KB
(0 bits) marked out-of-sync by on disk bit-map.
Oct  1 19:44:42 storage-test-d kernel: [  111.289346] block drbd10:
disk( Attaching -> UpToDate )
Oct  1 19:44:42 storage-test-d kernel: [  111.289352] block drbd10:
attached to UUIDs
A41D74E79299A144::86B0140AA1A527C0:86AF140AA1A527C1
Oct  1 19:44:42 storage-test-d kernel: [  111.321564] drbd testdata2:
conn( StandAlone -> Unconnected )
Oct  1 19:44:42 storage-test-d kernel: [  111.321628] drbd testdata2:
Starting receiver thread (from drbd_w_testdata [3211])
Oct  1 19:44:42 storage-test-d kernel: [  111.321794] drbd testdata2:
receiver (re)started
Oct  1 19:44:42 storage-test-d kernel: [  111.321822] drbd testdata2:
conn( Unconnected -> WFConnection )
Oct  1 19:44:42 storage-test-d kernel: [  111.337708] drbd testdata1:
conn( StandAlone -> Unconnected )
Oct  1 19:44:42 storage-test-d kernel: [  111.337764] drbd testdata1:
Starting receiver thread (from drbd_w_testdata [3215])
Oct  1 19:44:42 storage-test-d kernel: [  111.337904] drbd testdata1:
receiver (re)started
Oct  1 19:44:42 storage-test-d kernel: [  111.337927] drbd testdata1:
conn( Unconnected -> WFConnection )
Oct  1 19:44:43 storage-test-d kernel: [  111.808897] block drbd10:
role( Secondary -> Primary )
Oct  1 19:44:43 storage-test-d kernel: [  111.810883] block drbd11:
role( Secondary -> Primary )
Oct  1 19:44:43 storage-test-d kernel: [  111.820040] drbd testdata2:
Handshake successful: Agreed network protocol version 101
Oct  1 19:44:43 storage-test-d kernel: [  111.820046] drbd testdata2:
Agreed to support TRIM on protocol level
Oct  1 19:44:43 storage-test-d kernel: [  111.823292] block drbd10: new
current UUID
8369EB6F395C0D29:A41D74E79299A144:86B0140AA1A527C0:86AF140AA1A527C1
Oct  1 19:44:43 storage-test-d kernel: [  111.836096] drbd testdata1:
Handshake successful: Agreed network protocol version 101
Oct  1 19:44:43 storage-test-d kernel: [  111.836108] drbd testdata1:
Agreed to support TRIM on protocol level
Oct  1 19:44:43 storage-test-d kernel: [  111.848917] block drbd11: new
current UUID
69A056C665A38F35:C8B4320C2FE11A0C:D13C0AA6DC58CC8C:D13B0AA6DC58CC8D
Oct  1 19:44:43 storage-test-d kernel: [  111.871100] drbd testdata2:
conn( WFConnection -> WFReportParams )
Oct  1 19:44:43 storage-test-d kernel: [  111.871108] drbd testdata2:
Starting asender thread (from drbd_r_testdata [3249])
Oct  1 19:44:43 storage-test-d kernel: [  111.909687] drbd testdata1:
conn( WFConnection -> WFReportParams )
Oct  1 19:44:43 storage-test-d kernel: [  111.909695] drbd testdata1:
Starting asender thread (from drbd_r_testdata [3270])
Oct  1 19:44:43 storage-test-d kernel: [  111.943986] drbd testdata2:
meta connection shut down by peer.
Oct  1 19:44:43 storage-test-d kernel: [  111.944063] drbd testdata2:
conn( WFReportParams -> NetworkFailure )
Oct  1 19:44:43 storage-test-d kernel: [  111.944067] drbd testdata2:
asender terminated
Oct  1 19:44:43 storage-test-d kernel: [  111.944070] drbd testdat

[Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-01 Thread Felix Zachlod

Hello!

I'm currently experimenting with how a good DRBD Dual Primary setup can be 
achieved with Pacemaker. I know all of the "you have to have good 
fencing in place" things ... that is just what I'm currently trying to 
test in my setup, besides other things.
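
For reference, the Pacemaker side of such a setup is typically the drbd 
master/slave resource with master-max=2, roughly along these lines (names and 
monitor intervals are illustrative):

  primitive p_drbd_testdata1 ocf:linbit:drbd \
          params drbd_resource="testdata1" \
          op monitor role="Master" interval="29s" \
          op monitor role="Slave" interval="31s"
  ms ms_drbd_testdata1 p_drbd_testdata1 \
          meta master-max="2" clone-max="2" notify="true" interleave="true"  # master-max=2 -> dual primary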


But even without a node crashing or the link dropping, I already have the 
problem that I always run into a split-brain situation when a node comes 
up that was e.g. in standby before.


For example: I have both nodes running, connected, both primary; 
everything is fine. I put one node into standby and DRBD is stopped on 
this node.
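
The standby/online cycle here is just the usual crm node commands, roughly:

  crm node standby storage-test-d   # take the node out of service (node name from the logs, for illustration)
  # ... maintenance, reboot ...
  crm node online storage-test-d    # let it rejoin the cluster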


I do some work, reboot the server and so on; finally I try to rejoin the 
node to the cluster. Pacemaker starts all resources and finally 
DRBD drops the connection, informing me about a split brain.


In the log this looks like:

Oct  1 19:44:42 storage-test-d kernel: [  111.138512] block drbd10: 
disk( Diskless -> Attaching )
Oct  1 19:44:42 storage-test-d kernel: [  111.139283] drbd testdata1: 
Method to ensure write ordering: drain
Oct  1 19:44:42 storage-test-d kernel: [  111.139288] block drbd10: max 
BIO size = 1048576
Oct  1 19:44:42 storage-test-d kernel: [  111.139296] block drbd10: 
drbd_bm_resize called with capacity == 838835128
Oct  1 19:44:42 storage-test-d kernel: [  111.144488] block drbd10: 
resync bitmap: bits=104854391 words=1638350 pages=3200
Oct  1 19:44:42 storage-test-d kernel: [  111.144494] block drbd10: size 
= 400 GB (419417564 KB)
Oct  1 19:44:42 storage-test-d kernel: [  111.289327] block drbd10: 
recounting of set bits took additional 3 jiffies
Oct  1 19:44:42 storage-test-d kernel: [  111.289334] block drbd10: 0 KB 
(0 bits) marked out-of-sync by on disk bit-map.
Oct  1 19:44:42 storage-test-d kernel: [  111.289346] block drbd10: 
disk( Attaching -> UpToDate )
Oct  1 19:44:42 storage-test-d kernel: [  111.289352] block drbd10: 
attached to UUIDs 
A41D74E79299A144::86B0140AA1A527C0:86AF140AA1A527C1
Oct  1 19:44:42 storage-test-d kernel: [  111.321564] drbd testdata2: 
conn( StandAlone -> Unconnected )
Oct  1 19:44:42 storage-test-d kernel: [  111.321628] drbd testdata2: 
Starting receiver thread (from drbd_w_testdata [3211])
Oct  1 19:44:42 storage-test-d kernel: [  111.321794] drbd testdata2: 
receiver (re)started
Oct  1 19:44:42 storage-test-d kernel: [  111.321822] drbd testdata2: 
conn( Unconnected -> WFConnection )
Oct  1 19:44:42 storage-test-d kernel: [  111.337708] drbd testdata1: 
conn( StandAlone -> Unconnected )
Oct  1 19:44:42 storage-test-d kernel: [  111.337764] drbd testdata1: 
Starting receiver thread (from drbd_w_testdata [3215])
Oct  1 19:44:42 storage-test-d kernel: [  111.337904] drbd testdata1: 
receiver (re)started
Oct  1 19:44:42 storage-test-d kernel: [  111.337927] drbd testdata1: 
conn( Unconnected -> WFConnection )
Oct  1 19:44:43 storage-test-d kernel: [  111.808897] block drbd10: 
role( Secondary -> Primary )
Oct  1 19:44:43 storage-test-d kernel: [  111.810883] block drbd11: 
role( Secondary -> Primary )
Oct  1 19:44:43 storage-test-d kernel: [  111.820040] drbd testdata2: 
Handshake successful: Agreed network protocol version 101
Oct  1 19:44:43 storage-test-d kernel: [  111.820046] drbd testdata2: 
Agreed to support TRIM on protocol level
Oct  1 19:44:43 storage-test-d kernel: [  111.823292] block drbd10: new 
current UUID 
8369EB6F395C0D29:A41D74E79299A144:86B0140AA1A527C0:86AF140AA1A527C1
Oct  1 19:44:43 storage-test-d kernel: [  111.836096] drbd testdata1: 
Handshake successful: Agreed network protocol version 101
Oct  1 19:44:43 storage-test-d kernel: [  111.836108] drbd testdata1: 
Agreed to support TRIM on protocol level
Oct  1 19:44:43 storage-test-d kernel: [  111.848917] block drbd11: new 
current UUID 
69A056C665A38F35:C8B4320C2FE11A0C:D13C0AA6DC58CC8C:D13B0AA6DC58CC8D
Oct  1 19:44:43 storage-test-d kernel: [  111.871100] drbd testdata2: 
conn( WFConnection -> WFReportParams )
Oct  1 19:44:43 storage-test-d kernel: [  111.871108] drbd testdata2: 
Starting asender thread (from drbd_r_testdata [3249])
Oct  1 19:44:43 storage-test-d kernel: [  111.909687] drbd testdata1: 
conn( WFConnection -> WFReportParams )
Oct  1 19:44:43 storage-test-d kernel: [  111.909695] drbd testdata1: 
Starting asender thread (from drbd_r_testdata [3270])
Oct  1 19:44:43 storage-test-d kernel: [  111.943986] drbd testdata2: 
meta connection shut down by peer.
Oct  1 19:44:43 storage-test-d kernel: [  111.944063] drbd testdata2: 
conn( WFReportParams -> NetworkFailure )
Oct  1 19:44:43 storage-test-d kernel: [  111.944067] drbd testdata2: 
asender terminated
Oct  1 19:44:43 storage-test-d kernel: [  111.944070] drbd testdata2: 
Terminating drbd_a_testdata
Oct  1 19:44:43 storage-test-d kernel: [  111.988005] drbd testdata1: 
meta connection shut down by peer.
Oct  1 19:44:43 storage-test-d kernel: [  111.988089] drbd testdata1: 
conn( WFReportParams -> NetworkFailure )
Oct  1 19:44:43 storage-test-d kernel: [  111.988094] drbd testdata1: 
asender terminated