Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-07 Thread Felix Zachlod

Hello Andrew,

On 06.10.2014 04:30, Andrew Beekhof wrote:


On 3 Oct 2014, at 5:07 am, Felix Zachlod fz.li...@sis-gmbh.info wrote:


On 02.10.2014 18:02, Digimer wrote:

On 02/10/14 02:44 AM, Felix Zachlod wrote:

I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7


Please upgrade to 1.1.10+!



Are you referring to a specific bug or code change? I normally don't like building
all this stuff from source instead of using packages unless there are very
good reasons for it. I have been running some 1.1.7 Debian-based Pacemaker clusters
for a long time now without any issue, and this version appears to run very
stably, so as long as I am not facing a specific problem with this version


According to git, there are 1143 specific problems with 1.1.7
In total there have been 3815 commits and 5 releases in the last 2.5 years, we 
don't do all that for fun :-)


I know that there have been a lot of changes since this ancient version.
But I was just curious whether there was something in particular that might
be related to my problem. I work closely with software development
in our company, so I know that newer does not automatically mean
fewer bugs, or especially fewer of the bugs that concern ME. That's why
I suspect installing the recent version to be trial and error, which
might well help in some cases but does not shed any light on the underlying
problem.



On the other hand, if both sides think they have up-to-date data it might not 
be anything to do with pacemaker at all.


That is what I suspect too, and why I passed this question on to the drbd
mailing list. I am now almost completely convinced that pacemaker isn't
doing anything wrong here, because the drbd RA sets a master score of 1000
on either side, which according to my constraints was the signal for
pacemaker to promote.
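
For anyone who wants to verify this on their own cluster, something like the
following shows those scores (ms_drbd_testdata2 is just a placeholder for the
master/slave resource name):

  crm_mon -A1       # one-shot status including the master-* node attributes set by the RA
  crm_simulate -sL  # promotion scores the policy engine calculates from the live CIB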


regards, Felix

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-07 Thread Andrew Beekhof

On 8 Oct 2014, at 9:20 am, Felix Zachlod fz.li...@sis-gmbh.info wrote:

 Hello Andrew,
 
 On 06.10.2014 04:30, Andrew Beekhof wrote:
 
 On 3 Oct 2014, at 5:07 am, Felix Zachlod fz.li...@sis-gmbh.info wrote:
 
 On 02.10.2014 18:02, Digimer wrote:
 On 02/10/14 02:44 AM, Felix Zachlod wrote:
 I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
 
 Please upgrade to 1.1.10+!
 
 
 Are you referring to a specific bug or code change? I normally don't like
 building all this stuff from source instead of using packages unless there are
 very good reasons for it. I have been running some 1.1.7 Debian-based Pacemaker
 clusters for a long time now without any issue, and this version appears to
 run very stably, so as long as I am not facing a specific
 problem with this version
 
 According to git, there are 1143 specific problems with 1.1.7
 In total there have been 3815 commits and 5 releases in the last 2.5 years, 
 we don't do all that for fun :-)
 
 I know that there have been a lot of changes since this ancient version. But I
 was just curious whether there was something in particular that might be related
 to my problem. I work closely with software development in our company,
 so I know that newer does not automatically mean fewer bugs, or
 especially fewer of the bugs that concern ME.

Particularly where the policy engine is concerned, newer actually does mean fewer
bugs, thanks to the 500+ regression tests we have.
Also, there have definitely been improvements to master/slave in the last few 
releases.

Check out the release notes; that's where I try to highlight the more
interesting/important fixes.

 That's why I suspect installing the recent version to be trial and error, which
 might well help in some cases but does not shed any light on the underlying
 problem.
 
 On the other hand, if both sides think they have up-to-date data it might 
 not be anything to do with pacemaker at all.
 
 That is what I suspect too, and why I passed this question on to the drbd
 mailing list. I am now almost completely convinced that pacemaker isn't doing
 anything wrong here, because the drbd RA sets a master score of 1000 on either
 side, which according to my constraints was the signal for pacemaker to promote.
 
 regards, Felix
 





Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-05 Thread Andrew Beekhof

On 3 Oct 2014, at 5:07 am, Felix Zachlod fz.li...@sis-gmbh.info wrote:

 On 02.10.2014 18:02, Digimer wrote:
 On 02/10/14 02:44 AM, Felix Zachlod wrote:
 I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
 
 Please upgrade to 1.1.10+!
 
 
 Are you referring to a specific bug or code change? I normally don't like
 building all this stuff from source instead of using packages unless there are
 very good reasons for it. I have been running some 1.1.7 Debian-based Pacemaker
 clusters for a long time now without any issue, and this version appears
 to run very stably, so as long as I am not facing a specific problem with this
 version

According to git, there are 1143 specific problems with 1.1.7
In total there have been 3815 commits and 5 releases in the last 2.5 years, we 
don't do all that for fun :-)

Also, since our resources are severely constrained, getting people onto something
recent helps us focus our efforts on a limited number of recent releases (of which
1.1.7 isn't one).
It's great when something older is working for people, but we generally leave
long-term support to vendors like Red Hat and SUSE.

On the other hand, if both sides think they have up-to-date data it might not 
be anything to do with pacemaker at all.

 I'd prefer sticking with it rather than putting brand new stuff together from
 source, which might face other compatibility issues later on.
 
 
 I am nearly sure I have found a hint about the problem:
 
 adjust_master_score (string, [5 10 1000 1]): master score adjustments
Space separated list of four master score adjustments for different 
 scenarios:
 - only access to 'consistent' data
 - only remote access to 'uptodate' data
 - currently Secondary, local access to 'uptodate' data, but remote is 
 unknown
 
 This is from the drbd resource agent's meta data.
 
 As you can see, the RA will report a master score of 1000 if it is Secondary
 and (thinks) it has up-to-date data. According to the logs it is indeed
 reporting 1000... I set a location rule with a score of -1001 for the Master
 role, and now Pacemaker waits to promote the nodes to Master until the
 next monitor action, when it notices that the nodes are connected and synced
 and report a MS of 1. What is interesting to me is
 
 a) Why do both drbd nodes think they have up-to-date data when coming back
 online? At least one should know that it was disconnected while another
 node was still up and consider that the data might have been changed in the
 meantime. And in case I am rebooting a single node, it can almost be sure that
 it has only consistent data, because the other side was still Primary when
 this one was shut down.
 
 b) Why does apparently nobody else face this problem, given that it should
 behave like this in any Primary/Primary cluster?
 
 but I think I will try passing this on to the drbd mailing list too.
 
 regards, Felix
 
 
 





Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Felix Zachlod

On 01.10.2014 20:46, Digimer wrote:


At some point along the way, both nodes were Primary while not
connected, even if for just a moment. Your log snippet above shows the
results of this break; they do not appear to speak to the break itself.


Even easier to reproduce is the problem when I stop a drbd
resource and later restart it: this always leads to a split brain.
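
Recovering from such a split brain means roughly the usual manual procedure,
something like this on the node whose changes are to be thrown away (a sketch;
testdata2 stands for the affected resource):

  drbdadm disconnect testdata2
  drbdadm secondary testdata2
  drbdadm connect --discard-my-data testdata2

plus a plain 'drbdadm connect testdata2' on the surviving node if it is not
already waiting in WFConnection.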


this is the log from the one side:


Oct  2 08:11:46 storage-test-d kernel: [44936.343453] drbd testdata2: 
asender terminated
Oct  2 08:11:46 storage-test-d kernel: [44936.343457] drbd testdata2: 
Terminating drbd_a_testdata
Oct  2 08:11:46 storage-test-d kernel: [44936.362103] drbd testdata2: 
conn( TearDown -> Disconnecting )
Oct  2 08:11:47 storage-test-d kernel: [44936.450052] drbd testdata2: 
Connection closed
Oct  2 08:11:47 storage-test-d kernel: [44936.450070] drbd testdata2: 
conn( Disconnecting -> StandAlone )
Oct  2 08:11:47 storage-test-d kernel: [44936.450074] drbd testdata2: 
receiver terminated
Oct  2 08:11:47 storage-test-d kernel: [44936.450081] drbd testdata2: 
Terminating drbd_r_testdata
Oct  2 08:11:47 storage-test-d kernel: [44936.450104] block drbd11: 
disk( UpToDate -> Failed )
Oct  2 08:11:47 storage-test-d kernel: [44936.514071] block drbd11: 
bitmap WRITE of 0 pages took 0 jiffies
Oct  2 08:11:47 storage-test-d kernel: [44936.514078] block drbd11: 0 KB 
(0 bits) marked out-of-sync by on disk bit-map.
Oct  2 08:11:47 storage-test-d kernel: [44936.514088] block drbd11: 
disk( Failed -> Diskless )
Oct  2 08:11:47 storage-test-d kernel: [44936.514793] block drbd11: 
drbd_bm_resize called with capacity == 0
Oct  2 08:11:47 storage-test-d kernel: [44936.515461] drbd testdata2: 
Terminating drbd_w_testdata
Oct  2 08:12:16 storage-test-d rsyslogd-2177: imuxsock lost 124 messages 
from pid 2748 due to rate-limiting
Oct  2 08:13:06 storage-test-d kernel: [45016.120378] drbd testdata2: 
Starting worker thread (from drbdsetup-84 [10353])
Oct  2 08:13:06 storage-test-d kernel: [45016.121012] block drbd11: 
disk( Diskless -> Attaching )
Oct  2 08:13:06 storage-test-d kernel: [45016.121812] drbd testdata2: 
Method to ensure write ordering: drain
Oct  2 08:13:06 storage-test-d kernel: [45016.121817] block drbd11: max 
BIO size = 1048576
Oct  2 08:13:06 storage-test-d kernel: [45016.121825] block drbd11: 
drbd_bm_resize called with capacity == 838835128
Oct  2 08:13:06 storage-test-d kernel: [45016.127192] block drbd11: 
resync bitmap: bits=104854391 words=1638350 pages=3200
Oct  2 08:13:06 storage-test-d kernel: [45016.127199] block drbd11: size 
= 400 GB (419417564 KB)
Oct  2 08:13:06 storage-test-d kernel: [45016.321361] block drbd11: 
recounting of set bits took additional 2 jiffies
Oct  2 08:13:06 storage-test-d kernel: [45016.321369] block drbd11: 0 KB 
(0 bits) marked out-of-sync by on disk bit-map.
Oct  2 08:13:06 storage-test-d kernel: [45016.321382] block drbd11: 
disk( Attaching -> UpToDate )
Oct  2 08:13:06 storage-test-d kernel: [45016.321388] block drbd11: 
attached to UUIDs 
28A688FAC06E2662::0EABC2724124755C:0EAAC2724124755C
Oct  2 08:13:06 storage-test-d kernel: [45016.376555] drbd testdata2: 
conn( StandAlone -> Unconnected )
Oct  2 08:13:06 storage-test-d kernel: [45016.376634] drbd testdata2: 
Starting receiver thread (from drbd_w_testdata [10355])
Oct  2 08:13:06 storage-test-d kernel: [45016.376876] drbd testdata2: 
receiver (re)started
Oct  2 08:13:06 storage-test-d kernel: [45016.376897] drbd testdata2: 
conn( Unconnected -> WFConnection )
Oct  2 08:13:07 storage-test-d rsyslogd-2177: imuxsock begins to drop 
messages from pid 2748 due to rate-limiting
Oct  2 08:13:07 storage-test-d kernel: [45016.707045] block drbd11: 
role( Secondary -> Primary )
Oct  2 08:13:07 storage-test-d kernel: [45016.729180] block drbd11: new 
current UUID 
C58090DF57933525:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C
Oct  2 08:13:07 storage-test-d kernel: [45016.876920] drbd testdata2: 
Handshake successful: Agreed network protocol version 101
Oct  2 08:13:07 storage-test-d kernel: [45016.876926] drbd testdata2: 
Agreed to support TRIM on protocol level
Oct  2 08:13:07 storage-test-d kernel: [45016.876999] drbd testdata2: 
conn( WFConnection -> WFReportParams )
Oct  2 08:13:07 storage-test-d kernel: [45016.877013] drbd testdata2: 
Starting asender thread (from drbd_r_testdata [10376])
Oct  2 08:13:07 storage-test-d kernel: [45017.015220] block drbd11: 
drbd_sync_handshake:
Oct  2 08:13:07 storage-test-d kernel: [45017.015228] block drbd11: self 
C58090DF57933525:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C 
bits:0 flags:0
Oct  2 08:13:07 storage-test-d kernel: [45017.015234] block drbd11: peer 
7F282664519D49A1:28A688FAC06E2662:0EABC2724124755C:0EAAC2724124755C 
bits:0 flags:0
Oct  2 08:13:07 storage-test-d kernel: [45017.015239] block drbd11: 
uuid_compare()=100 by rule 90
Oct  2 08:13:07 storage-test-d kernel: [45017.015247] block drbd11: 
helper command: /sbin/drbdadm initial-split-brain 

Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Felix Zachlod

On 02.10.2014 08:44, Felix Zachlod wrote:

On 01.10.2014 20:46, Digimer wrote:


At some point along the way, both nodes were Primary while not
connected, even if for just a moment. Your log snippet above shows the
results of this break; they do not appear to speak to the break itself.


Even easier to reproduce is the problem when I stop a drbd
resource and later restart it: this always leads to a split brain.


And another thing to add which might be related: I just tried to
configure the resource's target-role to Started or Slave.


But both sides stay in Master state... which is unexpected to me too.

regards, Felix



Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Felix Zachlod

On 02.10.2014 09:01, Felix Zachlod wrote:

On 02.10.2014 08:44, Felix Zachlod wrote:

On 01.10.2014 20:46, Digimer wrote:


At some point along the way, both nodes were Primary while not
connected, even if for just a moment. Your log snippet above shows the
results of this break; they do not appear to speak to the break itself.


Even easier to reproduce is the problem when I stop a drbd
resource and later restart it: this always leads to a split brain.


And another thing to add which might be related: I just tried to
configure the resource's target-role to Started or Slave.

But both sides stay in Master state... which is unexpected to me too.


Which is wrong again... sorry for that. If I configure Started they
stay in Master, and if I configure Slave they stay in Slave. If you stop
the resource with

crm resource stop

it reconfigures target-role to Stopped, and if you run

crm resource start

it configures target-role to Started, which lets both sides come up in Primary.
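
So as far as I can tell the behaviour maps onto the target-role meta attribute
roughly like this (a sketch; ms_drbd_testdata2 is a placeholder for the
master/slave resource name):

  crm resource meta ms_drbd_testdata2 set target-role Stopped  # what 'crm resource stop' does: demote and stop
  crm resource meta ms_drbd_testdata2 set target-role Slave    # demote only, both sides stay Secondary
  crm resource meta ms_drbd_testdata2 set target-role Started  # what 'crm resource start' does: both sides may be promoted again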


regards again.



Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Digimer

On 02/10/14 02:44 AM, Felix Zachlod wrote:

I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7


Please upgrade to 1.1.10+!

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?




Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-02 Thread Felix Zachlod

On 02.10.2014 18:02, Digimer wrote:

On 02/10/14 02:44 AM, Felix Zachlod wrote:

I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7


Please upgrade to 1.1.10+!



Are you referring to a specific bug or code change? I normally don't like
building all this stuff from source instead of using packages unless there
are very good reasons for it. I have been running some 1.1.7 Debian-based
Pacemaker clusters for a long time now without any issue, and this version
appears to run very stably, so as long as I am not facing a
specific problem with this version I'd prefer sticking with it rather than
putting brand new stuff together from source, which might face other
compatibility issues later on.



I am nearly sure I have found a hint about the problem:

adjust_master_score (string, [5 10 1000 1]): master score adjustments
Space separated list of four master score adjustments for different 
scenarios:

 - only access to 'consistent' data
 - only remote access to 'uptodate' data
 - currently Secondary, local access to 'uptodate' data, but remote 
is unknown


This is from the drbd resource agent's meta data.
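
If I read that correctly, the values could be tuned on the primitive itself,
roughly like this (a sketch in crm syntax; the resource name is a placeholder
from my test setup and the four score values are only illustrative, not the
defaults):

  primitive p_drbd_testdata2 ocf:linbit:drbd \
      params drbd_resource="testdata2" adjust_master_score="0 10 1000 10000" \
      op monitor interval="29s" role="Master" \
      op monitor interval="31s" role="Slave"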

As you can see, the RA will report a master score of 1000 if it is
Secondary and (thinks) it has up-to-date data. According to the logs it
is indeed reporting 1000... I set a location rule with a score of -1001
for the Master role (roughly sketched further below), and now Pacemaker
waits to promote the nodes to Master until the next monitor action, when
it notices that the nodes are connected and synced and report a MS of 1.
What is interesting to me is


a) Why do both drbd nodes think they have up-to-date data when coming back
online? At least one should know that it was disconnected while
another node was still up and consider that the data might have been changed
in the meantime. And in case I am rebooting a single node, it can almost
be sure that it has only consistent data, because the other side was
still Primary when this one was shut down.


b) Why does apparently nobody else face this problem, given that it should
behave like this in any Primary/Primary cluster?


but I think I will try passing this on to the drbd mailing list too.
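
For reference, the location rule mentioned above looks roughly like this in
crm syntax (a sketch from memory; ms_drbd_testdata2 is a placeholder for the
master/slave resource):

  location loc_drbd_wait_sync ms_drbd_testdata2 \
      rule $role="Master" -1001: defined #uname

i.e. it effectively takes 1001 off the promotion preference on every node, so
the RA's score of 1000 alone is no longer enough to get promoted.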

regards, Felix





[Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-01 Thread Felix Zachlod

Hello!

I'm currently experimenting with how a good DRBD dual-primary setup can be
achieved with Pacemaker. I know all of the "you have to have good
fencing in place" advice ... that is just what I'm currently trying to
test in my setup, among other things.
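
Roughly sketched, the setup is the usual dual-primary combination (resource and
host details below are placeholders, the real configuration has more options):

  # drbd side (fragment, 8.4 syntax)
  resource testdata1 {
      net {
          protocol C;
          allow-two-primaries yes;
      }
      # disk, on-host and address sections omitted
  }

  # pacemaker side (crm configure)
  primitive p_drbd_testdata1 ocf:linbit:drbd \
      params drbd_resource="testdata1" \
      op monitor interval="29s" role="Master" \
      op monitor interval="31s" role="Slave"
  ms ms_drbd_testdata1 p_drbd_testdata1 \
      meta master-max="2" clone-max="2" notify="true" interleave="true"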


But even without a node crashing or the link dropping, I already have the
problem that I always run into a split-brain situation when a node comes
up that was, e.g., in standby before.


For example: I have both nodes running, connected, both Primary, and
everything is fine. I put one node into standby and DRBD is stopped on
this node.


I do some work, reboot the server and so on; finally I try to rejoin the
node to the cluster. Pacemaker starts all resources, and in the end
DRBD drops the connection, informing me about a split brain.


In the log this looks like:

Oct  1 19:44:42 storage-test-d kernel: [  111.138512] block drbd10: 
disk( Diskless -> Attaching )
Oct  1 19:44:42 storage-test-d kernel: [  111.139283] drbd testdata1: 
Method to ensure write ordering: drain
Oct  1 19:44:42 storage-test-d kernel: [  111.139288] block drbd10: max 
BIO size = 1048576
Oct  1 19:44:42 storage-test-d kernel: [  111.139296] block drbd10: 
drbd_bm_resize called with capacity == 838835128
Oct  1 19:44:42 storage-test-d kernel: [  111.144488] block drbd10: 
resync bitmap: bits=104854391 words=1638350 pages=3200
Oct  1 19:44:42 storage-test-d kernel: [  111.144494] block drbd10: size 
= 400 GB (419417564 KB)
Oct  1 19:44:42 storage-test-d kernel: [  111.289327] block drbd10: 
recounting of set bits took additional 3 jiffies
Oct  1 19:44:42 storage-test-d kernel: [  111.289334] block drbd10: 0 KB 
(0 bits) marked out-of-sync by on disk bit-map.
Oct  1 19:44:42 storage-test-d kernel: [  111.289346] block drbd10: 
disk( Attaching -> UpToDate )
Oct  1 19:44:42 storage-test-d kernel: [  111.289352] block drbd10: 
attached to UUIDs 
A41D74E79299A144::86B0140AA1A527C0:86AF140AA1A527C1
Oct  1 19:44:42 storage-test-d kernel: [  111.321564] drbd testdata2: 
conn( StandAlone -> Unconnected )
Oct  1 19:44:42 storage-test-d kernel: [  111.321628] drbd testdata2: 
Starting receiver thread (from drbd_w_testdata [3211])
Oct  1 19:44:42 storage-test-d kernel: [  111.321794] drbd testdata2: 
receiver (re)started
Oct  1 19:44:42 storage-test-d kernel: [  111.321822] drbd testdata2: 
conn( Unconnected -> WFConnection )
Oct  1 19:44:42 storage-test-d kernel: [  111.337708] drbd testdata1: 
conn( StandAlone -> Unconnected )
Oct  1 19:44:42 storage-test-d kernel: [  111.337764] drbd testdata1: 
Starting receiver thread (from drbd_w_testdata [3215])
Oct  1 19:44:42 storage-test-d kernel: [  111.337904] drbd testdata1: 
receiver (re)started
Oct  1 19:44:42 storage-test-d kernel: [  111.337927] drbd testdata1: 
conn( Unconnected -> WFConnection )
Oct  1 19:44:43 storage-test-d kernel: [  111.808897] block drbd10: 
role( Secondary -> Primary )
Oct  1 19:44:43 storage-test-d kernel: [  111.810883] block drbd11: 
role( Secondary -> Primary )
Oct  1 19:44:43 storage-test-d kernel: [  111.820040] drbd testdata2: 
Handshake successful: Agreed network protocol version 101
Oct  1 19:44:43 storage-test-d kernel: [  111.820046] drbd testdata2: 
Agreed to support TRIM on protocol level
Oct  1 19:44:43 storage-test-d kernel: [  111.823292] block drbd10: new 
current UUID 
8369EB6F395C0D29:A41D74E79299A144:86B0140AA1A527C0:86AF140AA1A527C1
Oct  1 19:44:43 storage-test-d kernel: [  111.836096] drbd testdata1: 
Handshake successful: Agreed network protocol version 101
Oct  1 19:44:43 storage-test-d kernel: [  111.836108] drbd testdata1: 
Agreed to support TRIM on protocol level
Oct  1 19:44:43 storage-test-d kernel: [  111.848917] block drbd11: new 
current UUID 
69A056C665A38F35:C8B4320C2FE11A0C:D13C0AA6DC58CC8C:D13B0AA6DC58CC8D
Oct  1 19:44:43 storage-test-d kernel: [  111.871100] drbd testdata2: 
conn( WFConnection -> WFReportParams )
Oct  1 19:44:43 storage-test-d kernel: [  111.871108] drbd testdata2: 
Starting asender thread (from drbd_r_testdata [3249])
Oct  1 19:44:43 storage-test-d kernel: [  111.909687] drbd testdata1: 
conn( WFConnection -> WFReportParams )
Oct  1 19:44:43 storage-test-d kernel: [  111.909695] drbd testdata1: 
Starting asender thread (from drbd_r_testdata [3270])
Oct  1 19:44:43 storage-test-d kernel: [  111.943986] drbd testdata2: 
meta connection shut down by peer.
Oct  1 19:44:43 storage-test-d kernel: [  111.944063] drbd testdata2: 
conn( WFReportParams -> NetworkFailure )
Oct  1 19:44:43 storage-test-d kernel: [  111.944067] drbd testdata2: 
asender terminated
Oct  1 19:44:43 storage-test-d kernel: [  111.944070] drbd testdata2: 
Terminating drbd_a_testdata
Oct  1 19:44:43 storage-test-d kernel: [  111.988005] drbd testdata1: 
meta connection shut down by peer.
Oct  1 19:44:43 storage-test-d kernel: [  111.988089] drbd testdata1: 
conn( WFReportParams -> NetworkFailure )
Oct  1 19:44:43 storage-test-d kernel: [  111.988094] drbd testdata1: 
asender terminated
Oct  1 

Re: [Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

2014-10-01 Thread Digimer

On 01/10/14 02:21 PM, Felix Zachlod wrote:

Hello!

I'm currently experimenting with how a good DRBD dual-primary setup can be
achieved with Pacemaker. I know all of the "you have to have good
fencing in place" advice ... that is just what I'm currently trying to
test in my setup, among other things.


Good fencing prevents split-brains. You really do need it, and now is 
the time to get it working.



But even without a node crashing or the link dropping, I already have the
problem that I always run into a split-brain situation when a node comes
up that was, e.g., in standby before.


Can you configure fencing, then reproduce? It will make debugging 
easier. Without fencing, things (drbd and pacemaker included) behave in 
somewhat unpredictable ways. With fencing, it should be easier to 
isolate the cause of the break.
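
For DRBD under Pacemaker that normally means something along these lines (a
sketch; the crm-fence-peer scripts ship with DRBD 8.4, and a real STONITH
device still has to be configured on the Pacemaker side):

  # drbd resource (fragment)
  disk {
      fencing resource-and-stonith;
  }
  handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }

  # pacemaker
  crm configure property stonith-enabled=true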



For example: I have both nodes running, connected, both Primary, and
everything is fine. I put one node into standby and DRBD is stopped on
this node.

I do some work, reboot the server and so on; finally I try to rejoin the
node to the cluster. Pacemaker starts all resources, and in the end
DRBD drops the connection, informing me about a split brain.

In the log this looks like:

Oct  1 19:44:42 storage-test-d kernel: [  111.138512] block drbd10:
disk( Diskless -> Attaching )
Oct  1 19:44:42 storage-test-d kernel: [  111.139283] drbd testdata1:
Method to ensure write ordering: drain
Oct  1 19:44:42 storage-test-d kernel: [  111.139288] block drbd10: max
BIO size = 1048576
Oct  1 19:44:42 storage-test-d kernel: [  111.139296] block drbd10:
drbd_bm_resize called with capacity == 838835128
Oct  1 19:44:42 storage-test-d kernel: [  111.144488] block drbd10:
resync bitmap: bits=104854391 words=1638350 pages=3200
Oct  1 19:44:42 storage-test-d kernel: [  111.144494] block drbd10: size
= 400 GB (419417564 KB)
Oct  1 19:44:42 storage-test-d kernel: [  111.289327] block drbd10:
recounting of set bits took additional 3 jiffies
Oct  1 19:44:42 storage-test-d kernel: [  111.289334] block drbd10: 0 KB
(0 bits) marked out-of-sync by on disk bit-map.
Oct  1 19:44:42 storage-test-d kernel: [  111.289346] block drbd10:
disk( Attaching -> UpToDate )
Oct  1 19:44:42 storage-test-d kernel: [  111.289352] block drbd10:
attached to UUIDs
A41D74E79299A144::86B0140AA1A527C0:86AF140AA1A527C1
Oct  1 19:44:42 storage-test-d kernel: [  111.321564] drbd testdata2:
conn( StandAlone -> Unconnected )
Oct  1 19:44:42 storage-test-d kernel: [  111.321628] drbd testdata2:
Starting receiver thread (from drbd_w_testdata [3211])
Oct  1 19:44:42 storage-test-d kernel: [  111.321794] drbd testdata2:
receiver (re)started
Oct  1 19:44:42 storage-test-d kernel: [  111.321822] drbd testdata2:
conn( Unconnected -> WFConnection )
Oct  1 19:44:42 storage-test-d kernel: [  111.337708] drbd testdata1:
conn( StandAlone -> Unconnected )
Oct  1 19:44:42 storage-test-d kernel: [  111.337764] drbd testdata1:
Starting receiver thread (from drbd_w_testdata [3215])
Oct  1 19:44:42 storage-test-d kernel: [  111.337904] drbd testdata1:
receiver (re)started
Oct  1 19:44:42 storage-test-d kernel: [  111.337927] drbd testdata1:
conn( Unconnected -> WFConnection )
Oct  1 19:44:43 storage-test-d kernel: [  111.808897] block drbd10:
role( Secondary -> Primary )
Oct  1 19:44:43 storage-test-d kernel: [  111.810883] block drbd11:
role( Secondary -> Primary )
Oct  1 19:44:43 storage-test-d kernel: [  111.820040] drbd testdata2:
Handshake successful: Agreed network protocol version 101
Oct  1 19:44:43 storage-test-d kernel: [  111.820046] drbd testdata2:
Agreed to support TRIM on protocol level
Oct  1 19:44:43 storage-test-d kernel: [  111.823292] block drbd10: new
current UUID
8369EB6F395C0D29:A41D74E79299A144:86B0140AA1A527C0:86AF140AA1A527C1
Oct  1 19:44:43 storage-test-d kernel: [  111.836096] drbd testdata1:
Handshake successful: Agreed network protocol version 101
Oct  1 19:44:43 storage-test-d kernel: [  111.836108] drbd testdata1:
Agreed to support TRIM on protocol level
Oct  1 19:44:43 storage-test-d kernel: [  111.848917] block drbd11: new
current UUID
69A056C665A38F35:C8B4320C2FE11A0C:D13C0AA6DC58CC8C:D13B0AA6DC58CC8D
Oct  1 19:44:43 storage-test-d kernel: [  111.871100] drbd testdata2:
conn( WFConnection -> WFReportParams )
Oct  1 19:44:43 storage-test-d kernel: [  111.871108] drbd testdata2:
Starting asender thread (from drbd_r_testdata [3249])
Oct  1 19:44:43 storage-test-d kernel: [  111.909687] drbd testdata1:
conn( WFConnection -> WFReportParams )
Oct  1 19:44:43 storage-test-d kernel: [  111.909695] drbd testdata1:
Starting asender thread (from drbd_r_testdata [3270])
Oct  1 19:44:43 storage-test-d kernel: [  111.943986] drbd testdata2:
meta connection shut down by peer.
Oct  1 19:44:43 storage-test-d kernel: [  111.944063] drbd testdata2:
conn( WFReportParams -> NetworkFailure )
Oct  1 19:44:43 storage-test-d kernel: [  111.944067] drbd testdata2:
asender terminated
Oct  1 19:44:43 storage-test-d kernel: [  111.944070] drbd testdata2: