Re: [Pacemaker] Cluster with DRBD : split brain

2012-04-05 Thread Andreas Kurz
On 04/04/2012 03:40 PM, Hugo Deprez wrote:
> Hello,
> 
> thanks for the information.
> I was looking at this page
> http://www.drbd.org/users-guide/s-pacemaker-fencing.html
> I did specify the following handlers :
>  handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
> 
> I disconnected the network cable between the cluster nodes; corosync and DRBD
> use this link.
> 
> I was able to see that the fence script added a constraint :
> 
>  location drbd-fence-by-handler-ms-drbd-supervision ms-drbd-supervision \
> rule $id="drbd-fence-by-handler-rule-ms-drbd-supervision"
> $role="Master" -inf: #uname ne host

As expected, now Pacemaker won't try to promote that DRBD resource on
any node but "host".

> But this resulted in:
> 
> cs:StandAlone ro:Secondary/Unknown ds:UpToDate/Outdated on DRBD.

also expected

> 
> I don't really understand what I should expect from those handlers.
> When cleaning up the errors, I should delete the constraint, right?
> 

The constraint is cleared automatically once the resync is finished -- that is
what the after-resync-target handler does -- after you did the cleanup and
reconnected the resources.
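
For reference, "cleanup and reconnect" is typically just the following -- a
sketch, assuming the DRBD resource behind ms-drbd-supervision is named "r0":

===
crm resource cleanup ms-drbd-supervision
drbdadm connect r0          # on the StandAlone node(s)
===

If DRBD refuses to reconnect because of a real split brain (as in the original
incident in this thread), the usual manual resolution applies -- discard the
changes on one side, then reconnect (the option syntax differs slightly
between DRBD 8.3 and 8.4):

===
# on the node whose changes are to be discarded
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# on the node with the good data
drbdadm connect r0
===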

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> Regards,
> 
> Hugo
> 
> On 26 July 2011 19:27, Digimer wrote:
> 
> On 07/26/2011 11:43 AM, Lars Ellenberg wrote:
> > On Wed, Jul 20, 2011 at 11:36:25AM -0400, Digimer wrote:
> >> On 07/20/2011 11:24 AM, Hugo Deprez wrote:
> >>> Hello Andrew,
> >>>
> >>> in fact DRBD was in standalone mode but the cluster was working :
> >>>
> >>> Here is the syslog of the drbd's split brain :
> >>>
> >>> Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0:
> Handshake
> >>> successful: Agreed network protocol version 91
> >>> Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn(
> >>> WFConnection -> WFReportParams )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting
> >>> asender thread (from drbd0_receiver [23281])
> >>> Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0:
> >>> data-integrity-alg: 
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0:
> >>> drbd_sync_handshake:
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self
> >>> BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F
> >>> bits:75338 flags:0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer
> >>> 8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F
> >>> bits:769 flags:0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0:
> >>> uuid_compare()=100 by rule 90
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0:
> Split-Brain
> >>> detected, dropping connection!
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper
> >>> command: /sbin/drbdadm split-brain minor-0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta
> >>> connection shut down by peer.
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn(
> >>> WFReportParams -> NetworkFailure )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender
> >>> terminated
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0:
> Terminating
> >>> asender thread
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper
> >>> command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn(
> >>> NetworkFailure -> Disconnecting )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error
> >>> receiving ReportState, l: 4!
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0:
> Connection
> >>> closed
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn(
> >>> Disconnecting -> StandAlone )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver
> >>> terminated
> >>> Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0:
> Terminating
> >>> receiver thread
> >>
> >> This was a DRBD split-brain, not a pacemaker split. I think that
> might
> >> have been the source of confusion.
> >>
> >> The split brain occurs when both DRBD nodes lose contact with one
> >> another and then proceed as StandAlone/Primary/UpToDate. To avoid
> this,
> >> configure fencing (stonith) in Pacemaker, then use
> 'crm-fence-peer.sh'
> >> in drbd.conf;
> >>
> >> ===
> >> disk {
> >> fencing resource-and-stonith;
> >> }
> >>
> >> handlers {
> >> outdate-peer "/path/to/crm-fence-peer.sh";

Re: [Pacemaker] Cluster with DRBD : split brain

2012-04-04 Thread Hugo Deprez
Hello,

thanks for the information.
I was looking at this page
http://www.drbd.org/users-guide/s-pacemaker-fencing.html
I did specify the following handlers :
 handlers {
fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
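
For reference, these handlers only fire if a fencing policy is also set in the
disk section of the resource -- a minimal sketch of the combination, with the
resource name "r0" assumed for illustration:

===
resource r0 {
  disk {
    fencing resource-only;   # or resource-and-stonith
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
===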

I disconnected the network cable between the cluster nodes; corosync and DRBD
use this link.

I was able to see that the fence script added a constraint :

 location drbd-fence-by-handler-ms-drbd-supervision ms-drbd-supervision \
rule $id="drbd-fence-by-handler-rule-ms-drbd-supervision"
$role="Master" -inf: #uname ne host

But this resulted in:

cs:StandAlone ro:Secondary/Unknown ds:UpToDate/Outdated on DRBD.

I don't really understand what I should expect from those handlers.
When cleaning up the errors, I should delete the constraint, right?

Regards,

Hugo

On 26 July 2011 19:27, Digimer  wrote:

> On 07/26/2011 11:43 AM, Lars Ellenberg wrote:
> > On Wed, Jul 20, 2011 at 11:36:25AM -0400, Digimer wrote:
> >> On 07/20/2011 11:24 AM, Hugo Deprez wrote:
> >>> Hello Andrew,
> >>>
> >>> in fact DRBD was in standalone mode but the cluster was working :
> >>>
> >>> Here is the syslog of the drbd's split brain :
> >>>
> >>> Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0: Handshake
> >>> successful: Agreed network protocol version 91
> >>> Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn(
> >>> WFConnection -> WFReportParams )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting
> >>> asender thread (from drbd0_receiver [23281])
> >>> Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0:
> >>> data-integrity-alg: 
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0:
> >>> drbd_sync_handshake:
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self
> >>> BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F
> >>> bits:75338 flags:0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer
> >>> 8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F
> >>> bits:769 flags:0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0:
> >>> uuid_compare()=100 by rule 90
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0: Split-Brain
> >>> detected, dropping connection!
> >>> Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper
> >>> command: /sbin/drbdadm split-brain minor-0
> >>> Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta
> >>> connection shut down by peer.
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn(
> >>> WFReportParams -> NetworkFailure )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender
> >>> terminated
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0: Terminating
> >>> asender thread
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper
> >>> command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn(
> >>> NetworkFailure -> Disconnecting )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error
> >>> receiving ReportState, l: 4!
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0: Connection
> >>> closed
> >>> Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn(
> >>> Disconnecting -> StandAlone )
> >>> Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver
> >>> terminated
> >>> Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0: Terminating
> >>> receiver thread
> >>
> >> This was a DRBD split-brain, not a pacemaker split. I think that might
> >> have been the source of confusion.
> >>
> >> The split brain occurs when both DRBD nodes lose contact with one
> >> another and then proceed as StandAlone/Primary/UpToDate. To avoid this,
> >> configure fencing (stonith) in Pacemaker, then use 'crm-fence-peer.sh'
> >> in drbd.conf;
> >>
> >> ===
> >> disk {
> >> fencing resource-and-stonith;
> >> }
> >>
> >> handlers {
> >> outdate-peer "/path/to/crm-fence-peer.sh";
> >> }
> >> ===
> >
> > Thanks, that is basically right.
> > Let me fill in some details, though:
> >
> >> This will tell DRBD to block (resource) and fence (stonith). DRBD will
> >
> > drbd fencing options are "fencing resource-only",
> > and "fencing resource-and-stonith".
> >
> > "resource-only" does *not* block IO while the fencing handler runs.
> >
> > "resource-and-stonith" does block IO.
>
> Ahhh, that's why I was confused. I thought the 'resource' meant the same
> thing in both cases, but had only read the 'resource-and-stonith' section.
>
> >> not resume IO until either the fence script exits with success, or
> >> until an admin types 'drbdadm resume-io <resource>'.
> >
> >
> >> The CRM script simply calls pacemaker and asks it to fence the other
> >> node.
> >
> > No.  It tries to place a constraint forcing the Master role off of any
> > node but the one with the good data.

Re: [Pacemaker] Cluster with DRBD : split brain

2011-07-26 Thread Digimer
On 07/26/2011 11:43 AM, Lars Ellenberg wrote:
> On Wed, Jul 20, 2011 at 11:36:25AM -0400, Digimer wrote:
>> On 07/20/2011 11:24 AM, Hugo Deprez wrote:
>>> Hello Andrew,
>>>
>>> in fact DRBD was in standalone mode but the cluster was working :
>>>
>>> Here is the syslog of the drbd's split brain :
>>>
>>> Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0: Handshake
>>> successful: Agreed network protocol version 91
>>> Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn(
>>> WFConnection -> WFReportParams )
>>> Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting
>>> asender thread (from drbd0_receiver [23281])
>>> Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0:
>>> data-integrity-alg: 
>>> Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0:
>>> drbd_sync_handshake:
>>> Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self
>>> BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F
>>> bits:75338 flags:0
>>> Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer
>>> 8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F
>>> bits:769 flags:0
>>> Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0:
>>> uuid_compare()=100 by rule 90
>>> Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0: Split-Brain
>>> detected, dropping connection!
>>> Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper
>>> command: /sbin/drbdadm split-brain minor-0
>>> Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta
>>> connection shut down by peer.
>>> Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn(
>>> WFReportParams -> NetworkFailure )
>>> Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender
>>> terminated
>>> Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0: Terminating
>>> asender thread
>>> Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper
>>> command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
>>> Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn(
>>> NetworkFailure -> Disconnecting )
>>> Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error
>>> receiving ReportState, l: 4!
>>> Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0: Connection
>>> closed
>>> Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn(
>>> Disconnecting -> StandAlone )
>>> Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver
>>> terminated
>>> Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0: Terminating
>>> receiver thread
>>
>> This was a DRBD split-brain, not a pacemaker split. I think that might
>> have been the source of confusion.
>>
>> The split brain occurs when both DRBD nodes lose contact with one
>> another and then proceed as StandAlone/Primary/UpToDate. To avoid this,
>> configure fencing (stonith) in Pacemaker, then use 'crm-fence-peer.sh'
>> in drbd.conf;
>>
>> ===
>> disk {
>> fencing resource-and-stonith;
>> }
>>
>> handlers {
>> outdate-peer "/path/to/crm-fence-peer.sh";
>> }
>> ===
> 
> Thanks, that is basically right.
> Let me fill in some details, though:
> 
>> This will tell DRBD to block (resource) and fence (stonith). DRBD will
> 
> drbd fencing options are "fencing resource-only",
> and "fencing resource-and-stonith". 
> 
> "resource-only" does *not* block IO while the fencing handler runs.
> 
> "resource-and-stonith" does block IO.

Ahhh, that's why I was confused. I thought the 'resource' meant the same
thing in both cases, but had only read the 'resource-and-stonith' section.

>> not resume IO until either the fence script exits with success, or
>> until an admin types 'drbdadm resume-io <resource>'.
> 
> 
>> The CRM script simply calls pacemaker and asks it to fence the other
>> node.
> 
> No.  It tries to place a constraint forcing the Master role off of any
> node but the one with the good data.

Ok, I thought it was akin to the 'obliterate-peer.sh' script, which
calls 'fence_node'... I made an assumption, which was not correct.

>> When a node has actually failed, then the lost node is fenced. If
>> both nodes are up but disconnected, as you had, then only the fastest
>> node will succeed in calling the fence, and the slower node will be
>> fenced before it can call a fence.
> 
> "fenced" may be "restricted from being/becoming Master" by that fencing
> constraint. Or, if pacemaker decided to do so, actually "shot" by some
> node level fencing agent (stonith).
> 
> All that resource-level fencing by placing some constraint stuff
> obviously only works as long as the cluster communication is still up.
> If not only the drbd replication link had issues, but the cluster
> communication was down as well, then it becomes a bit more complex.

Thanks for the clarity. Today I learned. :)

-- 
Digimer
E-Mail:  digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com

Re: [Pacemaker] Cluster with DRBD : split brain

2011-07-26 Thread Lars Ellenberg
On Wed, Jul 20, 2011 at 11:36:25AM -0400, Digimer wrote:
> On 07/20/2011 11:24 AM, Hugo Deprez wrote:
> > Hello Andrew,
> > 
> > in fact DRBD was in standalone mode but the cluster was working :
> > 
> > Here is the syslog of the drbd's split brain :
> > 
> > Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0: Handshake
> > successful: Agreed network protocol version 91
> > Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn(
> > WFConnection -> WFReportParams )
> > Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting
> > asender thread (from drbd0_receiver [23281])
> > Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0:
> > data-integrity-alg: 
> > Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0:
> > drbd_sync_handshake:
> > Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self
> > BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F
> > bits:75338 flags:0
> > Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer
> > 8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F
> > bits:769 flags:0
> > Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0:
> > uuid_compare()=100 by rule 90
> > Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0: Split-Brain
> > detected, dropping connection!
> > Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper
> > command: /sbin/drbdadm split-brain minor-0
> > Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta
> > connection shut down by peer.
> > Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn(
> > WFReportParams -> NetworkFailure )
> > Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender
> > terminated
> > Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0: Terminating
> > asender thread
> > Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper
> > command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> > Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn(
> > NetworkFailure -> Disconnecting )
> > Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error
> > receiving ReportState, l: 4!
> > Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0: Connection
> > closed
> > Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn(
> > Disconnecting -> StandAlone )
> > Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver
> > terminated
> > Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0: Terminating
> > receiver thread
> 
> This was a DRBD split-brain, not a pacemaker split. I think that might
> have been the source of confusion.
> 
> The split brain occurs when both DRBD nodes lose contact with one
> another and then proceed as StandAlone/Primary/UpToDate. To avoid this,
> configure fencing (stonith) in Pacemaker, then use 'crm-fence-peer.sh'
> in drbd.conf;
> 
> ===
> disk {
> fencing resource-and-stonith;
> }
> 
> handlers {
> outdate-peer "/path/to/crm-fence-peer.sh";
> }
> ===

Thanks, that is basically right.
Let me fill in some details, though:

> This will tell DRBD to block (resource) and fence (stonith). DRBD will

drbd fencing options are "fencing resource-only",
and "fencing resource-and-stonith". 

"resource-only" does *not* block IO while the fencing handler runs.

"resource-and-stonith" does block IO.

> not resume IO until either the fence script exits with success, or
> until an admin types 'drbdadm resume-io <resource>'.


> The CRM script simply calls pacemaker and asks it to fence the other
> node.

No.  It tries to place a constraint forcing the Master role off of any
node but the one with the good data.
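
To illustrate, the constraint it places looks like the one quoted earlier in
this thread:

===
location drbd-fence-by-handler-ms-drbd-supervision ms-drbd-supervision \
        rule $id="drbd-fence-by-handler-rule-ms-drbd-supervision" \
        $role="Master" -inf: #uname ne host
===

Normally crm-unfence-peer.sh removes it after the resync; should it ever get
stuck, it can be inspected and removed by hand with the crm shell:

===
crm configure show | grep drbd-fence
crm configure delete drbd-fence-by-handler-ms-drbd-supervision
===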

> When a node has actually failed, then the lost node is fenced. If
> both nodes are up but disconnected, as you had, then only the fastest
> node will succeed in calling the fence, and the slower node will be
> fenced before it can call a fence.

"fenced" may be "restricted from being/becoming Master" by that fencing
constraint. Or, if pacemaker decided to do so, actually "shot" by some
node level fencing agent (stonith).

All that resource-level fencing by placing some constraint stuff
obviously only works as long as the cluster communication is still up.
If not only the drbd replication link had issues, but the cluster
communication was down as well, then it becomes a bit more complex.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster with DRBD : split brain

2011-07-20 Thread Digimer
On 07/20/2011 11:24 AM, Hugo Deprez wrote:
> Hello Andrew,
> 
> in fact DRBD was in standalone mode but the cluster was working :
> 
> Here is the syslog of the drbd's split brain :
> 
> Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0: Handshake
> successful: Agreed network protocol version 91
> Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn(
> WFConnection -> WFReportParams )
> Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting
> asender thread (from drbd0_receiver [23281])
> Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0:
> data-integrity-alg: 
> Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0:
> drbd_sync_handshake:
> Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self
> BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F
> bits:75338 flags:0
> Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer
> 8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F
> bits:769 flags:0
> Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0:
> uuid_compare()=100 by rule 90
> Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0: Split-Brain
> detected, dropping connection!
> Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper
> command: /sbin/drbdadm split-brain minor-0
> Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta
> connection shut down by peer.
> Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn(
> WFReportParams -> NetworkFailure )
> Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender
> terminated
> Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0: Terminating
> asender thread
> Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper
> command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
> Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn(
> NetworkFailure -> Disconnecting )
> Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error
> receiving ReportState, l: 4!
> Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0: Connection
> closed
> Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn(
> Disconnecting -> StandAlone )
> Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver
> terminated
> Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0: Terminating
> receiver thread

This was a DRBD split-brain, not a pacemaker split. I think that might
have been the source of confusion.

The split brain occurs when both DRBD nodes lose contact with one
another and then proceed as StandAlone/Primary/UpToDate. To avoid this,
configure fencing (stonith) in Pacemaker, then use 'crm-fence-peer.sh'
in drbd.conf;

===
disk {
fencing resource-and-stonith;
}

handlers {
outdate-peer "/path/to/crm-fence-peer.sh";
}
===

This will tell DRBD to block (resource) and fence (stonith). DRBD will
not resume IO until either the fence script exits with success, or
until an admin types 'drbdadm resume-io <resource>'.

The CRM script simply calls pacemaker and asks it to fence the other
node. When a node has actually failed, then the lost node is fenced. If
both nodes are up but disconnected, as you had, then only the fastest
node will succeed in calling the fence, and the slower node will be
fenced before it can call a fence.
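
For reference, the Pacemaker side of that (the stonith configuration the
handler relies on) could look roughly like this in crm shell syntax -- a
sketch only; the agent choice, node names and IPMI details below are
hypothetical placeholders:

===
primitive st-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.168.1.101 userid=admin passwd=secret \
        op monitor interval=60s
primitive st-node2 stonith:external/ipmi \
        params hostname=node2 ipaddr=192.168.1.102 userid=admin passwd=secret \
        op monitor interval=60s
# keep each fence device off the node it is meant to shoot
location l-st-node1 st-node1 -inf: node1
location l-st-node2 st-node2 -inf: node2
property stonith-enabled=true
===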

-- 
Digimer
E-Mail:  digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin:   http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?"

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster with DRBD : split brain

2011-07-20 Thread Hugo Deprez
Hello Andrew,

in fact DRBD was in standalone mode but the cluster was working :

Here is the syslog of the drbd's split brain :

Jul 15 08:45:34 node1 kernel: [1536023.052245] block drbd0: Handshake
successful: Agreed network protocol version 91
Jul 15 08:45:34 node1 kernel: [1536023.052267] block drbd0: conn(
WFConnection -> WFReportParams )
Jul 15 08:45:34 node1 kernel: [1536023.066677] block drbd0: Starting asender
thread (from drbd0_receiver [23281])
Jul 15 08:45:34 node1 kernel: [1536023.066863] block drbd0:
data-integrity-alg: 
Jul 15 08:45:34 node1 kernel: [1536023.079182] block drbd0:
drbd_sync_handshake:
Jul 15 08:45:34 node1 kernel: [1536023.079190] block drbd0: self
BBA9B794EDB65CDF:9E8FB52F896EF383:C5FE44742558F9E1:1F9E06135B8E296F
bits:75338 flags:0
Jul 15 08:45:34 node1 kernel: [1536023.079196] block drbd0: peer
8343B5F30B2BF674:9E8FB52F896EF382:C5FE44742558F9E0:1F9E06135B8E296F bits:769
flags:0
Jul 15 08:45:34 node1 kernel: [1536023.079200] block drbd0:
uuid_compare()=100 by rule 90
Jul 15 08:45:34 node1 kernel: [1536023.079203] block drbd0: Split-Brain
detected, dropping connection!
Jul 15 08:45:34 node1 kernel: [1536023.079439] block drbd0: helper command:
/sbin/drbdadm split-brain minor-0
Jul 15 08:45:34 node1 kernel: [1536023.083955] block drbd0: meta connection
shut down by peer.
Jul 15 08:45:34 node1 kernel: [1536023.084163] block drbd0: conn(
WFReportParams -> NetworkFailure )
Jul 15 08:45:34 node1 kernel: [1536023.084173] block drbd0: asender
terminated
Jul 15 08:45:34 node1 kernel: [1536023.084176] block drbd0: Terminating
asender thread
Jul 15 08:45:34 node1 kernel: [1536023.084406] block drbd0: helper command:
/sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Jul 15 08:45:34 node1 kernel: [1536023.084420] block drbd0: conn(
NetworkFailure -> Disconnecting )
Jul 15 08:45:34 node1 kernel: [1536023.084430] block drbd0: error receiving
ReportState, l: 4!
Jul 15 08:45:34 node1 kernel: [1536023.084789] block drbd0: Connection
closed
Jul 15 08:45:34 node1 kernel: [1536023.084813] block drbd0: conn(
Disconnecting -> StandAlone )
Jul 15 08:45:34 node1 kernel: [1536023.086345] block drbd0: receiver
terminated
Jul 15 08:45:34 node1 kernel: [1536023.086349] block drbd0: Terminating
receiver thread


On 19 July 2011 02:30, Andrew Beekhof  wrote:

> On Fri, Jul 15, 2011 at 7:58 PM, Hugo Deprez wrote:
> > Dear community,
> >
> > I am running on Debian Lenny, a cluster with corosync. I have :
> >
> > One DRBD partition and 4 resources :
> >
> > fs-data(ocf::heartbeat:Filesystem):
> > mda-ip (ocf::heartbeat:IPaddr2):
> > postfix(ocf::heartbeat:postfix):
> > apache (ocf::heartbeat:apache):
> >
> > Last night something happened and DRBD had a 'split brain'. I think the
> > split brain came from
> >
> > The resources were still running on node 1.
> >
> > I checked the corosync logs and it seems that something went wrong. I would
> > like to understand what happened, in order to improve my cluster
> > configuration.
> >
> > Please find attached the log file.
>
> I see no evidence of a split-brain. Both nodes appear to be able to
> talk to each other.
> What exactly is the problem you encountered?
>
> >
> > It seems that the cluster tried to migrate the resources to the other
> > node but didn't succeed?
> >
> > Any help appreciated.
> >
> > Regards,
> >
> > Hugo
> >
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Cluster with DRBD : split brain

2011-07-18 Thread Andrew Beekhof
On Fri, Jul 15, 2011 at 7:58 PM, Hugo Deprez  wrote:
> Dear community,
>
> I am running on Debian Lenny, a cluster with corosync. I have :
>
> One DRBD partition and 4 resources :
>
> fs-data    (ocf::heartbeat:Filesystem):
> mda-ip (ocf::heartbeat:IPaddr2):
> postfix    (ocf::heartbeat:postfix):
> apache (ocf::heartbeat:apache):
>
> Last night something happened and DRBD had a 'split brain'. I think the split
> brain came from
>
> The resources were still running on node 1.
>
> I checked the corosync logs and it seems that something went wrong. I would
> like to understand what happened, in order to improve my cluster
> configuration.
>
> Please find attached the log file.

I see no evidence of a split-brain. Both nodes appear to be able to
talk to each other.
What exactly is the problem you encountered?

>
> It seems that the cluster tried to migrate the resources to the other node
> but didn't succeed?
>
> Any help appreciated.
>
> Regards,
>
> Hugo
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Cluster with DRBD : split brain

2011-07-15 Thread Hugo Deprez
Dear community,

I am running on Debian Lenny, a cluster with corosync. I have :

One DRBD partition and 4 resources :

fs-data(ocf::heartbeat:Filesystem):
mda-ip (ocf::heartbeat:IPaddr2):
postfix(ocf::heartbeat:postfix):
apache (ocf::heartbeat:apache):

Last night something happened and DRBD had a 'split brain'. I think the split
brain came from

The resources were still running on node 1.

I checked the corosync logs and it seems that something went wrong. I would
like to understand what happened, in order to improve my cluster
configuration.

Please find attached the log file.

It seems that the cluster tried to migrate the resources to the other node
but didn't succeed?

Any help appreciated.

Regards,

Hugo


corosync.log
Description: Binary data
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker