On 22/05/14 10:47 AM, Robert Dahlem wrote:
Hi,

I have a 4-Node-Cluster (korfwf01, korfwf02, korfwm01, korfwm02).

There is a DRBD resource which should only run on korfwf01 and korfwf02:

primitive DRBD-ffm ocf:linbit:drbd params drbd_resource=ffm \
    op start interval=0 timeout=240 \
    op promote interval=0 timeout=90 \
    op demote interval=0 timeout=90 \
    op notify interval=0 timeout=90 \
    op stop interval=0 timeout=100 \
    op monitor role=Slave timeout=20 interval=20 \
    op monitor role=Master timeout=20 interval=10
ms ms-DRBD-ffm DRBD-ffm \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 \
    notify=true
location loc-ms-DRBD-ffm-korfwm01 ms-DRBD-ffm -inf: korfwm01
location loc-ms-DRBD-ffm-korfwm02 ms-DRBD-ffm -inf: korfwm02

I would like to have a Dummy resource "ALL-ffm" that works much like a
group, but not as strictly. If I move that Dummy resource from node to
node, other resources depending on it should follow.

primitive ALL-ffm ocf:heartbeat:Dummy
location loc-ALL-ffm-korfwf01 ALL-ffm 2: korfwf01
location loc-ALL-ffm-korfwf02 ALL-ffm 1: korfwf02
location loc-ALL-ffm-korfwm01 ALL-ffm -inf: korfwm01
location loc-ALL-ffm-korfwm02 ALL-ffm -inf: korfwm02
colocation coloc-ms-DRBD-ffm-with-ALL-ffm inf: ms-DRBD-ffm:Master ALL-ffm
order ord-ALL-ffm-before-DRBD-ffm inf: ALL-ffm ms-DRBD-ffm

In the beginning everything is ok:
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
  Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
      Masters: [ korfwf01 ]
      Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
   7:ffm/0      Connected    Primary/Secondary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
   7:ffm/0  Connected Secondary/Primary UpToDate/UpToDate

Standby korfwf01, resources are expected to move to korfwf02:
# crm node standby korfwf01
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
  Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
      Masters: [ korfwf02 ]
      Stopped: [ korfwf01 korfwm01 korfwm02 ]
# ssh korfwf01 drbd-overview
   7:ffm/0      Unconfigured . . . .
# ssh korfwf02 drbd-overview
   7:ffm/0  WFConnection Primary/Unknown UpToDate/DUnknown

Standby korfwf02, resources are expected to stop
# crm node standby korfwf02
# crm status
./.
# ssh korfwf01 drbd-overview
   7:ffm/0      Unconfigured . . . .
# ssh korfwf02 drbd-overview
   7:ffm/0      Unconfigured . . . .

Online korfwf02, resources are expected to start on korfwf02
# crm node online korfwf02
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
  Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
      Masters: [ korfwf02 ]
      Stopped: [ korfwf01 korfwm01 korfwm02 ]
# ssh korfwf01 drbd-overview
   7:ffm/0      Unconfigured . . . .
# ssh korfwf02 drbd-overview
   7:ffm/0  WFConnection Primary/Unknown UpToDate/DUnknown

Online korfwf01, resources are expected to STAY on korfwf02
# crm node online korfwf01
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf02
  Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
      Masters: [ korfwf02 ]
      Slaves: [ korfwf01 ]
# ssh korfwf01 drbd-overview
   7:ffm/0      Connected    Secondary/Primary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
   7:ffm/0  Connected Primary/Secondary UpToDate/UpToDate

Move ALL-ffm to korfwf01, resources are expected to move to korfwf01
# crm resource move ALL-ffm korfwf01
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
  Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
      Masters: [ korfwf01 ]
      Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
   7:ffm/0      Connected    Primary/Secondary UpToDate/UpToDate
# ssh korfwf02 drbd-overview
   7:ffm/0  Connected Secondary/Primary UpToDate/UpToDate
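
For context, a sketch of what the move above leaves behind. "crm resource
move" does not relocate anything by itself; it adds a location constraint
that stays in the CIB until "crm resource unmove ALL-ffm" removes it.
Assuming the usual cli-prefer-<resource> naming (the exact form shown by
"crm configure show" differs between Pacemaker/crmsh versions, so treat
this as an approximation), the constraint looks roughly like:

```
location cli-prefer-ALL-ffm ALL-ffm inf: korfwf01
```

As long as it exists, this inf: preference outweighs the scores 2 and 1
configured for ALL-ffm above, so ALL-ffm is pulled back to korfwf01 as
soon as that node is online again.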

Now I "forget" to unmove ALL-ffm and repeat the sequence
# crm node standby korfwf01 ; sleep 10
# crm node standby korfwf02 ; sleep 10
# crm node online korfwf02 ; sleep 10
# crm node online korfwf01 ; sleep 10
# crm status
ALL-ffm        (ocf::heartbeat:Dummy): Started korfwf01
  Master/Slave Set: ms-DRBD-ffm [DRBD-ffm]
      Masters: [ korfwf01 ]
      Slaves: [ korfwf02 ]
# ssh korfwf01 drbd-overview
   7:ffm/0      StandAlone   Primary/Unknown UpToDate/DUnknown
# ssh korfwf02 drbd-overview
   7:ffm/0  WFConnection Secondary/Unknown UpToDate/DUnknown

*BANG* reproducible DRBD split-brain after the last step.

This does NOT happen without the dependencies on the Dummy resource. I
think there might be some unfortunate timing of drbd start and stop
commands.

SLES 11 SP3
drbd-8.4.4-0.22.9
drbd-pacemaker-8.4.4-0.22.9
pacemaker-1.1.10-0.15.25

What can I provide to help analyze this?

Kind regards,
Robert

I can't speak to the pacemaker issue, but I can say that a proper stonith config in pacemaker and fencing config in drbd would prevent a split-brain. This would cause a node to reboot in this scenario, so you still need to resolve it, but a reboot is a heck of a lot better than a split-brain.
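
As a rough sketch of the DRBD side (not a tested config for this cluster): in DRBD 8.4 the fencing policy goes in the disk section, and the handlers point at the crm-fence-peer.sh / crm-unfence-peer.sh scripts from the DRBD packages. The script paths below are the usual install location and are an assumption for this system; the resource name ffm is taken from the config above.

```
resource ffm {
    disk {
        # freeze I/O and call the fence-peer handler when the peer is lost
        fencing resource-and-stonith;
    }
    handlers {
        # adds a constraint in the CIB that keeps the outdated peer from being promoted
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # removes that constraint again once the peer has resynced
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # keep the existing settings of this resource unchanged
}
```

On the Pacemaker side this assumes stonith-enabled=true and a working stonith resource for whatever fence hardware the nodes have, which can't be guessed from here.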

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
