Hi,

Thank you for your quick response,

I am indeed using a diskless watchdog.

I have already looked into setting up a device-dependent watchdog, but wouldn't 
that create a single point of failure in case the shared drive becomes 
unavailable?


[SOGET]

Raphael DUBOIS-LISKI
Ingénieur Système et Réseau
+33 2 35 19 25 54
SOGET SA • 4, rue des Lamaneurs • 76600 Le Havre, FR
[web]<https://signature.soget.fr/l/dnA0alZ4cjRrTERoZlFWSDhaN0xYdz09-VzNGNG5oU1pWYjZRMTVQMUxBZ2xJZz09>
[linkedin]<https://signature.soget.fr/l/dnA0alZ4cjRrTERoZlFWSDhaN0xYdz09-eFZMaExNZWZGQjMvaVVJaDArTTl6Zz09>
[twitter]<https://signature.soget.fr/l/dnA0alZ4cjRrTERoZlFWSDhaN0xYdz09-VjFWNTBIYlNCUDdIbXlxKzJyRzFPUT09>
Disclaimer<http://soget.fr/disclaimer>

From: Damiano Giuliani <damianogiulian...@gmail.com>
Sent: Tuesday, December 5, 2023 18:30
To: Cluster Labs - All topics related to open-source clustering welcomed 
<users@clusterlabs.org>
Subject: Re: [ClusterLabs] Setting up an Active/Active Pacemaker cluster for a 
Postfix/Dovecot cluster, using a DRBD backend for the data storage

Could it be the watchdog? Are you using a diskless watchdog? Two-node clusters 
are not supported in diskless mode.
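If you can attach even a small shared block device to both nodes, SBD can use it and the two-node limitation goes away. A minimal sketch of a disk-based SBD setup (the device path and timeout below are placeholders, not taken from your environment):

```shell
# /etc/sysconfig/sbd (sketch; replace the path with your shared LUN)
SBD_DEVICE="/dev/disk/by-id/scsi-SHARED-LUN"
SBD_WATCHDOG_DEV="/dev/watchdog"
SBD_WATCHDOG_TIMEOUT=5

# Initialize the SBD metadata on the shared device (run once, from one node),
# then enable the sbd service on both nodes:
# sbd -d /dev/disk/by-id/scsi-SHARED-LUN create
# systemctl enable sbd
```

The single-point-of-failure concern is usually addressed by giving SBD up to three devices (multiple SBD_DEVICE entries separated by semicolons), so losing one shared disk does not take the cluster down.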

On Tue, Dec 5, 2023, 5:40 PM Raphael DUBOIS-LISKI 
<raphael.dubois-li...@soget.fr<mailto:raphael.dubois-li...@soget.fr>> wrote:
Hello,

I am seeking help with the setup of an Active/Active Pacemaker cluster that 
relies on DRBD as the data storage backend. The solution is deployed on two 
RHEL 9 VMs, and the file system used is GFS2.

Linked is a PDF of the infrastructure I am currently experimenting with.

For context, this is my Pacemaker cluster config:

Cluster Name: mycluster
Corosync Nodes:
Node1 Node2
Pacemaker Nodes:
Node1 Node2

Resources:
  Clone: Data-clone
    Meta Attributes: Data-clone-meta_attributes
      clone-max=2
      clone-node-max=1
      notify=true
      promotable=true
      promoted-max=2
      promoted-node-max=1
    Resource: Data (class=ocf provider=linbit type=drbd)
      Attributes: Data-instance_attributes
        drbd_resource=drbd0
      Operations:
        demote: Data-demote-interval-0s
          interval=0s timeout=90
        monitor: Data-monitor-interval-60s
          interval=60s
        notify: Data-notify-interval-0s
          interval=0s timeout=90
        promote: Data-promote-interval-0s
          interval=0s timeout=90
        reload: Data-reload-interval-0s
          interval=0s timeout=30
        start: Data-start-interval-0s
          interval=0s timeout=240
        stop: Data-stop-interval-0s
          interval=0s timeout=100
  Clone: dlm-clone
    Meta Attributes: dlm-clone-meta_attributes
      clone-max=2
      clone-node-max=1
    Resource: dlm (class=ocf provider=pacemaker type=controld)
      Operations:
        monitor: dlm-monitor-interval-60s
          interval=60s
        start: dlm-start-interval-0s
          interval=0s timeout=90s
        stop: dlm-stop-interval-0s
          interval=0s timeout=100s
  Clone: FS-clone
    Resource: FS (class=ocf provider=heartbeat type=Filesystem)
      Attributes: FS-instance_attributes
        device=/dev/drbd0
        directory=/home/vusers
        fstype=gfs2
      Operations:
        monitor: FS-monitor-interval-20s
          interval=20s timeout=40s
        start: FS-start-interval-0s
          interval=0s timeout=60s
        stop: FS-stop-interval-0s
          interval=0s timeout=60s
  Clone: smtp_postfix-clone
    Meta Attributes: smtp_postfix-clone-meta_attributes
      clone-max=2
      clone-node-max=1
    Resource: smtp_postfix (class=ocf provider=heartbeat type=postfix)
      Operations:
        monitor: smtp_postfix-monitor-interval-60s
          interval=60s timeout=20s
        reload: smtp_postfix-reload-interval-0s
          interval=0s timeout=20s
        start: smtp_postfix-start-interval-0s
          interval=0s timeout=20s
        stop: smtp_postfix-stop-interval-0s
          interval=0s timeout=20s
  Clone: WebSite-clone
    Resource: WebSite (class=ocf provider=heartbeat type=apache)
      Attributes: WebSite-instance_attributes
        configfile=/etc/httpd/conf/httpd.conf
        statusurl=http://localhost/server-status
      Operations:
        monitor: WebSite-monitor-interval-1min
          interval=1min
        start: WebSite-start-interval-0s
          interval=0s timeout=40s
        stop: WebSite-stop-interval-0s
          interval=0s timeout=60s

Colocation Constraints:
  resource 'FS-clone' with Promoted resource 'Data-clone' (id: colocation-FS-
      Data-clone-INFINITY)
    score=INFINITY
  resource 'WebSite-clone' with resource 'FS-clone' (id: colocation-WebSite-FS-
      INFINITY)
    score=INFINITY
  resource 'FS-clone' with resource 'dlm-clone' (id: colocation-FS-dlm-clone-
      INFINITY)
    score=INFINITY
  resource 'FS-clone' with resource 'smtp_postfix-clone' (id: colocation-FS-
      clone-smtp_postfix-clone-INFINITY)
    score=INFINITY
Order Constraints:
  promote resource 'Data-clone' then start resource 'FS-clone' (id: order-Data-
      clone-FS-mandatory)
  start resource 'FS-clone' then start resource 'WebSite-clone' (id: order-FS-
      WebSite-mandatory)
  start resource 'dlm-clone' then start resource 'FS-clone' (id: order-dlm-
      clone-FS-mandatory)
  start resource 'FS-clone' then start resource 'smtp_postfix-clone' (id: order-
      FS-clone-smtp_postfix-clone-mandatory)

Resources Defaults:
  Meta Attrs: build-resource-defaults
    resource-stickiness=1 (id: build-resource-stickiness)

Operations Defaults:
  Meta Attrs: op_defaults-meta_attributes
    timeout=240s (id: op_defaults-meta_attributes-timeout)

Cluster Properties: cib-bootstrap-options
  cluster-infrastructure=corosync
  cluster-name=mycluster
  dc-version=2.1.6-9.el9-6fdc9deea29
  have-watchdog=true
  last-lrm-refresh=1701787695
  no-quorum-policy=ignore
  stonith-enabled=true
  stonith-watchdog-timeout=10
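
As an aside on the quorum settings above: on two-node clusters, quorum is usually handled by corosync's two_node flag rather than by no-quorum-policy=ignore in Pacemaker. An illustrative corosync.conf fragment (not taken from this cluster) looks like:

```
quorum {
    provider: corosync_votequorum
    two_node: 1    # implies wait_for_all: 1 unless overridden
}
```

With two_node set, the surviving node keeps quorum after a peer failure, and no-quorum-policy can stay at its default.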

And this is my DRBD configuration :

global {
  usage-count no;
}
common {
  disk {
    resync-rate 100M;
    al-extents 257;
  }
}
resource drbd0 {
  protocol C;
    handlers {
      pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
      pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
      local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
      fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
      split-brain "/usr/lib/drbd/notify-split-brain.sh root";
      out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    }
    startup {
      wfc-timeout       1;
      degr-wfc-timeout  1;
      become-primary-on both;
    }
    net {
      # The following lines handle split-brain situations
      # (e.g., if one of the nodes fails)
      after-sb-0pri     discard-zero-changes; # if both nodes are secondary, just make one of them primary
      after-sb-1pri     discard-secondary;    # if one is primary and one is not, trust the primary node
      after-sb-2pri     disconnect;
      allow-two-primaries yes;
      verify-alg        sha1;
    }
    disk {
      on-io-error detach;
    }
    options {
      auto-promote yes;
    }
    on fradevtestmail1 {
       device /dev/drbd0;
       disk /dev/rootvg/drbdlv;
       address X.X.X.X:7788;
       flexible-meta-disk internal;
    }
    on fradevtestmail2 {
       device /dev/drbd0;
       disk /dev/rootvg/drbdlv;
       address X.X.X.X:7788;
       flexible-meta-disk internal;
    }
}
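
One thing worth noting about the dual-primary configuration above: the crm-fence-peer.9.sh handler only fires when DRBD's own fencing policy asks for it. The usual companion setting (a sketch; it does not appear in the config above) is:

```
resource drbd0 {
  net {
    # Suspend I/O and invoke the fence-peer handler when the peer is lost;
    # needed for the Pacemaker fencing integration in dual-primary mode
    fencing resource-and-stonith;
  }
}
```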


Knowing all this:

The cluster works perfectly as expected when both nodes are up, but a problem 
arises when I put the cluster into a degraded state by killing one of the nodes 
improperly (to simulate an unexpected crash).

This causes the remaining node to reboot and restart the cluster. Everything 
goes well in the resource start process until it is time to mount the file 
system, where the mount times out and fails.
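
In case it helps diagnose, these are the kinds of commands I can run on the surviving node to gather more detail (a sketch; output obviously depends on the cluster state):

```shell
# Did Pacemaker actually complete fencing of the lost node?
stonith_admin --history '*'

# DLM lockspaces: a lockspace stuck waiting on fencing means GFS2 will block
dlm_tool ls

# DRBD's view of the peer, and recent cluster logs from this boot
drbdadm status drbd0
journalctl -u pacemaker -u corosync -b
```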

Would you have any idea why this happens, and how I could fix it so that the 
cluster remains usable with one node down, until we can get the second node 
back up and running after an unexpected crash?

Many thanks for your help,

Have a nice day,

BR,




_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
