Could it be the watchdog? Are you using a diskless (watchdog-only) SBD? Two-node
clusters are not supported in diskless mode.
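
If so, a quick way to check is whether SBD has any shared device configured at
all. A rough sketch (pcs sub-commands as shipped with RHEL 9; the device path is
only a placeholder):

  # On each node: no devices listed / an empty SBD_DEVICE means SBD is
  # running watchdog-only, i.e. diskless.
  pcs stonith sbd status
  grep '^SBD_DEVICE' /etc/sysconfig/sbd

  # If a small shared LUN is available, poison-pill SBD avoids the diskless
  # two-node limitation (the cluster must be stopped before enabling SBD):
  pcs cluster stop --all
  pcs stonith sbd device setup device=/dev/disk/by-id/my-shared-sbd-lun
  pcs stonith sbd enable device=/dev/disk/by-id/my-shared-sbd-lun
  pcs cluster start --all

With two nodes and no shared disk, a power or hypervisor fence agent is the
usual alternative to watchdog-only SBD.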

On Tue, Dec 5, 2023, 5:40 PM Raphael DUBOIS-LISKI <
raphael.dubois-li...@soget.fr> wrote:

> Hello,
>
>
>
> I am seeking help with the setup of an Active/Active Pacemaker cluster that
> relies on a DRBD cluster as the data storage backend. The solution runs on
> two RHEL 9 VMs, and the file system used is GFS2.
>
>
>
> Linked is a PDF of the infrastructure that I am currently experimenting
> on.
>
>
>
> For context, this is my Pacemaker cluster config:
>
>
>
> Cluster Name: mycluster
>
> Corosync Nodes:
>
> Node1 Node2
>
> Pacemaker Nodes:
>
> Node1 Node2
>
>
>
> Resources:
>
>   Clone: Data-clone
>
>     Meta Attributes: Data-clone-meta_attributes
>
>       clone-max=2
>
>       clone-node-max=1
>
>       notify=true
>
>       promotable=true
>
>       promoted-max=2
>
>       promoted-node-max=1
>
>     Resource: Data (class=ocf provider=linbit type=drbd)
>
>       Attributes: Data-instance_attributes
>
>         drbd_resource=drbd0
>
>       Operations:
>
>         demote: Data-demote-interval-0s
>
>           interval=0s timeout=90
>
>        monitor: Data-monitor-interval-60s
>
>           interval=60s
>
>         notify: Data-notify-interval-0s
>
>           interval=0s timeout=90
>
>         promote: Data-promote-interval-0s
>
>           interval=0s timeout=90
>
>         reload: Data-reload-interval-0s
>
>           interval=0s timeout=30
>
>         start: Data-start-interval-0s
>
>           interval=0s timeout=240
>
>         stop: Data-stop-interval-0s
>
>           interval=0s timeout=100
>
>   Clone: dlm-clone
>
>     Meta Attributes: dlm-clone-meta_attributes
>
>       clone-max=2
>
>       clone-node-max=1
>
>     Resource: dlm (class=ocf provider=pacemaker type=controld)
>
>       Operations:
>
>         monitor: dlm-monitor-interval-60s
>
>           interval=60s
>
>         start: dlm-start-interval-0s
>
>           interval=0s timeout=90s
>
>         stop: dlm-stop-interval-0s
>
>           interval=0s timeout=100s
>
>   Clone: FS-clone
>
>     Resource: FS (class=ocf provider=heartbeat type=Filesystem)
>
>       Attributes: FS-instance_attributes
>
>         device=/dev/drbd0
>
>         directory=/home/vusers
>
>         fstype=gfs2
>
>       Operations:
>
>         monitor: FS-monitor-interval-20s
>
>           interval=20s timeout=40s
>
>         start: FS-start-interval-0s
>
>           interval=0s timeout=60s
>
>         stop: FS-stop-interval-0s
>
>           interval=0s timeout=60s
>
>   Clone: smtp_postfix-clone
>
>     Meta Attributes: smtp_postfix-clone-meta_attributes
>
>       clone-max=2
>
>       clone-node-max=1
>
>     Resource: smtp_postfix (class=ocf provider=heartbeat type=postfix)
>
>       Operations:
>
>         monitor: smtp_postfix-monitor-interval-60s
>
>           interval=60s timeout=20s
>
>         reload: smtp_postfix-reload-interval-0s
>
>           interval=0s timeout=20s
>
>         start: smtp_postfix-start-interval-0s
>
>           interval=0s timeout=20s
>
>         stop: smtp_postfix-stop-interval-0s
>
>           interval=0s timeout=20s
>
>   Clone: WebSite-clone
>
>     Resource: WebSite (class=ocf provider=heartbeat type=apache)
>
>       Attributes: WebSite-instance_attributes
>
>         configfile=/etc/httpd/conf/httpd.conf
>
>         statusurl=http://localhost/server-status
>
>       Operations:
>
>         monitor: WebSite-monitor-interval-1min
>
>           interval=1min
>
>         start: WebSite-start-interval-0s
>
>           interval=0s timeout=40s
>
>         stop: WebSite-stop-interval-0s
>
>           interval=0s timeout=60s
>
>
>
> Colocation Constraints:
>
>   resource 'FS-clone' with Promoted resource 'Data-clone' (id: colocation-FS-Data-clone-INFINITY)
>
>     score=INFINITY
>
>   resource 'WebSite-clone' with resource 'FS-clone' (id: colocation-WebSite-FS-INFINITY)
>
>     score=INFINITY
>
>   resource 'FS-clone' with resource 'dlm-clone' (id: colocation-FS-dlm-clone-INFINITY)
>
>     score=INFINITY
>
>   resource 'FS-clone' with resource 'smtp_postfix-clone' (id: colocation-FS-clone-smtp_postfix-clone-INFINITY)
>
>     score=INFINITY
>
> Order Constraints:
>
>   promote resource 'Data-clone' then start resource 'FS-clone' (id: order-Data-clone-FS-mandatory)
>
>   start resource 'FS-clone' then start resource 'WebSite-clone' (id: order-FS-WebSite-mandatory)
>
>   start resource 'dlm-clone' then start resource 'FS-clone' (id: order-dlm-clone-FS-mandatory)
>
>   start resource 'FS-clone' then start resource 'smtp_postfix-clone' (id: order-FS-clone-smtp_postfix-clone-mandatory)
>
>
>
> Resources Defaults:
>
>   Meta Attrs: build-resource-defaults
>
>     resource-stickiness=1 (id: build-resource-stickiness)
>
>
>
> Operations Defaults:
>
>   Meta Attrs: op_defaults-meta_attributes
>
>     timeout=240s (id: op_defaults-meta_attributes-timeout)
>
>
>
> Cluster Properties: cib-bootstrap-options
>
>   cluster-infrastructure=corosync
>
>   cluster-name=mycluster
>
>   dc-version=2.1.6-9.el9-6fdc9deea29
>
>   have-watchdog=true
>
>   last-lrm-refresh=1701787695
>
>   no-quorum-policy=ignore
>
>   stonith-enabled=true
>
>   stonith-watchdog-timeout=10
>
> And this is my DRBD configuration:
>
>
>
> global {
>
>   usage-count no;
>
> }
>
> common {
>
>   disk {
>
>     resync-rate 100M;
>
>     al-extents 257;
>
>   }
>
> }
>
> resource drbd0 {
>
>   protocol C;
>
>     handlers {
>
>       pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger";
>
>       pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger";
>
>       local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger; halt";
>
>       fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
>
>       after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
>
>       split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>
>       out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>
>     }
>
>     startup {
>
>       wfc-timeout       1;
>
>       degr-wfc-timeout  1;
>
>       become-primary-on both;
>
>     }
>
>     net {
>
>       # The following lines are dedicated to handling
>
>       # split-brain situations (e.g., if one of the nodes fails)
>
>       after-sb-0pri discard-zero-changes; # if both nodes are secondary, auto-resolve by discarding the side with no changes
>
>       after-sb-1pri discard-secondary; # if one node is primary and one is not, trust the primary node
>
>       after-sb-2pri     disconnect;
>
>       allow-two-primaries yes;
>
>       verify-alg        sha1;
>
>     }
>
>     disk {
>
>       on-io-error detach;
>
>     }
>
>     options {
>
>       auto-promote yes;
>
>     }
>
>     on fradevtestmail1 {
>
>        device /dev/drbd0;
>
>        disk /dev/rootvg/drbdlv;
>
>        address X.X.X.X:7788;
>
>        flexible-meta-disk internal;
>
>     }
>
>     on fradevtestmail2 {
>
>        device /dev/drbd0;
>
>        disk /dev/rootvg/drbdlv;
>
>        address X.X.X.X:7788;
>
>        flexible-meta-disk internal;
>
>     }
>
> }
>
>
>
>
>
> Knowing all this,
>
>
>
> The cluster works perfectly as expected when both nodes are up, but a
> problem arises when I put the cluster in a degraded state by killing one of
> the nodes improperly (to simulate an unexpected crash).
>
>
>
> This causes the remaining node to reboot and restart the cluster. Resource
> start-up then goes well until it is time to mount the file system, at which
> point the mount times out and fails.
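>
> (For illustration, this is roughly how I trigger and inspect the failure;
> the commands below are a sketch rather than an exact transcript:)
>
>   # On the node being "crashed" (or simply power the VM off hard):
>   echo c > /proc/sysrq-trigger
>
>   # On the surviving node, once it has rebooted and the cluster restarts:
>   pcs status --full                               # FS-clone start times out here
>   drbdadm status drbd0                            # DRBD role / connection state
>   dlm_tool ls                                     # DLM lockspace state for GFS2
>   journalctl -b -u pacemaker -u corosync -u sbd   # fencing / watchdog messages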
>
>
>
> Would you have any idea why this behaviour happens, and how I could fix it
> so that the cluster remains usable with one node down, until we can get the
> second node back up and running after an unexpected crash?
>
>
>
> Many thanks for your help,
>
>
>
> Have a nice day,
>
>
>
> BR,
>
>
>
>
>
>
>
> Raphael DUBOIS-LISKI
> Systems and Network Engineer
> +33 2 35 19 25 54
> SOGET SA • 4, rue des Lamaneurs • 76600 Le Havre, FR
>
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
