It could be the watchdog. Are you using sbd in diskless (watchdog-only) mode? Two-node clusters are not supported in diskless mode.
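To confirm, you can check whether sbd has a shared device configured — no SBD_DEVICE set means watchdog-only (diskless) sbd. A minimal, self-contained sketch of that check; note it parses a sample fragment written to /tmp rather than the real /etc/sysconfig/sbd (the standard location on RHEL 9), and the sample's contents are made up for illustration:

```shell
# Illustration only: on a real node you would grep /etc/sysconfig/sbd itself.
# Here we create a sample fragment so the logic is self-contained.
cat > /tmp/sbd.sample <<'EOF'
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
EOF

# sbd runs diskless (watchdog-only) when no SBD_DEVICE= line is configured:
if grep -q '^SBD_DEVICE=..*' /tmp/sbd.sample; then
    SBD_MODE="disk-based"
else
    SBD_MODE="diskless"
fi
echo "sbd mode: $SBD_MODE"
```

On a live node, `corosync-quorumtool -s` should also show a 2Node flag when corosync's two_node option is in effect (flag name as I recall it from votequorum output).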
On Tue, Dec 5, 2023, 5:40 PM Raphael DUBOIS-LISKI <raphael.dubois-li...@soget.fr> wrote:
> Hello,
>
> I am seeking help with the setup of an Active/Active Pacemaker cluster that
> relies on a DRBD cluster as the data storage backend. The solution is
> mounted on 2 RHEL9 VMs, and the file system used is GFS2.
>
> Linked is a PDF of the infrastructure that I am currently experimenting on.
>
> For context, this is my Pacemaker cluster config:
>
> Cluster Name: mycluster
> Corosync Nodes:
>   Node1 Node2
> Pacemaker Nodes:
>   Node1 Node2
>
> Resources:
>   Clone: Data-clone
>     Meta Attributes: Data-clone-meta_attributes
>       clone-max=2
>       clone-node-max=1
>       notify=true
>       promotable=true
>       promoted-max=2
>       promoted-node-max=1
>     Resource: Data (class=ocf provider=linbit type=drbd)
>       Attributes: Data-instance_attributes
>         drbd_resource=drbd0
>       Operations:
>         demote: Data-demote-interval-0s
>           interval=0s timeout=90
>         monitor: Data-monitor-interval-60s
>           interval=60s
>         notify: Data-notify-interval-0s
>           interval=0s timeout=90
>         promote: Data-promote-interval-0s
>           interval=0s timeout=90
>         reload: Data-reload-interval-0s
>           interval=0s timeout=30
>         start: Data-start-interval-0s
>           interval=0s timeout=240
>         stop: Data-stop-interval-0s
>           interval=0s timeout=100
>   Clone: dlm-clone
>     Meta Attributes: dlm-clone-meta_attributes
>       clone-max=2
>       clone-node-max=1
>     Resource: dlm (class=ocf provider=pacemaker type=controld)
>       Operations:
>         monitor: dlm-monitor-interval-60s
>           interval=60s
>         start: dlm-start-interval-0s
>           interval=0s timeout=90s
>         stop: dlm-stop-interval-0s
>           interval=0s timeout=100s
>   Clone: FS-clone
>     Resource: FS (class=ocf provider=heartbeat type=Filesystem)
>       Attributes: FS-instance_attributes
>         device=/dev/drbd0
>         directory=/home/vusers
>         fstype=gfs2
>       Operations:
>         monitor: FS-monitor-interval-20s
>           interval=20s timeout=40s
>         start: FS-start-interval-0s
>           interval=0s timeout=60s
>         stop: FS-stop-interval-0s
>           interval=0s timeout=60s
>   Clone: smtp_postfix-clone
>     Meta Attributes: smtp_postfix-clone-meta_attributes
>       clone-max=2
>       clone-node-max=1
>     Resource: smtp_postfix (class=ocf provider=heartbeat type=postfix)
>       Operations:
>         monitor: smtp_postfix-monitor-interval-60s
>           interval=60s timeout=20s
>         reload: smtp_postfix-reload-interval-0s
>           interval=0s timeout=20s
>         start: smtp_postfix-start-interval-0s
>           interval=0s timeout=20s
>         stop: smtp_postfix-stop-interval-0s
>           interval=0s timeout=20s
>   Clone: WebSite-clone
>     Resource: WebSite (class=ocf provider=heartbeat type=apache)
>       Attributes: WebSite-instance_attributes
>         configfile=/etc/httpd/conf/httpd.conf
>         statusurl=http://localhost/server-status
>       Operations:
>         monitor: WebSite-monitor-interval-1min
>           interval=1min
>         start: WebSite-start-interval-0s
>           interval=0s timeout=40s
>         stop: WebSite-stop-interval-0s
>           interval=0s timeout=60s
>
> Colocation Constraints:
>   resource 'FS-clone' with Promoted resource 'Data-clone' (id: colocation-FS-Data-clone-INFINITY)
>     score=INFINITY
>   resource 'WebSite-clone' with resource 'FS-clone' (id: colocation-WebSite-FS-INFINITY)
>     score=INFINITY
>   resource 'FS-clone' with resource 'dlm-clone' (id: colocation-FS-dlm-clone-INFINITY)
>     score=INFINITY
>   resource 'FS-clone' with resource 'smtp_postfix-clone' (id: colocation-FS-clone-smtp_postfix-clone-INFINITY)
>     score=INFINITY
> Order Constraints:
>   promote resource 'Data-clone' then start resource 'FS-clone' (id: order-Data-clone-FS-mandatory)
>   start resource 'FS-clone' then start resource 'WebSite-clone' (id: order-FS-WebSite-mandatory)
>   start resource 'dlm-clone' then start resource 'FS-clone' (id: order-dlm-clone-FS-mandatory)
>   start resource 'FS-clone' then start resource 'smtp_postfix-clone' (id: order-FS-clone-smtp_postfix-clone-mandatory)
>
> Resources Defaults:
>   Meta Attrs: build-resource-defaults
>     resource-stickiness=1 (id: build-resource-stickiness)
>
> Operations Defaults:
>   Meta Attrs: op_defaults-meta_attributes
>     timeout=240s (id: op_defaults-meta_attributes-timeout)
>
> Cluster Properties: cib-bootstrap-options
>   cluster-infrastructure=corosync
>   cluster-name=mycluster
>   dc-version=2.1.6-9.el9-6fdc9deea29
>   have-watchdog=true
>   last-lrm-refresh=1701787695
>   no-quorum-policy=ignore
>   stonith-enabled=true
>   stonith-watchdog-timeout=10
>
> And this is my DRBD configuration:
>
> global {
>     usage-count no;
> }
> common {
>     disk {
>         resync-rate 100M;
>         al-extents 257;
>     }
> }
> resource drbd0 {
>     protocol C;
>     handlers {
>         pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger";
>         pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger";
>         local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt";
>         fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
>         after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>         out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>     }
>     startup {
>         wfc-timeout 1;
>         degr-wfc-timeout 1;
>         become-primary-on both;
>     }
>     net {
>         # The following lines handle split-brain situations
>         # (e.g., if one of the nodes fails)
>         after-sb-0pri discard-zero-changes;  # if both nodes are Secondary, just make one of them Primary
>         after-sb-1pri discard-secondary;     # if one is Primary and one is not, trust the Primary node
>         after-sb-2pri disconnect;
>         allow-two-primaries yes;
>         verify-alg sha1;
>     }
>     disk {
>         on-io-error detach;
>     }
>     options {
>         auto-promote yes;
>     }
>     on fradevtestmail1 {
>         device /dev/drbd0;
>         disk /dev/rootvg/drbdlv;
>         address X.X.X.X:7788;
>         flexible-meta-disk internal;
>     }
>     on fradevtestmail2 {
>         device /dev/drbd0;
>         disk /dev/rootvg/drbdlv;
>         address X.X.X.X:7788;
>         flexible-meta-disk internal;
>     }
> }
>
> Knowing all this:
>
> The cluster works perfectly as expected when both nodes are up, but a
> problem arises when I put the cluster into a degraded state by killing one
> of the nodes improperly (to simulate an unexpected crash).
>
> This causes the remaining node to reboot and restart the cluster.
> Everything goes well in the resource start process until it is time to
> mount the file system, where it times out and fails.
>
> Would you have any idea why this behaviour happens, and how I could fix it
> so that the cluster remains usable even with one node down, until we can
> get the second node back up and running after an unexpected crash?
>
> Many thanks for your help. Have a nice day,
>
> BR,
>
> Raphael DUBOIS-LISKI
> Systems and Network Engineer (Ingénieur Système et Réseau)
> +33 2 35 19 25 54
> SOGET SA • 4, rue des Lamaneurs • 76600 Le Havre, FR
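For reference, the two-node special case mentioned above is declared in the quorum section of corosync.conf. A typical fragment looks like this (illustrative, not copied from your cluster; see the votequorum man page for the exact semantics):

```text
quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node automatically enables wait_for_all unless it is
    # explicitly overridden
    wait_for_all: 1
}
```

With only two votes and watchdog-only sbd, the surviving node has no quorum and no shared disk through which to fence its peer; adding a shared SBD_DEVICE or a corosync quorum device (qdevice) are the usual ways to make a two-node cluster survive a crash of one node.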
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/