On 05/24/2018 06:19 AM, Andrei Borzenkov wrote:
> On 24.05.2018 02:57, Jason Gauthier wrote:
>> I'm fairly new to clustering under Linux. I basically have one shared
>> storage resource right now, using DLM and GFS2.
>> I'm using Fibre Channel, and when both of my nodes are up (2-node cluster)
>> DLM and GFS2 seem to be operating perfectly.
>> If I reboot node B, node A works fine, and vice versa.
>>
>> When node B goes offline unexpectedly and becomes unclean, DLM seems to
>> block all I/O to the shared storage.
>>
>> dlm knows node B is down:
>>
>> # dlm_tool status
>> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>> daemon now 865695 fence_pid 18186
>> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
>> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>>
>> On the same server, I see these messages in my daemon.log:
>> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick (reboot) node 1084772369/(null) : No route to host (-113)
>> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid 1084772369
>>
>> I can recover from the situation by forcing it (or by bringing the other node
>> back online):
>> dlm_tool fence_ack 1084772369
>>
>> The cluster config is pretty straightforward:
>> node 1084772368: alpha
>> node 1084772369: beta
>> primitive p_dlm_controld ocf:pacemaker:controld \
>>     op monitor interval=60 timeout=60 \
>>     meta target-role=Started \
>>     params args="-K -L -s 1"
>> primitive p_fs_gfs2 Filesystem \
>>     params device="/dev/sdb2" directory="/vms" fstype=gfs2
>> primitive stonith_sbd stonith:external/sbd \
>>     params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>>     meta target-role=Started
>
> What is the status of the stonith resource? Did you configure SBD fencing
> properly? Is the sbd daemon up and running with the proper parameters? What is
> the output of
>
> sbd -d /dev/sdb1 dump
> sbd -d /dev/sdb1 list
>
> on both nodes? Does
>
> sbd -d /dev/sdb1 message <other-node> test
>
> work in both directions?
>
> Does manual fencing using stonith_admin work?
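(Illustrative only: with the node names alpha/beta and the SBD device /dev/sdb1 taken from the configuration quoted below, the checks Andrei suggests would be invoked roughly like this; adjust the names and paths to your actual setup.)

    # verify the SBD header and the per-node slots on the shared device
    sbd -d /dev/sdb1 dump
    sbd -d /dev/sdb1 list

    # from alpha, test message delivery to beta (repeat in the other direction)
    sbd -d /dev/sdb1 message beta test

    # test manual fencing through pacemaker's fencing layer
    stonith_admin --reboot beta
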
And check that your sbd is new enough (1.3.1 to be on the safe side); otherwise it won't work properly with the two_node option enabled in corosync. But that wouldn't explain your problem - it would rather be the other way round: you would still get access to the device, while it might not be assured that the sbd-fenced node properly watchdog-suicides in case it loses access to the storage.

Regards,
Klaus

>
>> group g_gfs2 p_dlm_controld p_fs_gfs2
>> clone cl_gfs2 g_gfs2 \
>>     meta interleave=true target-role=Started
>> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>> property cib-bootstrap-options: \
>>     have-watchdog=false \
>>     dc-version=1.1.16-94ff4df \
>>     cluster-infrastructure=corosync \
>>     cluster-name=zeta \
>>     last-lrm-refresh=1525523370 \
>>     stonith-enabled=true \
>>     stonith-timeout=20s
>>
>> Any pointers would be appreciated. I feel like this should be working, but
>> I'm not sure if I've missed something.
>>
>> Thanks,
>>
>> Jason
>>

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
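(Illustrative only: the "2-node" setting Klaus refers to is the two_node flag in the quorum section of corosync.conf; a minimal sketch of that fragment, not taken from the poster's actual file, is shown below. The installed sbd version can be checked with the distribution's package manager, e.g. rpm -q sbd or dpkg -l sbd.)

    quorum {
        provider: corosync_votequorum
        # with two_node enabled, the cluster stays quorate when one of the
        # two nodes is lost; this mode is what requires a sufficiently new sbd
        two_node: 1
    }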