On 05/24/2018 04:03 PM, Ken Gaillot wrote:
> On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
>> On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>>> 24.05.2018 02:57, Jason Gauthier wrote:
>>>> I'm fairly new to clustering under Linux. I basically have one
>>>> shared storage resource right now, using DLM and GFS2.
>>>> I'm using Fibre Channel, and when both nodes of my two-node
>>>> cluster are up, DLM and GFS2 seem to operate perfectly.
>>>> If I reboot node B, node A works fine, and vice versa.
>>>>
>>>> When node B goes offline unexpectedly and becomes unclean, DLM
>>>> seems to block all I/O to the shared storage.
>>>>
>>>> dlm knows node B is down:
>>>>
>>>> # dlm_tool status
>>>> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>>>> daemon now 865695 fence_pid 18186
>>>> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
>>>> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>>>> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>>>>
>>>> On the same server, I see these messages in my daemon.log:
>>>> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick (reboot) node 1084772369/(null) : No route to host (-113)
>>>> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid 1084772369
>>>>
>>>> I can recover from the situation by forcing it (or by bringing the
>>>> other node back online):
>>>> dlm_tool fence_ack 1084772369
>>>>
>>>> The cluster config is pretty straightforward.
>>>> node 1084772368: alpha
>>>> node 1084772369: beta
>>>> primitive p_dlm_controld ocf:pacemaker:controld \
>>>>     op monitor interval=60 timeout=60 \
>>>>     meta target-role=Started \
>>>>     params args="-K -L -s 1"
>>>> primitive p_fs_gfs2 Filesystem \
>>>>     params device="/dev/sdb2" directory="/vms" fstype=gfs2
>>>> primitive stonith_sbd stonith:external/sbd \
>>>>     params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>>>>     meta target-role=Started
>>> What is the status of the stonith resource? Did you configure SBD
>>> fencing properly?
>> I believe so. It's shown above in my cluster config.
>>
>>> Is the sbd daemon up and running with the proper parameters?
>> Well, no, apparently sbd isn't running. With DLM and GFS2, the
>> cluster handles launching the daemons.
>> I assumed the same here, since the resource shows that it is up.
> Unlike other services, sbd must be up before the cluster starts in
> order for the cluster to use it properly. (Notice the
> "have-watchdog=false" in your cib-bootstrap-options ... that means
> the cluster didn't find sbd running.)
>
> Also, even storage-based sbd requires a working hardware watchdog for
> the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd
> should list the watchdog device. Also, sbd_device in your cluster
> config should match SBD_DEVICE in /etc/sysconfig/sbd.
>
> If you want the cluster to recover services elsewhere after a node
> self-fences (which I'm sure you do), you also need to set the
> stonith-watchdog-timeout cluster property to something greater than
> the value of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster
> will wait that long and then assume the node fenced itself.
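[Putting Ken's points together, a minimal /etc/sysconfig/sbd along these lines would let sbd come up before the cluster stack. The watchdog device path and the timeout value are illustrative assumptions, not taken from the thread; only /dev/sdb1 comes from the poster's config.]

```shell
# /etc/sysconfig/sbd -- illustrative sketch; adjust to your setup
SBD_DEVICE="/dev/sdb1"            # must match sbd_device in the stonith_sbd resource
SBD_WATCHDOG_DEV="/dev/watchdog"  # hardware watchdog device (assumed path)
SBD_WATCHDOG_TIMEOUT=5            # stonith-watchdog-timeout, if set, must exceed this
```

[The sbd daemon itself is typically enabled so that it starts together with the cluster stack (e.g. `systemctl enable sbd` on systemd-based distributions), rather than being managed as a cluster resource.]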
Actually, for the case where there is a shared disk, a successful
fencing attempt via the sbd fencing resource should be enough for the
node to be assumed down.
In the case of a 2-node setup I would even discourage setting
stonith-watchdog-timeout, as we need a real quorum mechanism for that
to work.

Regards,
Klaus

>
>> Online: [ alpha beta ]
>>
>> Full list of resources:
>>
>>  stonith_sbd    (stonith:external/sbd): Started alpha
>>  Clone Set: cl_gfs2 [g_gfs2]
>>      Started: [ alpha beta ]
>>
>>
>>> What is the output of
>>> sbd -d /dev/sdb1 dump
>>> sbd -d /dev/sdb1 list
>> Both nodes seem fine.
>>
>> 0 alpha test beta
>> 1 beta test alpha
>>
>>
>>> on both nodes? Does
>>>
>>> sbd -d /dev/sdb1 message <other-node> test
>>>
>>> work in both directions?
>> It doesn't return an error, yet without a daemon running, I don't
>> think the message is received either.
>>
>>
>>> Does manual fencing using stonith_admin work?
>> I'm not sure at the moment. I think I need to look into why the
>> daemon isn't running.
>>
>>>> group g_gfs2 p_dlm_controld p_fs_gfs2
>>>> clone cl_gfs2 g_gfs2 \
>>>>     meta interleave=true target-role=Started
>>>> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>>>> property cib-bootstrap-options: \
>>>>     have-watchdog=false \
>>>>     dc-version=1.1.16-94ff4df \
>>>>     cluster-infrastructure=corosync \
>>>>     cluster-name=zeta \
>>>>     last-lrm-refresh=1525523370 \
>>>>     stonith-enabled=true \
>>>>     stonith-timeout=20s
>>>>
>>>> Any pointers would be appreciated. I feel like this should be
>>>> working, but I'm not sure if I've missed something.
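[The blocked-I/O state Jason describes shows up in `dlm_tool status` as a pending `fence ... nodedown` line. As a small sketch of how one might detect it mechanically, the script below parses the exact output captured earlier in the thread; on a live node you would pipe in `dlm_tool status` instead. The script name and the awk pattern are this example's own, not part of any dlm tooling.]

```shell
#!/bin/sh
# Detect a pending DLM fence action in `dlm_tool status` output.
# Sample input is the output quoted in the thread.
status_output='cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
daemon now 865695 fence_pid 18186
fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0'

# Extract the node id of any outstanding "nodedown" fence action.
pending=$(printf '%s\n' "$status_output" | awk '/^fence .* nodedown /{print $2}')

if [ -n "$pending" ]; then
    echo "pending fence for nodeid $pending"
    # Manual recovery used in the thread, once you are sure the
    # node really is down:  dlm_tool fence_ack "$pending"
else
    echo "no pending fence"
fi
```

[Note that `dlm_tool fence_ack` only clears the record; the underlying fix is making fencing actually succeed, which is what the sbd discussion above is about.]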
>>>>
>>>> Thanks,
>>>>
>>>> Jason

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org