Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Jason Gauthier
On Thu, May 24, 2018 at 10:40 AM, Ken Gaillot wrote:
> On Thu, 2018-05-24 at 16:14 +0200, Klaus Wenninger wrote:
>> On 05/24/2018 04:03 PM, Ken Gaillot wrote:
>> > On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
>> > > On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
>> > > > 24.05.2018 02:57, Jason Gauthier wrote:
>> > > > > I'm fairly new to clustering under Linux.  I basically have one
>> > > > > shared storage resource right now, using dlm and gfs2.
>> > > > > I'm using fibre channel and when both of my nodes are up (2
>> > > > > node
>> > > > > cluster)
>> > > > > dlm and gfs2 seem to be operating perfectly.
>> > > > > If I reboot node B, node A works fine and vice-versa.
>> > > > >
>> > > > > When node B goes offline unexpectedly and becomes unclean, dlm
>> > > > > seems to block all IO to the shared storage.
>> > > > >
>> > > > > dlm knows node B is down:
>> > > > >
>> > > > > # dlm_tool status
>> > > > > cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>> > > > > daemon now 865695 fence_pid 18186
>> > > > > fence 1084772369 nodedown pid 18186 actor 1084772368 fail
>> > > > > 1527119246 fence
>> > > > > 0 now 1527119524
>> > > > > node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>> > > > > node 1084772369 X add 865239 rem 865416 fail 865416 fence 0
>> > > > > at 0
>> > > > > 0
>> > > > >
>> > > > > on the same server, I see these messages in my daemon.log
>> > > > > May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick:
>> > > > > Could
>> > > > > not kick
>> > > > > (reboot) node 1084772369/(null) : No route to host (-113)
>> > > > > May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error
>> > > > > -113
>> > > > > nodeid
>> > > > > 1084772369
>> > > > >
>> > > > > I can recover from the situation by forcing it (or bring the
>> > > > > other node
>> > > > > back online)
>> > > > > dlm_tool fence_ack 1084772369
>> > > > >
>> > > > > cluster config is pretty straightforward.
>> > > > > node 1084772368: alpha
>> > > > > node 1084772369: beta
>> > > > > primitive p_dlm_controld ocf:pacemaker:controld \
>> > > > > op monitor interval=60 timeout=60 \
>> > > > > meta target-role=Started \
>> > > > > params args="-K -L -s 1"
>> > > > > primitive p_fs_gfs2 Filesystem \
>> > > > > params device="/dev/sdb2" directory="/vms"
>> > > > > fstype=gfs2
>> > > > > primitive stonith_sbd stonith:external/sbd \
>> > > > > params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>> > > > > meta target-role=Started
>> > > >
>> > > > What is the status of stonith resource? Did you configure SBD
>> > > > fencing
>> > > > properly?
>> > >
>> > > I believe so.  It's shown above in my cluster config.
>> > >
>> > > > Is sbd daemon up and running with proper parameters?
>> > >
>> > > Well, no, apparently sbd isn't running. With dlm and gfs2, the
>> > > cluster handles launching the daemons.
>> > > I assumed the same here, since the resource shows that it is up.
>> >
>> > Unlike other services, sbd must be up before the cluster starts in
>> > order for the cluster to use it properly. (Notice the "have-
>> > watchdog=false" in your cib-bootstrap-options ... that means the
>> > cluster didn't find sbd running.)
>> >
>> > Also, even storage-based sbd requires a working hardware watchdog
>> > for
>> > the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd
>> > should
>> > list the watchdog device. Also sbd_device in your cluster config
>> > should
>> > match SBD_DEVICE in /etc/sysconfig/sbd.
>> >
>> > If you want the cluster to recover services elsewhere after a node
>> > self-fences (which I'm sure you do), you also need to set the
>> > stonith-
>> > watchdog-timeout cluster property to something greater than the
>> > value
>> > of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster will
>> > wait
>> > that long and then assume the node fenced itself.

Thanks.  So, for whatever reason, sbd was not running. I went ahead
and got /etc/default/sbd (debian) configured.
I can't start the service manually due to dependencies, but I rebooted
node B and it came up.
Node A would not, so I ended up rebooting both nodes at the same time,
and sbd was running on both.

I forced a failure of node B, and after a few seconds node A was able
to access the shared storage.
Definite improvement!
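
A minimal way to verify that state after a boot, assuming systemd and the
crm shell used elsewhere in this thread (commands only; output will vary):

    systemctl enable sbd                      # pull sbd in with the cluster stack
    systemctl status sbd                      # confirm it is actually running
    crm configure show | grep have-watchdog   # should now report true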



>> Actually, for the case where there is a shared disk, a successful
>> fencing attempt via the sbd fencing resource should be enough
>> for the node to be assumed down.
>> In case of a 2-node setup I would even discourage setting
>> stonith-watchdog-timeout, as we need a real quorum mechanism
>> for that to work.
>
> Ah, thanks -- I've updated the wiki how-to, feel free to clarify
> further:
>
> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker
>
>>
>> Regards,
>> Klaus
>>
>> >
>> > > Online: [ alpha beta ]
>> > >
>> > > Full list of resources:

Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Ken Gaillot
On Thu, 2018-05-24 at 16:14 +0200, Klaus Wenninger wrote:
> On 05/24/2018 04:03 PM, Ken Gaillot wrote:
> > On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
> > > On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
> > > > 24.05.2018 02:57, Jason Gauthier wrote:
> > > > > I'm fairly new to clustering under Linux.  I basically have one
> > > > > shared storage resource right now, using dlm and gfs2.
> > > > > I'm using fibre channel and when both of my nodes are up (2
> > > > > node
> > > > > cluster)
> > > > > dlm and gfs2 seem to be operating perfectly.
> > > > > If I reboot node B, node A works fine and vice-versa.
> > > > > 
> > > > > When node B goes offline unexpectedly and becomes unclean, dlm
> > > > > seems to block all IO to the shared storage.
> > > > > 
> > > > > dlm knows node B is down:
> > > > > 
> > > > > # dlm_tool status
> > > > > cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
> > > > > daemon now 865695 fence_pid 18186
> > > > > fence 1084772369 nodedown pid 18186 actor 1084772368 fail
> > > > > 1527119246 fence
> > > > > 0 now 1527119524
> > > > > node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
> > > > > node 1084772369 X add 865239 rem 865416 fail 865416 fence 0
> > > > > at 0
> > > > > 0
> > > > > 
> > > > > on the same server, I see these messages in my daemon.log
> > > > > May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick:
> > > > > Could
> > > > > not kick
> > > > > (reboot) node 1084772369/(null) : No route to host (-113)
> > > > > May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error
> > > > > -113
> > > > > nodeid
> > > > > 1084772369
> > > > > 
> > > > > I can recover from the situation by forcing it (or bring the
> > > > > other node
> > > > > back online)
> > > > > dlm_tool fence_ack 1084772369
> > > > > 
> > > > > cluster config is pretty straightforward.
> > > > > node 1084772368: alpha
> > > > > node 1084772369: beta
> > > > > primitive p_dlm_controld ocf:pacemaker:controld \
> > > > > op monitor interval=60 timeout=60 \
> > > > > meta target-role=Started \
> > > > > params args="-K -L -s 1"
> > > > > primitive p_fs_gfs2 Filesystem \
> > > > > params device="/dev/sdb2" directory="/vms"
> > > > > fstype=gfs2
> > > > > primitive stonith_sbd stonith:external/sbd \
> > > > > params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
> > > > > meta target-role=Started
> > > > 
> > > > What is the status of stonith resource? Did you configure SBD
> > > > fencing
> > > > properly?
> > > 
> > > I believe so.  It's shown above in my cluster config.
> > > 
> > > > Is sbd daemon up and running with proper parameters?
> > > 
> > > Well, no, apparently sbd isn't running. With dlm and gfs2, the
> > > cluster handles launching the daemons.
> > > I assumed the same here, since the resource shows that it is up.
> > 
> > Unlike other services, sbd must be up before the cluster starts in
> > order for the cluster to use it properly. (Notice the "have-
> > watchdog=false" in your cib-bootstrap-options ... that means the
> > cluster didn't find sbd running.)
> > 
> > Also, even storage-based sbd requires a working hardware watchdog
> > for
> > the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd
> > should
> > list the watchdog device. Also sbd_device in your cluster config
> > should
> > match SBD_DEVICE in /etc/sysconfig/sbd.
> > 
> > If you want the cluster to recover services elsewhere after a node
> > self-fences (which I'm sure you do), you also need to set the
> > stonith-
> > watchdog-timeout cluster property to something greater than the
> > value
> > of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster will
> > wait
> > that long and then assume the node fenced itself.
> 
> Actually, for the case where there is a shared disk, a successful
> fencing attempt via the sbd fencing resource should be enough
> for the node to be assumed down.
> In case of a 2-node setup I would even discourage setting
> stonith-watchdog-timeout, as we need a real quorum mechanism
> for that to work.

Ah, thanks -- I've updated the wiki how-to, feel free to clarify
further:

https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker

> 
> Regards,
> Klaus
>  
> > 
> > > Online: [ alpha beta ]
> > > 
> > > Full list of resources:
> > > 
> > >  stonith_sbd(stonith:external/sbd): Started alpha
> > >  Clone Set: cl_gfs2 [g_gfs2]
> > >  Started: [ alpha beta ]
> > > 
> > > 
> > > > What is output of
> > > > sbd -d /dev/sdb1 dump
> > > > sbd -d /dev/sdb1 list
> > > 
> > > Both nodes seem fine.
> > > 
> > > 0   alpha   test    beta
> > > 1   beta    test    alpha
> > > 
> > > 
> > > > on both nodes? Does
> > > > 
> > > > sbd -d /dev/sdb1 message <node> test
> > > > 
> > > > work in both directions?
> > > 
> > > It doesn't return an error, yet without a daemon running, I don't
> > > think the message is received either.

Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Klaus Wenninger
On 05/24/2018 04:03 PM, Ken Gaillot wrote:
> On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
>> On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
>>> 24.05.2018 02:57, Jason Gauthier wrote:
 I'm fairly new to clustering under Linux.  I basically have one shared
 storage resource right now, using dlm and gfs2.
 I'm using fibre channel and when both of my nodes are up (2 node
 cluster)
 dlm and gfs2 seem to be operating perfectly.
 If I reboot node B, node A works fine and vice-versa.

 When node B goes offline unexpectedly and becomes unclean, dlm seems to
 block all IO to the shared storage.

 dlm knows node B is down:

 # dlm_tool status
 cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
 daemon now 865695 fence_pid 18186
 fence 1084772369 nodedown pid 18186 actor 1084772368 fail
 1527119246 fence
 0 now 1527119524
 node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
 node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0
 0

 on the same server, I see these messages in my daemon.log
 May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could
 not kick
 (reboot) node 1084772369/(null) : No route to host (-113)
 May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113
 nodeid
 1084772369

 I can recover from the situation by forcing it (or bring the
 other node
 back online)
 dlm_tool fence_ack 1084772369

 cluster config is pretty straightforward.
 node 1084772368: alpha
 node 1084772369: beta
 primitive p_dlm_controld ocf:pacemaker:controld \
 op monitor interval=60 timeout=60 \
 meta target-role=Started \
 params args="-K -L -s 1"
 primitive p_fs_gfs2 Filesystem \
 params device="/dev/sdb2" directory="/vms" fstype=gfs2
 primitive stonith_sbd stonith:external/sbd \
 params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
 meta target-role=Started
>>> What is the status of stonith resource? Did you configure SBD
>>> fencing
>>> properly?
>> I believe so.  It's shown above in my cluster config.
>>
>>> Is sbd daemon up and running with proper parameters?
>> Well, no, apparently sbd isn't running. With dlm and gfs2, the
>> cluster handles launching the daemons.
>> I assumed the same here, since the resource shows that it is up.
> Unlike other services, sbd must be up before the cluster starts in
> order for the cluster to use it properly. (Notice the "have-
> watchdog=false" in your cib-bootstrap-options ... that means the
> cluster didn't find sbd running.)
>
> Also, even storage-based sbd requires a working hardware watchdog for
> the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd should
> list the watchdog device. Also sbd_device in your cluster config should
> match SBD_DEVICE in /etc/sysconfig/sbd.
>
> If you want the cluster to recover services elsewhere after a node
> self-fences (which I'm sure you do), you also need to set the stonith-
> watchdog-timeout cluster property to something greater than the value
> of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster will wait
> that long and then assume the node fenced itself.

Actually, for the case where there is a shared disk, a successful
fencing attempt via the sbd fencing resource should be enough
for the node to be assumed down.
In case of a 2-node setup I would even discourage setting
stonith-watchdog-timeout, as we need a real quorum mechanism
for that to work.
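
For reference, a "2-node setup" here means corosync's two_node quorum mode;
a minimal corosync.conf quorum section for such a cluster looks roughly like
this (values assumed, check corosync.conf(5)/votequorum(5) for your version):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implies wait_for_all, so both nodes must be seen once
        # after a cold start before services are started
    }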

Regards,
Klaus
 
>
>> Online: [ alpha beta ]
>>
>> Full list of resources:
>>
>>  stonith_sbd(stonith:external/sbd): Started alpha
>>  Clone Set: cl_gfs2 [g_gfs2]
>>  Started: [ alpha beta ]
>>
>>
>>> What is output of
>>> sbd -d /dev/sdb1 dump
>>> sbd -d /dev/sdb1 list
>> Both nodes seem fine.
>>
>> 0   alpha   test    beta
>> 1   beta    test    alpha
>>
>>
>>> on both nodes? Does
>>>
>>> sbd -d /dev/sdb1 message <node> test
>>>
>>> work in both directions?
>> It doesn't return an error, yet without a daemon running, I don't
>> think the message is received either.
>>
>>
>>> Does manual fencing using stonith_admin work?
>> I'm not sure at the moment.  I think I need to look into why the
>> daemon isn't running.
>>
 group g_gfs2 p_dlm_controld p_fs_gfs2
 clone cl_gfs2 g_gfs2 \
 meta interleave=true target-role=Started
 location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
 property cib-bootstrap-options: \
 have-watchdog=false \
 dc-version=1.1.16-94ff4df \
 cluster-infrastructure=corosync \
 cluster-name=zeta \
 last-lrm-refresh=1525523370 \
 stonith-enabled=true \
 stonith-timeout=20s

 Any pointers would be appreciated. I feel like this should be
 working but
 I'm not sure if I've missed something.

Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Ken Gaillot
On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
> On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
> > 24.05.2018 02:57, Jason Gauthier wrote:
> > > I'm fairly new to clustering under Linux.  I basically have one shared
> > > storage resource right now, using dlm and gfs2.
> > > I'm using fibre channel and when both of my nodes are up (2 node
> > > cluster)
> > > dlm and gfs2 seem to be operating perfectly.
> > > If I reboot node B, node A works fine and vice-versa.
> > > 
> > > When node B goes offline unexpectedly and becomes unclean, dlm
> > > seems to block all IO to the shared storage.
> > > 
> > > dlm knows node B is down:
> > > 
> > > # dlm_tool status
> > > cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
> > > daemon now 865695 fence_pid 18186
> > > fence 1084772369 nodedown pid 18186 actor 1084772368 fail
> > > 1527119246 fence
> > > 0 now 1527119524
> > > node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
> > > node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0
> > > 0
> > > 
> > > on the same server, I see these messages in my daemon.log
> > > May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could
> > > not kick
> > > (reboot) node 1084772369/(null) : No route to host (-113)
> > > May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113
> > > nodeid
> > > 1084772369
> > > 
> > > I can recover from the situation by forcing it (or bring the
> > > other node
> > > back online)
> > > dlm_tool fence_ack 1084772369
> > > 
> > > cluster config is pretty straightforward.
> > > node 1084772368: alpha
> > > node 1084772369: beta
> > > primitive p_dlm_controld ocf:pacemaker:controld \
> > > op monitor interval=60 timeout=60 \
> > > meta target-role=Started \
> > > params args="-K -L -s 1"
> > > primitive p_fs_gfs2 Filesystem \
> > > params device="/dev/sdb2" directory="/vms" fstype=gfs2
> > > primitive stonith_sbd stonith:external/sbd \
> > > params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
> > > meta target-role=Started
> > 
> > What is the status of stonith resource? Did you configure SBD
> > fencing
> > properly?
> 
> I believe so.  It's shown above in my cluster config.
> 
> > Is sbd daemon up and running with proper parameters?
> 
> Well, no, apparently sbd isn't running. With dlm and gfs2, the
> cluster handles launching the daemons.
> I assumed the same here, since the resource shows that it is up.

Unlike other services, sbd must be up before the cluster starts in
order for the cluster to use it properly. (Notice the "have-
watchdog=false" in your cib-bootstrap-options ... that means the
cluster didn't find sbd running.)

Also, even storage-based sbd requires a working hardware watchdog for
the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd should
list the watchdog device. Also sbd_device in your cluster config should
match SBD_DEVICE in /etc/sysconfig/sbd.
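
For illustration, the relevant pieces of /etc/sysconfig/sbd (Debian:
/etc/default/sbd) would look roughly like this for the configuration shown
in this thread - the values are examples, not a tested setup:

    SBD_DEVICE="/dev/sdb1"            # must match sbd_device in stonith_sbd
    SBD_WATCHDOG_DEV="/dev/watchdog"  # the hardware watchdog device
    SBD_WATCHDOG_TIMEOUT="5"          # seconds before the watchdog fires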

If you want the cluster to recover services elsewhere after a node
self-fences (which I'm sure you do), you also need to set the stonith-
watchdog-timeout cluster property to something greater than the value
of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster will wait
that long and then assume the node fenced itself.
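
A minimal sketch of setting that property with the crm shell used elsewhere
in this thread, assuming SBD_WATCHDOG_TIMEOUT=5 as above (but note the
caveat about 2-node clusters later in the thread):

    crm configure property stonith-watchdog-timeout=10s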

> 
> Online: [ alpha beta ]
> 
> Full list of resources:
> 
>  stonith_sbd(stonith:external/sbd): Started alpha
>  Clone Set: cl_gfs2 [g_gfs2]
>  Started: [ alpha beta ]
> 
> 
> > What is output of
> > sbd -d /dev/sdb1 dump
> > sbd -d /dev/sdb1 list
> 
> Both nodes seem fine.
> 
> 0   alpha   test    beta
> 1   beta    test    alpha
> 
> 
> > on both nodes? Does
> > 
> > sbd -d /dev/sdb1 message <node> test
> > 
> > work in both directions?
> 
> It doesn't return an error, yet without a daemon running, I don't
> think the message is received either.
> 
> 
> > Does manual fencing using stonith_admin work?
> 
> I'm not sure at the moment.  I think I need to look into why the
> daemon isn't running.
> 
> > > group g_gfs2 p_dlm_controld p_fs_gfs2
> > > clone cl_gfs2 g_gfs2 \
> > > meta interleave=true target-role=Started
> > > location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
> > > property cib-bootstrap-options: \
> > > have-watchdog=false \
> > > dc-version=1.1.16-94ff4df \
> > > cluster-infrastructure=corosync \
> > > cluster-name=zeta \
> > > last-lrm-refresh=1525523370 \
> > > stonith-enabled=true \
> > > stonith-timeout=20s
> > > 
> > > Any pointers would be appreciated. I feel like this should be
> > > working but
> > > I'm not sure if I've missed something.
> > > 
> > > Thanks,
> > > 
> > > Jason
> > > 
> > > 
> > > 

Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Jason Gauthier
On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov wrote:
> 24.05.2018 02:57, Jason Gauthier wrote:
>> I'm fairly new to clustering under Linux.  I basically have one shared
>> storage resource right now, using dlm and gfs2.
>> I'm using fibre channel and when both of my nodes are up (2 node cluster)
>> dlm and gfs2 seem to be operating perfectly.
>> If I reboot node B, node A works fine and vice-versa.
>>
>> When node B goes offline unexpectedly and becomes unclean, dlm seems to
>> block all IO to the shared storage.
>>
>> dlm knows node B is down:
>>
>> # dlm_tool status
>> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>> daemon now 865695 fence_pid 18186
>> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence
>> 0 now 1527119524
>> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>>
>> on the same server, I see these messages in my daemon.log
>> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick
>> (reboot) node 1084772369/(null) : No route to host (-113)
>> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid
>> 1084772369
>>
>> I can recover from the situation by forcing it (or bring the other node
>> back online)
>> dlm_tool fence_ack 1084772369
>>
>> cluster config is pretty straightforward.
>> node 1084772368: alpha
>> node 1084772369: beta
>> primitive p_dlm_controld ocf:pacemaker:controld \
>> op monitor interval=60 timeout=60 \
>> meta target-role=Started \
>> params args="-K -L -s 1"
>> primitive p_fs_gfs2 Filesystem \
>> params device="/dev/sdb2" directory="/vms" fstype=gfs2
>> primitive stonith_sbd stonith:external/sbd \
>> params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>> meta target-role=Started
>
> What is the status of stonith resource? Did you configure SBD fencing
> properly?

I believe so.  It's shown above in my cluster config.

> Is sbd daemon up and running with proper parameters?

Well, no, apparently sbd isn't running. With dlm and gfs2, the
cluster handles launching the daemons.
I assumed the same here, since the resource shows that it is up.

Online: [ alpha beta ]

Full list of resources:

 stonith_sbd(stonith:external/sbd): Started alpha
 Clone Set: cl_gfs2 [g_gfs2]
 Started: [ alpha beta ]


> What is output of
> sbd -d /dev/sdb1 dump
> sbd -d /dev/sdb1 list

Both nodes seem fine.

0   alpha   test    beta
1   beta    test    alpha


> on both nodes? Does
>
> sbd -d /dev/sdb1 message <node> test
>
> work in both directions?

It doesn't return an error, yet without a daemon running, I don't
think the message is received either.


> Does manual fencing using stonith_admin work?

I'm not sure at the moment.  I think I need to look into why the
daemon isn't running.

>> group g_gfs2 p_dlm_controld p_fs_gfs2
>> clone cl_gfs2 g_gfs2 \
>> meta interleave=true target-role=Started
>> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>> property cib-bootstrap-options: \
>> have-watchdog=false \
>> dc-version=1.1.16-94ff4df \
>> cluster-infrastructure=corosync \
>> cluster-name=zeta \
>> last-lrm-refresh=1525523370 \
>> stonith-enabled=true \
>> stonith-timeout=20s
>>
>> Any pointers would be appreciated. I feel like this should be working but
>> I'm not sure if I've missed something.
>>
>> Thanks,
>>
>> Jason
>>
>>
>>


Re: [ClusterLabs] DLM fencing

2018-05-24 Thread Klaus Wenninger
On 05/24/2018 06:19 AM, Andrei Borzenkov wrote:
> 24.05.2018 02:57, Jason Gauthier wrote:
>> I'm fairly new to clustering under Linux.  I basically have one shared
>> storage resource right now, using dlm and gfs2.
>> I'm using fibre channel and when both of my nodes are up (2 node cluster)
>> dlm and gfs2 seem to be operating perfectly.
>> If I reboot node B, node A works fine and vice-versa.
>>
>> When node B goes offline unexpectedly and becomes unclean, dlm seems to
>> block all IO to the shared storage.
>>
>> dlm knows node B is down:
>>
>> # dlm_tool status
>> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>> daemon now 865695 fence_pid 18186
>> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence
>> 0 now 1527119524
>> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>>
>> on the same server, I see these messages in my daemon.log
>> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick
>> (reboot) node 1084772369/(null) : No route to host (-113)
>> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid
>> 1084772369
>>
>> I can recover from the situation by forcing it (or bring the other node
>> back online)
>> dlm_tool fence_ack 1084772369
>>
>> cluster config is pretty straightforward.
>> node 1084772368: alpha
>> node 1084772369: beta
>> primitive p_dlm_controld ocf:pacemaker:controld \
>> op monitor interval=60 timeout=60 \
>> meta target-role=Started \
>> params args="-K -L -s 1"
>> primitive p_fs_gfs2 Filesystem \
>> params device="/dev/sdb2" directory="/vms" fstype=gfs2
>> primitive stonith_sbd stonith:external/sbd \
>> params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>> meta target-role=Started
> What is the status of stonith resource? Did you configure SBD fencing
> properly?  Is sbd daemon up and running with proper parameters? What is
> output of
>
> sbd -d /dev/sdb1 dump
> sbd -d /dev/sdb1 list
>
> on both nodes? Does
>
> sbd -d /dev/sdb1 message <node> test
>
> work in both directions?
>
> Does manual fencing using stonith_admin work?

And check that your sbd is new enough (1.3.1 to be on the
safe side); otherwise it won't work properly with 2-node
enabled in corosync.
But this wouldn't explain your problem - it would rather
be the other way round: you would still have access to
the device, while it might not be assured that the sbd-fenced
node properly watchdog-suicides in case it loses
access to the storage.
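
A quick way to check both points, assuming a Debian or RPM based install
and corosync 2.x with votequorum (command names only, output will vary):

    dpkg -s sbd | grep Version        # or: rpm -q sbd
    corosync-cmapctl | grep two_node  # shows whether 2-node mode is enabled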

Regards,
Klaus
 
>
>> group g_gfs2 p_dlm_controld p_fs_gfs2
>> clone cl_gfs2 g_gfs2 \
>> meta interleave=true target-role=Started
>> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>> property cib-bootstrap-options: \
>> have-watchdog=false \
>> dc-version=1.1.16-94ff4df \
>> cluster-infrastructure=corosync \
>> cluster-name=zeta \
>> last-lrm-refresh=1525523370 \
>> stonith-enabled=true \
>> stonith-timeout=20s
>>
>> Any pointers would be appreciated. I feel like this should be working but
>> I'm not sure if I've missed something.
>>
>> Thanks,
>>
>> Jason
>>
>>
>>


Re: [ClusterLabs] DLM fencing

2018-05-23 Thread Andrei Borzenkov
24.05.2018 02:57, Jason Gauthier wrote:
> I'm fairly new to clustering under Linux.  I basically have one shared
> storage resource right now, using dlm and gfs2.
> I'm using fibre channel and when both of my nodes are up (2 node cluster)
> dlm and gfs2 seem to be operating perfectly.
> If I reboot node B, node A works fine and vice-versa.
> 
> When node B goes offline unexpectedly and becomes unclean, dlm seems to
> block all IO to the shared storage.
> 
> dlm knows node B is down:
> 
> # dlm_tool status
> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
> daemon now 865695 fence_pid 18186
> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence
> 0 now 1527119524
> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
> 
> on the same server, I see these messages in my daemon.log
> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick
> (reboot) node 1084772369/(null) : No route to host (-113)
> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid
> 1084772369
> 
> I can recover from the situation by forcing it (or bring the other node
> back online)
> dlm_tool fence_ack 1084772369
> 
> cluster config is pretty straightforward.
> node 1084772368: alpha
> node 1084772369: beta
> primitive p_dlm_controld ocf:pacemaker:controld \
> op monitor interval=60 timeout=60 \
> meta target-role=Started \
> params args="-K -L -s 1"
> primitive p_fs_gfs2 Filesystem \
> params device="/dev/sdb2" directory="/vms" fstype=gfs2
> primitive stonith_sbd stonith:external/sbd \
> params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
> meta target-role=Started

What is the status of stonith resource? Did you configure SBD fencing
properly?  Is sbd daemon up and running with proper parameters? What is
output of

sbd -d /dev/sdb1 dump
sbd -d /dev/sdb1 list

on both nodes? Does

sbd -d /dev/sdb1 message <node> test

work in both directions?

Does manual fencing using stonith_admin work?
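
For example (run from the surviving node; option names per stonith_admin(8),
node name taken from the configuration below):

    stonith_admin --list-registered   # the stonith_sbd device should show up
    stonith_admin --reboot beta       # should fence the peer via sbd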

> group g_gfs2 p_dlm_controld p_fs_gfs2
> clone cl_gfs2 g_gfs2 \
> meta interleave=true target-role=Started
> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.16-94ff4df \
> cluster-infrastructure=corosync \
> cluster-name=zeta \
> last-lrm-refresh=1525523370 \
> stonith-enabled=true \
> stonith-timeout=20s
> 
> Any pointers would be appreciated. I feel like this should be working but
> I'm not sure if I've missed something.
> 
> Thanks,
> 
> Jason
> 
> 
> 


Re: [ClusterLabs] DLM fencing

2016-03-29 Thread Eric Ren

Hello Ferenc,

I just want to share some thoughts, AFAIC.



I've meant to explore this connection for a long time, but never found much
useful material on the subject.  How does DLM fencing fit into the
modern Pacemaker architecture?  Fencing is a confusing topic in itself


Yes, unfortunately, the best material so far may be the source code, such
as the resource agent script for ocf:pacemaker:controld
(/usr/lib/ocf/resource.d/pacemaker/controld) and libdlm/dlm_controld/, etc.




already (fence_legacy, fence_pcmk, stonith, stonithd, stonith_admin),
then dlm_controld can use dlm_stonith to proxy fencing requests to
Pacemaker, and it becomes hopeless... :)


What you said about dlm_stonith is true. It just invokes a Pacemaker API
to tell Pacemaker who should be fenced and when. Pacemaker does the heavy
lifting; depending on which fencing method is configured, I guess the
fencing request will finally reach its destination - a fencing resource agent.
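
To make that concrete, here is a minimal sketch of what such a proxy helper
could look like in C. It assumes Pacemaker's <crm/stonith-ng.h> exports
stonith_api_time() and stonith_api_kick() with roughly the signatures used
below; the real dlm_stonith lives in the dlm sources, so treat this as an
illustration, not the actual implementation:

    /* kick_sketch.c - ask Pacemaker to fence a node on behalf of dlm_controld */
    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <crm/stonith-ng.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <nodeid>\n", argv[0]);
            return 1;
        }
        uint32_t nodeid = (uint32_t) strtoul(argv[1], NULL, 10);

        /* When was this node last fenced, if ever? (assumed: 0 means never) */
        time_t last = stonith_api_time(nodeid, NULL, 0);
        fprintf(stderr, "nodeid %" PRIu32 " last fenced at %ld\n",
                nodeid, (long) last);

        /* Ask Pacemaker's fencer to kick (reboot) the node.  dlm_controld
         * keeps the lockspace blocked until a kick like this succeeds. */
        int rc = stonith_api_kick(nodeid, NULL, 120, 0 /* reboot, not off (assumed) */);
        if (rc != 0) {
            fprintf(stderr, "kick of nodeid %" PRIu32 " failed: %d\n", nodeid, rc);
            return 1;
        }
        return 0;
    }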


I'm just starting to learn about the way these components cooperate.
Could you share any updates if you learn something? I would be grateful ;-)




I'd be grateful for a pointer to a good overview document, or a quick
sketch if you can spare the time.  To invoke some concrete questions:
When does DLM fence a node?  Is it necessary only when there's no


Limiting "fencing" here to DLM itself: DLM will actively request fencing
when an uncontrolled lockspace has been found in the kernel. Only a reboot
can make that node clean again.



resource manager running on the cluster?  Does it matter whether
dlm_controld is run as a standalone daemon or as a controld resource?


According to DLM's man pages and the code I've read, DLM gives us two
options: daemon fencing (I guess "daemon" is short for dlm_daemon) and
dlm_stonith. Daemon fencing is handled by DLM itself, which has
configuration (man dlm.conf) and lots of code
(dlm/dlm_controld/{fence*|daemon_cpg.c}) for it. But I have never
configured DLM fencing myself; by default it (may) use dlm_stonith as a
proxy and then let Pacemaker do the rest...
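
As a rough illustration of the two options in dlm.conf terms (keywords
recalled from dlm.conf(5) and not verified here - check the man page on
your distribution before using either form):

    # option 1: let dlm_controld drive a fence agent directly
    device  apc /usr/sbin/fence_apc ipaddr=pdu login=admin passwd=secret
    connect apc node=1 port=1
    connect apc node=2 port=2

    # option 2 (common default with Pacemaker): proxy requests to Pacemaker
    fence_all /usr/sbin/dlm_stonith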


So I think corosync, which provides membership knowledge, is a must for
DLM, but Pacemaker is optional if DLM fencing has been configured and you
don't want any other resource agents - which I also never tried ;-)



Wouldn't Pacemaker fence a failing node itself all the same?  Or is
dlm_stonith for the case when only the stonithd component of Pacemaker
is active somehow?



Please correct me if there's any problem.

Eric



Re: [ClusterLabs] DLM fencing

2016-02-08 Thread Digimer
On 08/02/16 03:55 PM, G Spot wrote:
> Hi Ken,
> 
> I am trying to create shared storage with clvm/gfs2, and when I try to
> fence I only see the scsi option, but my storage is connected through FC.
> Is there any other way I can fence with my 1G stonith device, other than scsi?

Fencing of a lost node with clvmd/gfs2 is no different from normal
cluster fencing. To be clear, DLM does NOT fence; it simply waits for
the cluster to fence. So you can use IPMI, switched PDUs, or whatever
else is available in your environment.
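
For example, IPMI-based fencing could be configured along these lines (crm
shell syntax matching the rest of this thread; the addresses, credentials,
and even the parameter names are illustrative - check the fence_ipmilan
metadata on your version for the exact option names):

    primitive fence_alpha stonith:fence_ipmilan \
            params pcmk_host_list=alpha ipaddr=10.0.0.11 login=admin \
                   passwd=secret lanplus=true \
            op monitor interval=60s
    primitive fence_beta stonith:fence_ipmilan \
            params pcmk_host_list=beta ipaddr=10.0.0.12 login=admin \
                   passwd=secret lanplus=true \
            op monitor interval=60s
    # keep each fence device off the node it is meant to kill
    location l_fence_alpha fence_alpha -inf: alpha
    location l_fence_beta fence_beta -inf: beta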

> On Mon, Feb 8, 2016 at 2:03 PM, Digimer wrote:
> 
> On 08/02/16 01:56 PM, Ferenc Wágner wrote:
> > Ken Gaillot writes:
> >
> >> On 02/07/2016 12:21 AM, G Spot wrote:
> >>
> >>> Thanks for your response. I am using the ocf:pacemaker:controld resource
> >>> agent and stonith-enabled=false; do I need to configure a stonith device
> >>> to make this work?
> >>
> >> Correct. DLM requires access to fencing.
> >
> > I've meant to explore this connection for a long time, but never found much
> > useful material on the subject.  How does DLM fencing fit into the
> > modern Pacemaker architecture?  Fencing is a confusing topic in itself
> > already (fence_legacy, fence_pcmk, stonith, stonithd, stonith_admin),
> > then dlm_controld can use dlm_stonith to proxy fencing requests to
> > Pacemaker, and it becomes hopeless... :)
> >
> > I'd be grateful for a pointer to a good overview document, or a quick
> > sketch if you can spare the time.  To invoke some concrete questions:
> > When does DLM fence a node?  Is it necessary only when there's no
> > resource manager running on the cluster?  Does it matter whether
> > dlm_controld is run as a standalone daemon or as a controld resource?
> > Wouldn't Pacemaker fence a failing node itself all the same?  Or is
> > dlm_stonith for the case when only the stonithd component of Pacemaker
> > is active somehow?
> 
> DLM is a thing unto itself, and some tools like gfs2 and clustered-lvm
> use it to coordinate locking across the cluster. If a node drops out,
> the cluster informs dlm and it blocks until the lost node is confirmed
> fenced. Then it reaps the lost locks and recovery can begin.
> 
> If fencing fails or is not configured, DLM never unblocks and anything
> using it is left hung (by design, better to hang than risk corruption).
> 
> One of many reasons why fencing is critical.
> 
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org