Re: [ClusterLabs] resource fails manual failover

2023-12-12 Thread Ken Gaillot
On Tue, 2023-12-12 at 16:50 +0300, Artem wrote:
> Is there a detailed explanation for resource monitor and start
> timeouts and intervals with examples, for dummies?

No, though Pacemaker Explained has some reference information:

https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/index.html#resource-operations

> 
> my resource is configured as follows:
> [root@lustre-mds1 ~]# pcs resource show MDT00
> Warning: This command is deprecated and will be removed. Please use
> 'pcs resource config' instead.
> Resource: MDT00 (class=ocf provider=heartbeat type=Filesystem)
>   Attributes: MDT00-instance_attributes
> device=/dev/mapper/mds00
> directory=/lustre/mds00
> force_unmount=safe
> fstype=lustre
>   Operations:
> monitor: MDT00-monitor-interval-20s
>   interval=20s
>   timeout=40s
> start: MDT00-start-interval-0s
>   interval=0s
>   timeout=60s
> stop: MDT00-stop-interval-0s
>   interval=0s
>   timeout=60s
> 
> I issued a manual failover with the following command:
> crm_resource --move -r MDT00 -H lustre-mds1
> 
> the resource tried to move but came back, with entries like these in
> pacemaker.log:
> Dec 12 15:53:23  Filesystem(MDT00)[1886100]: INFO: Running start
> for /dev/mapper/mds00 on /lustre/mds00
> Dec 12 15:53:45  Filesystem(MDT00)[1886100]: ERROR: Couldn't mount
> device [/dev/mapper/mds00] as /lustre/mds00
> 
> tried again with the same result:
> Dec 12 16:11:04  Filesystem(MDT00)[1891333]: INFO: Running start
> for /dev/mapper/mds00 on /lustre/mds00
> Dec 12 16:11:26  Filesystem(MDT00)[1891333]: ERROR: Couldn't mount
> device [/dev/mapper/mds00] as /lustre/mds00
> 
> Why can't it move?

The error is outside the cluster software, in the mount attempt itself.
The resource agent logged the ERROR above, so if you can't find more
information in the system logs, you may want to look at the agent code
to see what it's doing around that message.
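The agent is a plain shell script, so that message is easy to find. A
sketch, assuming the usual resource-agents install path (the path may
differ on your distribution):

```shell
# Sketch: locate the error message from the log in the Filesystem RA.
# /usr/lib/ocf/resource.d/heartbeat is the typical install path; adjust
# for your distro.
agent=/usr/lib/ocf/resource.d/heartbeat/Filesystem
grep -n "Couldn't mount" "$agent"
# The surrounding lines show the exact mount command the agent builds,
# which you can then repeat by hand on the failing node.
```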

> 
> Does this 20 sec interval (between start and error) have anything to
> do with monitor interval settings?

No. The monitor interval says when to schedule another recurring
monitor check after the previous one completes. The first monitor isn't
scheduled until after the start succeeds.
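For completeness, operation settings can be adjusted with pcs. A hedged
sketch using the resource and values from this thread (the 120s start
timeout is an illustrative choice, not a recommendation; size it to your
storage's worst-case mount time):

```shell
# Sketch only; resource and operation names come from the thread's config.
# If a Lustre mount can legitimately take longer than 60s (e.g. during
# recovery), raise the start timeout:
pcs resource update MDT00 op start interval=0s timeout=120s
# The recurring monitor runs every 20s after the previous check
# completes, and is first scheduled only after start succeeds:
pcs resource update MDT00 op monitor interval=20s timeout=40s
```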

> 
> [root@lustre-mgs ~]# pcs constraint show --full
> Location Constraints:
>   Resource: MDT00
> Enabled on:
>   Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
>   Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
> Disabled on:
>   Node: lustre-mgs (score:-INFINITY) (id:location-MDT00-lustre-mgs--INFINITY)
>   Node: lustre1 (score:-INFINITY) (id:location-MDT00-lustre1--INFINITY)
>   Node: lustre2 (score:-INFINITY) (id:location-MDT00-lustre2--INFINITY)
>   Node: lustre3 (score:-INFINITY) (id:location-MDT00-lustre3--INFINITY)
>   Node: lustre4 (score:-INFINITY) (id:location-MDT00-lustre4--INFINITY)
> Ordering Constraints:
>   start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
>   start MDT00 then start OST1 (kind:Optional) (id:order-MDT00-OST1-Optional)
>   start MDT00 then start OST2 (kind:Optional) (id:order-MDT00-OST2-Optional)
> 
> With regard to the ordering constraints: OST1 and OST2 are started now,
> while I'm exercising MDT00 failover.
> 
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] resource fails manual failover

2023-12-12 Thread Andrei Borzenkov
On Tue, Dec 12, 2023 at 4:50 PM Artem  wrote:
>
> Is there a detailed explanation for resource monitor and start timeouts and 
> intervals with examples, for dummies?
>
> my resource is configured as follows:
> [root@lustre-mds1 ~]# pcs resource show MDT00
> Warning: This command is deprecated and will be removed. Please use 'pcs 
> resource config' instead.
> Resource: MDT00 (class=ocf provider=heartbeat type=Filesystem)
>   Attributes: MDT00-instance_attributes
> device=/dev/mapper/mds00
> directory=/lustre/mds00
> force_unmount=safe
> fstype=lustre
>   Operations:
> monitor: MDT00-monitor-interval-20s
>   interval=20s
>   timeout=40s
> start: MDT00-start-interval-0s
>   interval=0s
>   timeout=60s
> stop: MDT00-stop-interval-0s
>   interval=0s
>   timeout=60s
>
> I issued a manual failover with the following command:
> crm_resource --move -r MDT00 -H lustre-mds1
>
> the resource tried to move but came back, with entries like these in
> pacemaker.log:
> Dec 12 15:53:23  Filesystem(MDT00)[1886100]: INFO: Running start for
> /dev/mapper/mds00 on /lustre/mds00
> Dec 12 15:53:45  Filesystem(MDT00)[1886100]: ERROR: Couldn't mount device
> [/dev/mapper/mds00] as /lustre/mds00
>
> tried again with the same result:
> Dec 12 16:11:04  Filesystem(MDT00)[1891333]: INFO: Running start for
> /dev/mapper/mds00 on /lustre/mds00
> Dec 12 16:11:26  Filesystem(MDT00)[1891333]: ERROR: Couldn't mount device
> [/dev/mapper/mds00] as /lustre/mds00
>
> Why can't it move?
>

Because the resource failed to start on the node selected to run it.
Maybe the device is missing, maybe the mount point is missing, maybe
something else.
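Those possibilities can be checked directly on the failing node, outside
Pacemaker. A sketch (device and mount point are the ones from the
thread's resource config; run it on lustre-mds1, where the start failed):

```shell
# Plain OS-level checks, independent of the cluster software:
[ -e /dev/mapper/mds00 ] || echo "device missing"
[ -d /lustre/mds00 ]     || echo "mount point missing"
# Reproduce the agent's mount by hand to see the real error text:
mount -t lustre /dev/mapper/mds00 /lustre/mds00
# Lustre reports mount failures to the kernel log:
dmesg | tail -n 20
```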

> Does this 20 sec interval (between start and error) have anything to do with 
> monitor interval settings?
>
> [root@lustre-mgs ~]# pcs constraint show --full
> Location Constraints:
>   Resource: MDT00
> Enabled on:
>   Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
>   Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
> Disabled on:
>   Node: lustre-mgs (score:-INFINITY) 
> (id:location-MDT00-lustre-mgs--INFINITY)
>   Node: lustre1 (score:-INFINITY) (id:location-MDT00-lustre1--INFINITY)
>   Node: lustre2 (score:-INFINITY) (id:location-MDT00-lustre2--INFINITY)
>   Node: lustre3 (score:-INFINITY) (id:location-MDT00-lustre3--INFINITY)
>   Node: lustre4 (score:-INFINITY) (id:location-MDT00-lustre4--INFINITY)
> Ordering Constraints:
>   start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
>   start MDT00 then start OST1 (kind:Optional) (id:order-MDT00-OST1-Optional)
>   start MDT00 then start OST2 (kind:Optional) (id:order-MDT00-OST2-Optional)
>
> With regard to the ordering constraints: OST1 and OST2 are started now,
> while I'm exercising MDT00 failover.
>


[ClusterLabs] resource fails manual failover

2023-12-12 Thread Artem
Is there a detailed explanation for resource monitor and start timeouts and
intervals with examples, for dummies?

my resource is configured as follows:
[root@lustre-mds1 ~]# pcs resource show MDT00
Warning: This command is deprecated and will be removed. Please use 'pcs
resource config' instead.
Resource: MDT00 (class=ocf provider=heartbeat type=Filesystem)
  Attributes: MDT00-instance_attributes
device=/dev/mapper/mds00
directory=/lustre/mds00
force_unmount=safe
fstype=lustre
  Operations:
monitor: MDT00-monitor-interval-20s
  interval=20s
  timeout=40s
start: MDT00-start-interval-0s
  interval=0s
  timeout=60s
stop: MDT00-stop-interval-0s
  interval=0s
  timeout=60s

I issued a manual failover with the following command:
crm_resource --move -r MDT00 -H lustre-mds1

the resource tried to move but came back, with entries like these in
pacemaker.log:
Dec 12 15:53:23  Filesystem(MDT00)[1886100]: INFO: Running start for
/dev/mapper/mds00 on /lustre/mds00
Dec 12 15:53:45  Filesystem(MDT00)[1886100]: ERROR: Couldn't mount
device [/dev/mapper/mds00] as /lustre/mds00

tried again with the same result:
Dec 12 16:11:04  Filesystem(MDT00)[1891333]: INFO: Running start for
/dev/mapper/mds00 on /lustre/mds00
Dec 12 16:11:26  Filesystem(MDT00)[1891333]: ERROR: Couldn't mount
device [/dev/mapper/mds00] as /lustre/mds00

Why can't it move?

Does this 20 sec interval (between start and error) have anything to do
with monitor interval settings?

[root@lustre-mgs ~]# pcs constraint show --full
Location Constraints:
  Resource: MDT00
Enabled on:
  Node: lustre-mds1 (score:100) (id:location-MDT00-lustre-mds1-100)
  Node: lustre-mds2 (score:100) (id:location-MDT00-lustre-mds2-100)
Disabled on:
  Node: lustre-mgs (score:-INFINITY)
(id:location-MDT00-lustre-mgs--INFINITY)
  Node: lustre1 (score:-INFINITY) (id:location-MDT00-lustre1--INFINITY)
  Node: lustre2 (score:-INFINITY) (id:location-MDT00-lustre2--INFINITY)
  Node: lustre3 (score:-INFINITY) (id:location-MDT00-lustre3--INFINITY)
  Node: lustre4 (score:-INFINITY) (id:location-MDT00-lustre4--INFINITY)
Ordering Constraints:
  start MGT then start MDT00 (kind:Optional) (id:order-MGT-MDT00-Optional)
  start MDT00 then start OST1 (kind:Optional) (id:order-MDT00-OST1-Optional)
  start MDT00 then start OST2 (kind:Optional) (id:order-MDT00-OST2-Optional)

With regard to the ordering constraints: OST1 and OST2 are started now,
while I'm exercising MDT00 failover.