Re: [ClusterLabs] VIP monitoring failing with Timed Out error
On 29/10/15 14:10 +0100, Jan Pokorný wrote:
> On 29/10/15 15:27 +0530, Pritam Kharat wrote:
>> When I ran ocf-tester to test IPaddr2 agent
>>
>> ocf-tester -n sc_vip -o ip=192.168.20.188 -o cidr_netmask=24 -o nic=eth0
>> /usr/lib/ocf/resource.d/heartbeat/IPaddr2
>>
>> I got this error - ERROR: Setup problem: couldn't find command: ip
>> in test_command monitor. I verified ip command is there but still
>> this error. What might be the reason for this error ? Is this okay ?
>>
>> + ip_validate
>> + check_binary ip
>> + have_binary ip
>> + '[' 1 = 1 ']'
>> + false
>
> It may be the case that you have the environment tainted with
> a variable that should only be set in a special testing mode
> injecting an error of the particular helper binary missing.
>
> Can you please try "unset OCF_TESTER_FAIL_HAVE_BINARY" to sanitize
> your environment first? Indeed, if you don't have this variable set
> for sure in the context of IPAddr2 agent invocations, the problem
> is elsewhere.

Btw. it might be worth considering whether pacemaker should restrict
the environment variables for invocation of the resources, just as
systemd does [1], so as to prevent accidental changes in their
behavior like with OCF_TESTER_FAIL_HAVE_BINARY vs. IPaddr2.

[1] http://www.freedesktop.org/software/systemd/man/systemd.exec.html#Environment%20variables%20in%20spawned%20processes

--
Jan (Poki)

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
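The idea of restricting the environment can be tried by hand, independent of any change in pacemaker. What follows is a minimal sketch, assuming a POSIX shell and the standard `env -i`. The name `run_clean` is a hypothetical helper made up for this illustration (it is not part of pacemaker or resource-agents), and passing only PATH through is an arbitrary choice of allow-list.

```shell
#!/bin/sh
# Simulate a shell whose environment carries the stray test variable.
OCF_TESTER_FAIL_HAVE_BINARY=1
export OCF_TESTER_FAIL_HAVE_BINARY

# Hypothetical helper: run a command with an empty environment plus PATH.
# Extra VAR=value words given before the command are passed through by env.
run_clean() {
    env -i PATH="$PATH" "$@"
}

# Inherited environment: the child process sees the stray variable.
sh -c 'echo "inherited: ${OCF_TESTER_FAIL_HAVE_BINARY:-unset}"'

# Scrubbed environment: the child process does not.
run_clean sh -c 'echo "scrubbed: ${OCF_TESTER_FAIL_HAVE_BINARY:-unset}"'
```

An agent started this way would still need its own inputs handed over explicitly, presumably OCF_ROOT and the OCF_RESKEY_* parameters, written as VAR=value words in front of the agent path.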
Re: [ClusterLabs] VIP monitoring failing with Timed Out error
On 29/10/15 15:27 +0530, Pritam Kharat wrote:
> When I ran ocf-tester to test IPaddr2 agent
>
> ocf-tester -n sc_vip -o ip=192.168.20.188 -o cidr_netmask=24 -o nic=eth0
> /usr/lib/ocf/resource.d/heartbeat/IPaddr2
>
> I got this error - ERROR: Setup problem: couldn't find command: ip
> in test_command monitor. I verified ip command is there but still
> this error. What might be the reason for this error ? Is this okay ?
>
> + ip_validate
> + check_binary ip
> + have_binary ip
> + '[' 1 = 1 ']'
> + false

It may be the case that you have the environment tainted with
a variable that should only be set in a special testing mode
injecting an error of the particular helper binary missing.

Can you please try "unset OCF_TESTER_FAIL_HAVE_BINARY" to sanitize
your environment first? Indeed, if you don't have this variable set
for sure in the context of IPAddr2 agent invocations, the problem
is elsewhere.

--
Jan (Poki)
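The trace lines `'[' 1 = 1 ']'` followed by `false` are what a forced-failure check of roughly the following shape would produce. This is a sketch only: `have_binary_sketch` is a simplified stand-in written for illustration, not the code shipped with the resource agents, and it assumes the variable triggers the failure when set to 1, as the trace suggests.

```shell
#!/bin/sh
# Simplified stand-in for the agent's helper check (an assumption based
# on the trace in this thread, not the actual library code).
have_binary_sketch() {
    if [ "$OCF_TESTER_FAIL_HAVE_BINARY" = "1" ]; then
        # Test mode: pretend the binary is missing, whatever PATH says.
        return 1
    fi
    # Normal mode: succeed only if the command can be found.
    command -v "$1" >/dev/null 2>&1
}

# With the variable set, even a binary that exists is reported missing.
OCF_TESTER_FAIL_HAVE_BINARY=1
have_binary_sketch sh || echo "sh reported missing (forced failure)"

# After unsetting it, the real lookup result is returned.
unset OCF_TESTER_FAIL_HAVE_BINARY
have_binary_sketch sh && echo "sh found"
```

Running `env | grep OCF_TESTER` in the shell that starts ocf-tester is a quick way to see whether such a variable is present at all.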
Re: [ClusterLabs] VIP monitoring failing with Timed Out error
Hi,

On Thu, Oct 29, 2015 at 10:40:18AM +0530, Pritam Kharat wrote:
> Thank you very much Ken for reply. I will try your suggested steps.

If you cannot figure out from the logs why the stop operation
times out, you can also try to trace the resource agent:

# crm resource help trace
# crm resource trace vip stop

Then take a look at the trace or post it somewhere.

Thanks,

Dejan

>
> On Wed, Oct 28, 2015 at 11:23 PM, Ken Gaillot wrote:
>
> > On 10/28/2015 03:51 AM, Pritam Kharat wrote:
> > > Hi All,
> > >
> > > I am facing one issue in my two node HA. When I stop pacemaker on ACTIVE
> > > node, it takes more time to stop and by this time VIP migration with
> > other
> > > resources migration fails to STANDBY node. (I have seen same issue in
> > > ACTIVE node reboot case also)
> >
> > I assume STANDBY in this case is just a description of the node's
> > purpose, and does not mean that you placed the node in pacemaker's
> > standby mode. If the node really is in standby mode, it can't run any
> > resources.
> >
> > > Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
> > > Stack: corosync
> > > Current DC: node-1 (1) - partition with quorum
> > > Version: 1.1.10-42f2063
> > > 2 Nodes configured
> > > 2 Resources configured
> > >
> > >
> > > Online: [ node-1 node-2 ]
> > >
> > > Full list of resources:
> > >
> > > resource (upstart:resource): Stopped
> > > vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED
> > >
> > > Migration summary:
> > > * Node node-1:
> > > * Node node-2:
> > >
> > > Failed actions:
> > > vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
> > > last-rc-change=Wed Oct 28 03:05:24 2015
> > > , queued=0ms, exec=0ms
> > > ): unknown error
> > >
> > > VIP monitor is failing over here with error Timed Out. What is the
> > general
> > > reason for TimeOut. ? I have kept default-action-timeout=180secs which
> > > should be enough for monitoring
> >
> > 180s should be far more than enough, so something must be going wrong.
> > Notice that it is the stop operation on the active node that is failing.
> > Normally in such a case, pacemaker would fence that node to be sure that
> > it is safe to bring it up elsewhere, but you have disabled stonith.
> >
> > Fencing is important in failure recovery such as this, so it would be a
> > good idea to try to get it implemented.
> >
> > > I have added order property -> when vip is started then only start other
> > > resources.
> > > Any clue to solve this problem ? Most of the time this VIP monitoring is
> > > failing with Timed Out error.
> >
> > The "stop" in "vip_stop_0" means that the stop operation is what failed.
> > Have you seen timeouts on any other operations?
> >
> > Look through the logs around the time of the failure, and try to see if
> > there are any indications as to why the stop failed.
> >
> > If you can set aside some time for testing or have a test cluster that
> > exhibits the same issue, you can try unmanaging the resource in
> > pacemaker, then:
> >
> > 1. Try adding/removing the IP via normal system commands, and make sure
> > that works.
> >
> > 2. Try running the resource agent manually (with any verbose option) to
> > start/stop/monitor the IP to see if you can reproduce the problem and
> > get more messages.
> >
>
> --
> Thanks and Regards,
> Pritam Kharat.
Re: [ClusterLabs] VIP monitoring failing with Timed Out error
On 10/28/2015 03:51 AM, Pritam Kharat wrote:
> Hi All,
>
> I am facing one issue in my two node HA. When I stop pacemaker on ACTIVE
> node, it takes more time to stop and by this time VIP migration with other
> resources migration fails to STANDBY node. (I have seen same issue in
> ACTIVE node reboot case also)

I assume STANDBY in this case is just a description of the node's
purpose, and does not mean that you placed the node in pacemaker's
standby mode. If the node really is in standby mode, it can't run any
resources.

> Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
> Stack: corosync
> Current DC: node-1 (1) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 2 Resources configured
>
>
> Online: [ node-1 node-2 ]
>
> Full list of resources:
>
> resource (upstart:resource): Stopped
> vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED
>
> Migration summary:
> * Node node-1:
> * Node node-2:
>
> Failed actions:
> vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
> last-rc-change=Wed Oct 28 03:05:24 2015
> , queued=0ms, exec=0ms
> ): unknown error
>
> VIP monitor is failing over here with error Timed Out. What is the general
> reason for TimeOut. ? I have kept default-action-timeout=180secs which
> should be enough for monitoring

180s should be far more than enough, so something must be going wrong.
Notice that it is the stop operation on the active node that is failing.
Normally in such a case, pacemaker would fence that node to be sure that
it is safe to bring it up elsewhere, but you have disabled stonith.

Fencing is important in failure recovery such as this, so it would be a
good idea to try to get it implemented.

> I have added order property -> when vip is started then only start other
> resources.
> Any clue to solve this problem ? Most of the time this VIP monitoring is
> failing with Timed Out error.

The "stop" in "vip_stop_0" means that the stop operation is what failed.
Have you seen timeouts on any other operations?

Look through the logs around the time of the failure, and try to see if
there are any indications as to why the stop failed.

If you can set aside some time for testing or have a test cluster that
exhibits the same issue, you can try unmanaging the resource in
pacemaker, then:

1. Try adding/removing the IP via normal system commands, and make sure
that works.

2. Try running the resource agent manually (with any verbose option) to
start/stop/monitor the IP to see if you can reproduce the problem and
get more messages.
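The two manual checks above could look roughly like the following sketch. It assumes the address, netmask, interface and agent path from the ocf-tester command quoted elsewhere in this thread, and that the resource has already been unmanaged. The `run` helper and the DRY_RUN switch are hypothetical additions for illustration: by default the script only prints the commands, and setting DRY_RUN=0 (as root, on a test node) would execute them.

```shell
#!/bin/sh
# Values taken from the ocf-tester invocation in this thread.
VIP=192.168.20.188
PREFIX=24
NIC=eth0
AGENT=/usr/lib/ocf/resource.d/heartbeat/IPaddr2

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Add, inspect and remove the address with plain iproute2 commands.
run ip addr add "$VIP/$PREFIX" dev "$NIC"
run ip addr show dev "$NIC"
run ip addr del "$VIP/$PREFIX" dev "$NIC"

# 2. Call the resource agent by hand; OCF agents read their parameters
#    from OCF_RESKEY_* environment variables.
for action in start monitor stop; do
    run env OCF_ROOT=/usr/lib/ocf OCF_RESKEY_ip="$VIP" \
        OCF_RESKEY_cidr_netmask="$PREFIX" OCF_RESKEY_nic="$NIC" \
        "$AGENT" "$action"
done
```

Timing the stop call (for example by prefixing it with `time`) should show whether the agent itself is slow or whether the delay comes from somewhere else in the stack.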
[ClusterLabs] VIP monitoring failing with Timed Out error
Hi All,

I am facing one issue in my two node HA. When I stop pacemaker on ACTIVE
node, it takes more time to stop and by this time VIP migration with other
resources migration fails to STANDBY node. (I have seen same issue in
ACTIVE node reboot case also)

Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
Stack: corosync
Current DC: node-1 (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
2 Resources configured


Online: [ node-1 node-2 ]

Full list of resources:

 resource (upstart:resource): Stopped
 vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED

Migration summary:
* Node node-1:
* Node node-2:

Failed actions:
    vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
last-rc-change=Wed Oct 28 03:05:24 2015
, queued=0ms, exec=0ms
): unknown error

VIP monitor is failing over here with error Timed Out. What is the general
reason for TimeOut. ? I have kept default-action-timeout=180secs which
should be enough for monitoring

I have added order property -> when vip is started then only start other
resources.
Any clue to solve this problem ? Most of the time this VIP monitoring is
failing with Timed Out error.

--
Thanks and Regards,
Pritam Kharat.