Re: [ClusterLabs] VIP monitoring failing with Timed Out error

2015-10-30 Thread Jan Pokorný
On 29/10/15 14:10 +0100, Jan Pokorný wrote:
> On 29/10/15 15:27 +0530, Pritam Kharat wrote:
>> When I ran ocf-tester to test the IPaddr2 agent
>> 
>> ocf-tester -n sc_vip -o ip=192.168.20.188 -o cidr_netmask=24 -o nic=eth0 \
>>     /usr/lib/ocf/resource.d/heartbeat/IPaddr2
>> 
>> I got this error: "ERROR: Setup problem: couldn't find command: ip"
>> in test_command monitor.  I verified that the ip command is there, but
>> the error persists. What might be the reason for it? Is this okay?
>> 
>> + ip_validate
>> + check_binary ip
>> + have_binary ip
>> + '[' 1 = 1 ']'
>> + false
> 
> Your environment may be tainted with a variable that should only be
> set in a special testing mode, where it injects the error of a
> particular helper binary being missing.
> 
> Can you please try "unset OCF_TESTER_FAIL_HAVE_BINARY" to sanitize
> your environment first?  Of course, if you are sure this variable is
> not set in the context of IPaddr2 agent invocations, the problem
> is elsewhere.

Btw. it might be worth considering whether pacemaker should restrict
the environment variables for invocation of the resources, just as
systemd does [1], so as to prevent accidental changes in their
behavior like with OCF_TESTER_FAIL_HAVE_BINARY vs. IPaddr2.

[1] http://www.freedesktop.org/software/systemd/man/systemd.exec.html#Environment%20variables%20in%20spawned%20processes
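
As a rough illustration of the idea (this is not something pacemaker is
claimed to do here), an agent can be started by hand with a whitelisted
environment via "env -i".  The helper name run_sanitized and the choice
to pass through only PATH are assumptions made for this sketch:

```shell
#!/bin/sh
# Hypothetical helper: run a command with an emptied environment,
# keeping only PATH plus any NAME=VALUE pairs given before the command.
run_sanitized() {
    env -i PATH="${PATH:-/usr/sbin:/usr/bin:/sbin:/bin}" "$@"
}

# Example invocation (commented out; it needs the agent installed):
# run_sanitized OCF_ROOT=/usr/lib/ocf OCF_RESOURCE_INSTANCE=sc_vip \
#     OCF_RESKEY_ip=192.168.20.188 OCF_RESKEY_cidr_netmask=24 \
#     OCF_RESKEY_nic=eth0 \
#     /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
```

With such a wrapper, a stray OCF_TESTER_FAIL_HAVE_BINARY exported in the
calling shell never reaches the agent, because only the variables named
on the command line are passed down.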

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] VIP monitoring failing with Timed Out error

2015-10-29 Thread Jan Pokorný
On 29/10/15 15:27 +0530, Pritam Kharat wrote:
> When I ran ocf-tester to test the IPaddr2 agent
> 
> ocf-tester -n sc_vip -o ip=192.168.20.188 -o cidr_netmask=24 -o nic=eth0 \
>     /usr/lib/ocf/resource.d/heartbeat/IPaddr2
> 
> I got this error: "ERROR: Setup problem: couldn't find command: ip"
> in test_command monitor.  I verified that the ip command is there, but
> the error persists. What might be the reason for it? Is this okay?
> 
> + ip_validate
> + check_binary ip
> + have_binary ip
> + '[' 1 = 1 ']'
> + false

Your environment may be tainted with a variable that should only be
set in a special testing mode, where it injects the error of a
particular helper binary being missing.

Can you please try "unset OCF_TESTER_FAIL_HAVE_BINARY" to sanitize
your environment first?  Of course, if you are sure this variable is
not set in the context of IPaddr2 agent invocations, the problem
is elsewhere.
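
For illustration: the "'[' 1 = 1 ']'" line followed by "false" in the
trace is what a guard on that variable would look like under "sh -x".
Below is a minimal sketch of such a guard, plus a check of the current
shell.  The function names are made up for this mail, and the body is a
simplification, not the code shipped with resource-agents:

```shell
#!/bin/sh
# Simplified stand-in for the agent's binary check (hypothetical name).
have_binary_sketch() {
    # Fault-injection switch: when set to 1, pretend the binary is missing.
    if [ "${OCF_TESTER_FAIL_HAVE_BINARY:-}" = "1" ]; then
        return 1
    fi
    # Otherwise succeed only if the command can actually be found.
    command -v "$1" >/dev/null 2>&1
}

# Report whether the variable is present in this shell at all.
report_taint() {
    if [ -n "${OCF_TESTER_FAIL_HAVE_BINARY+set}" ]; then
        echo "OCF_TESTER_FAIL_HAVE_BINARY is set to '${OCF_TESTER_FAIL_HAVE_BINARY}'"
    else
        echo "OCF_TESTER_FAIL_HAVE_BINARY is not set"
    fi
}

report_taint
```

If this reports the variable as set, unset it in the same shell and
rerun the ocf-tester command from there.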

-- 
Jan (Poki)




Re: [ClusterLabs] VIP monitoring failing with Timed Out error

2015-10-29 Thread Dejan Muhamedagic
Hi,

On Thu, Oct 29, 2015 at 10:40:18AM +0530, Pritam Kharat wrote:
> Thank you very much Ken for reply. I will try your suggested steps.

If you cannot figure out from the logs why the stop operation
times out, you can also try to trace the resource agent:

# crm resource help trace
# crm resource trace vip stop

Then take a look at the trace or post it somewhere.
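
To find the resulting trace afterwards, something like the sketch below
may help.  It assumes the traces end up as
<trace_dir>/<agent>/<resource>.<action>.<timestamp>, with the trace
directory often being /var/lib/heartbeat/trace_ra; both the layout and
the path are assumptions to verify on your system, and newest_trace is
just a made-up helper name:

```shell
#!/bin/sh
# Print the most recently modified trace file for a resource/action pair.
# Assumed layout: <trace_dir>/<agent>/<resource>.<action>.<timestamp>
newest_trace() {
    trace_dir=$1
    resource=$2
    action=$3
    # ls -t sorts newest first; errors are silenced when nothing matches.
    ls -t "$trace_dir"/*/"$resource.$action."* 2>/dev/null | head -n 1
}

# Example (commented out; the path is an assumption):
# newest_trace /var/lib/heartbeat/trace_ra vip stop
```

Once you have the trace, "crm resource untrace vip stop" switches the
tracing off again.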

Thanks,

Dejan

> 
> On Wed, Oct 28, 2015 at 11:23 PM, Ken Gaillot wrote:
> 
> > On 10/28/2015 03:51 AM, Pritam Kharat wrote:
> > > Hi All,
> > >
> > > I am facing one issue in my two node HA. When I stop pacemaker on ACTIVE
> > > node, it takes more time to stop and by this time VIP migration with other
> > > resources migration fails to STANDBY node. (I have seen same issue in
> > > ACTIVE node reboot case also)
> >
> > I assume STANDBY in this case is just a description of the node's
> > purpose, and does not mean that you placed the node in pacemaker's
> > standby mode. If the node really is in standby mode, it can't run any
> > resources.
> >
> > > Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
> > > Stack: corosync
> > > Current DC: node-1 (1) - partition with quorum
> > > Version: 1.1.10-42f2063
> > > 2 Nodes configured
> > > 2 Resources configured
> > >
> > >
> > > Online: [ node-1 node-2 ]
> > >
> > > Full list of resources:
> > >
> > >  resource (upstart:resource): Stopped
> > >  vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED
> > >
> > > Migration summary:
> > > * Node node-1:
> > > * Node node-2:
> > >
> > > Failed actions:
> > > vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
> > > last-rc-change=Wed Oct 28 03:05:24 2015
> > > , queued=0ms, exec=0ms
> > > ): unknown error
> > >
> > > VIP monitor is failing over here with error Timed Out. What is the general
> > > reason for TimeOut. ? I have kept default-action-timeout=180secs which
> > > should be enough for monitoring
> >
> > 180s should be far more than enough, so something must be going wrong.
> > Notice that it is the stop operation on the active node that is failing.
> > Normally in such a case, pacemaker would fence that node to be sure that
> > it is safe to bring it up elsewhere, but you have disabled stonith.
> >
> > Fencing is important in failure recovery such as this, so it would be a
> > good idea to try to get it implemented.
> >
> > > I have added order property -> when vip is started then only start other
> > > resources.
> > > Any clue to solve this problem ? Most of the time this VIP monitoring is
> > > failing with Timed Out error.
> >
> > The "stop" in "vip_stop_0" means that the stop operation is what failed.
> > Have you seen timeouts on any other operations?
> >
> > Look through the logs around the time of the failure, and try to see if
> > there are any indications as to why the stop failed.
> >
> > If you can set aside some time for testing or have a test cluster that
> > exhibits the same issue, you can try unmanaging the resource in
> > pacemaker, then:
> >
> > 1. Try adding/removing the IP via normal system commands, and make sure
> > that works.
> >
> > 2. Try running the resource agent manually (with any verbose option) to
> > start/stop/monitor the IP to see if you can reproduce the problem and
> > get more messages.
> >
> 
> 
> 
> -- 
> Thanks and Regards,
> Pritam Kharat.



Re: [ClusterLabs] VIP monitoring failing with Timed Out error

2015-10-28 Thread Ken Gaillot
On 10/28/2015 03:51 AM, Pritam Kharat wrote:
> Hi All,
> 
> I am facing one issue in my two node HA. When I stop pacemaker on ACTIVE
> node, it takes more time to stop and by this time VIP migration with other
> resources migration fails to STANDBY node. (I have seen same issue in
> ACTIVE node reboot case also)

I assume STANDBY in this case is just a description of the node's
purpose, and does not mean that you placed the node in pacemaker's
standby mode. If the node really is in standby mode, it can't run any
resources.

> Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
> Stack: corosync
> Current DC: node-1 (1) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 2 Resources configured
> 
> 
> Online: [ node-1 node-2 ]
> 
> Full list of resources:
> 
>  resource (upstart:resource): Stopped
>  vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED
> 
> Migration summary:
> * Node node-1:
> * Node node-2:
> 
> Failed actions:
> vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
> last-rc-change=Wed Oct 28 03:05:24 2015
> , queued=0ms, exec=0ms
> ): unknown error
> 
> VIP monitor is failing over here with error Timed Out. What is the general
> reason for TimeOut. ? I have kept default-action-timeout=180secs which
> should be enough for monitoring

180s should be far more than enough, so something must be going wrong.
Notice that it is the stop operation on the active node that is failing.
Normally in such a case, pacemaker would fence that node to be sure that
it is safe to bring it up elsewhere, but you have disabled stonith.

Fencing is important in failure recovery such as this, so it would be a
good idea to try to get it implemented.

> I have added order property -> when vip is started then only start other
> resources.
> Any clue to solve this problem ? Most of the time this VIP monitoring is
> failing with Timed Out error.

The "stop" in "vip_stop_0" means that the stop operation is what failed.
Have you seen timeouts on any other operations?

Look through the logs around the time of the failure, and try to see if
there are any indications as to why the stop failed.
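
As a starting point for the log search, a small filter like the one
below can pull out the lines that mention the stop operation together
with a little context.  The helper name stop_events is made up, and the
log file location differs between distributions, so it is passed in as
an argument rather than assumed:

```shell
#!/bin/sh
# Show log lines that mention a resource's stop operation, with context.
# $1: log file to search (location varies; adjust for your system)
# $2: resource name, e.g. vip
stop_events() {
    # Operation keys look like <resource>_stop_<interval>, e.g. vip_stop_0.
    grep -n -C 2 -- "${2}_stop_" "$1"
}

# Example (commented out; the path is only a guess for your system):
# stop_events /var/log/syslog vip
```

The lines just before the timeout on the node that ran the stop are
usually the most interesting ones.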

If you can set aside some time for testing or have a test cluster that
exhibits the same issue, you can try unmanaging the resource in
pacemaker, then:

1. Try adding/removing the IP via normal system commands, and make sure
that works.

2. Try running the resource agent manually (with any verbose option) to
start/stop/monitor the IP to see if you can reproduce the problem and
get more messages.
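
The two steps could be sketched roughly as follows.  The address,
netmask and interface are the ones mentioned elsewhere in this thread,
the resource name vip comes from the status output, and running the
agent under "sh -x" assumes it is a plain shell script; treat all of
these as assumptions to adapt.  The script only prints the commands
unless DRY_RUN=0 is set, since the real commands need root:

```shell
#!/bin/sh
# Values taken from this thread; adjust to the cluster under test.
VIP=192.168.20.188
PREFIX=24
NIC=eth0
AGENT=/usr/lib/ocf/resource.d/heartbeat/IPaddr2

# Print each command by default; set DRY_RUN=0 (as root) to execute.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: add and remove the address with plain iproute2 commands.
manual_ip_test() {
    run ip addr add "$VIP/$PREFIX" dev "$NIC"
    run ip addr show dev "$NIC"
    run ip addr del "$VIP/$PREFIX" dev "$NIC"
}

# Step 2: call the agent directly, passing parameters the OCF way
# (OCF_RESKEY_<name>), with shell tracing for extra detail.
agent_action() {
    run env OCF_ROOT=/usr/lib/ocf OCF_RESOURCE_INSTANCE=vip \
        OCF_RESKEY_ip="$VIP" OCF_RESKEY_cidr_netmask="$PREFIX" \
        OCF_RESKEY_nic="$NIC" sh -x "$AGENT" "$1"
}

manual_ip_test
for action in start monitor stop; do
    agent_action "$action"
done
```

Run it only while the resource is unmanaged ("crm resource unmanage
vip"), and hand the resource back with "crm resource manage vip" when
done.  If the stop action hangs here too, the trace output should show
the command it is stuck on.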



[ClusterLabs] VIP monitoring failing with Timed Out error

2015-10-28 Thread Pritam Kharat
Hi All,

I am facing an issue in my two-node HA cluster. When I stop pacemaker on
the ACTIVE node, it takes a long time to stop, and in the meantime the
migration of the VIP and the other resources to the STANDBY node fails.
(I have seen the same issue when the ACTIVE node is rebooted.)


Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
Stack: corosync
Current DC: node-1 (1) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
2 Resources configured


Online: [ node-1 node-2 ]

Full list of resources:

 resource (upstart:resource): Stopped
 vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED

Migration summary:
* Node node-1:
* Node node-2:

Failed actions:
vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
last-rc-change=Wed Oct 28 03:05:24 2015
, queued=0ms, exec=0ms
): unknown error

The VIP monitor is failing here with a Timed Out error. What is the usual
reason for a timeout? I have set default-action-timeout=180secs, which
should be enough for monitoring.
I have added an ordering constraint so that the other resources are
started only after vip has started.
Any clue how to solve this problem? Most of the time this VIP monitoring
fails with a Timed Out error.

-- 
Thanks and Regards,
Pritam Kharat.