Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-12 Thread Andrew Beekhof

On 10/05/2013, at 6:16 PM, pavan tc pavan...@gmail.com wrote:

 
 
  Well, and also Pacemaker's crmd process.
  My guess... the node is overloaded which is causing the cib queries to time 
  out.
 
 
  Is there a cib query timeout value that I can set?
 
 No. You can set the batch-limit property, though; this reduces the rate at 
 which CIB operations are attempted.
 
 
 Ok. But does it mean that it can still happen if the load was high at the 
 time the CIB operation is in flight?

Yes. But if the load is coming from the cluster running monitor on all the 
resources, then batch-limit will spread the work out over a longer period and 
avoid the machine being overloaded in the first place.
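
For reference, a minimal sketch of setting batch-limit with crm_attribute.
The value 10 and the DRY_RUN switch are illustrative assumptions, not values
from this thread; a suitable limit depends on the node and its resources.
As written, the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: BATCH_LIMIT=10 and the DRY_RUN switch are illustrative assumptions.
BATCH_LIMIT="${BATCH_LIMIT:-10}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Show the current value (if one is set), then set the cluster-wide property.
run crm_attribute --type crm_config --name batch-limit --query
run crm_attribute --type crm_config --name batch-limit --update "$BATCH_LIMIT"
```

With the crm shell, the equivalent should be
`crm configure property batch-limit=10`.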


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-09 Thread pavan tc
On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof and...@beekhof.net wrote:


 On 08/05/2013, at 9:16 PM, pavan tc pavan...@gmail.com wrote:


Hi Andrew,

Thanks much for looking into this. I have some queries inline.


  Hi,
 
  I have a two-node cluster with STONITH disabled.

 That's not a good idea.


Ok. I'll try and configure stonith.

 I am still running with the pcmk plugin as opposed to the recommended
 CMAN plugin.

 On rhel6?


Yes.



 
  With 1.1.8, I see some messages (appended to this mail) once in a while.
 I do not understand some keywords here - There is a Leave action. I am
 not sure what that is.

 It means the cluster is not going to change the state of the resource.


Why did the cluster execute the Leave action at this point? Is there some
other error that triggers this? Or is it a benign message?


  And, there is a CIB update failure that leads to a RECOVER action. There
 is a message that says the RECOVER action is not supported. Finally this
 leads to a stop and start of my resource.

 Well, and also Pacemaker's crmd process.
 My guess... the node is overloaded which is causing the cib queries to
 time out.


Is there a cib query timeout value that I can set? I was earlier getting
the TOTEM timeout.
So, I set the token to a larger value (5 seconds) in corosync.conf and
things were much better.
But now, I have started hitting this problem.
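
As an illustration of the token change described above: corosync reads the
token timeout in milliseconds, so 5 seconds would normally appear as
"token: 5000" inside the totem section of corosync.conf. The helper below is
a small sketch for checking which value a given file carries; the function
name get_token_ms and the default path in the usage comment are assumptions,
not something taken from this thread.

```shell
#!/bin/sh
# Print the totem token timeout (in ms) from a corosync.conf-style file.
# Assumes the usual "token: <value>" line; prints nothing if token is unset.
get_token_ms() {
    awk '$1 == "token:" { print $2; exit }' "$1"
}

# Usage (the path is the conventional default and is an assumption here):
# get_token_ms /etc/corosync/corosync.conf
```

The match is on the exact key "token:", so related keys such as
token_retransmits_before_loss_const are not picked up by mistake.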

Thanks,
Pavan

 I can copy the crm configure show output, but nothing special there.
 
  Thanks much.
  Pavan
 
  PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The
 underlying device that represents this resource has been removed. However,
 the resource is still part of the CIB. All errors related to that resource
 can be ignored. But can this cause a node to be stopped/fenced?

 Not if fencing is disabled.




Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-09 Thread pavan tc

 Is there a cib query timeout value that I can set? I was earlier getting
 the TOTEM timeout.
 So, I set the token to a larger value (5 seconds) in corosync.conf and
 things were much better.
 But now, I have started hitting this problem.


I'll experiment with the cibadmin -t (--timeout) option to see if it helps.
As I can see from the code, the default seems to be 30 ms.
Is there a widely used default for systems with a high load or is it found
out the hard way for each setup?

Pavan


Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-09 Thread Andrew Beekhof

On 10/05/2013, at 1:44 PM, pavan tc pavan...@gmail.com wrote:

 
 
 
 On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof and...@beekhof.net wrote:
 
 On 08/05/2013, at 9:16 PM, pavan tc pavan...@gmail.com wrote:
 
 
 Hi Andrew,
 
 Thanks much for looking into this. I have some queries inline.
  
  Hi,
 
  I have a two-node cluster with STONITH disabled.
 
 That's not a good idea.
 
 Ok. I'll try and configure stonith.
 
  I am still running with the pcmk plugin as opposed to the recommended CMAN 
  plugin.
 
 On rhel6?
 
 Yes.
  
 
 
  With 1.1.8, I see some messages (appended to this mail) once in a while. I 
  do not understand some keywords here - There is a Leave action. I am not 
  sure what that is.
 
 It means the cluster is not going to change the state of the resource.
 
 Why did the cluster execute the Leave action at this point?

There is no Leave action being executed.  We are simply logging that nothing 
is going to happen to that resource - it is in the state that we expect/want.

 Is there some other error that triggers this? Or is it a benign message?
 
 
  And, there is a CIB update failure that leads to a RECOVER action. There is 
  a message that says the RECOVER action is not supported. Finally this leads 
  to a stop and start of my resource.
 
 Well, and also Pacemaker's crmd process.
 My guess... the node is overloaded which is causing the cib queries to time 
 out.
 
 
 Is there a cib query timeout value that I can set?

No. You can set the batch-limit property, though; this reduces the rate at 
which CIB operations are attempted.

 I was earlier getting the TOTEM timeout.
 So, I set the token to a larger value (5 seconds) in corosync.conf and things 
 were much better.
 But now, I have started hitting this problem.
 
 Thanks,
 Pavan
 
  I can copy the crm configure show output, but nothing special there.
 
  Thanks much.
  Pavan
 
  PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The 
  underlying device that represents this resource has been removed. However, 
  the resource is still part of the CIB. All errors related to that resource 
  can be ignored. But can this cause a node to be stopped/fenced?
 
 Not if fencing is disabled.
 
 




Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-09 Thread pavan tc


 I'll experiment with the cibadmin -t (--timeout) option to see if it helps.
 As I can see from the code, the default seems to be 30 ms.
 Is there a widely used default for systems with a high load or is it found
 out the hard way for each setup?


Easier said than done. Can someone help with how to use the --timeout
option in cibadmin?

Pavan
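
A hedged sketch of passing the timeout option to cibadmin follows. The
-t/--timeout flag is the one named earlier in this thread; the value 60 and
the DRY_RUN switch are illustrative assumptions. The cibadmin help text
describes the value as seconds, which is worth confirming with
`cibadmin --help` on 1.1.8. Note that the option only affects that single
cibadmin invocation; it does not change how long the crmd waits for its own
CIB queries, which matches the answer elsewhere in the thread that no such
setting exists.

```shell
#!/bin/sh
# Sketch only: TIMEOUT_S=60 and the DRY_RUN switch are illustrative assumptions.
TIMEOUT_S="${TIMEOUT_S:-60}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Query the CIB, giving this one cibadmin call more time to complete.
run cibadmin --query --timeout "$TIMEOUT_S"
```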