Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
On 10/05/2013, at 6:16 PM, pavan tc pavan...@gmail.com wrote:

Well, and also Pacemaker's crmd process. My guess... the node is overloaded, which is causing the cib queries to time out.

Is there a cib query timeout value that I can set?

No. You can set the batch-limit property though; this reduces the rate at which CIB operations are attempted.

Ok. But does it mean that it can still happen if the load was high at the time the CIB operation is in flight?

Yes. But if the load is coming from the cluster running monitor on all the resources, then batch-limit will spread the work out over a longer period and avoid the machine being overloaded in the first place.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
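For reference, a minimal sketch of how the batch-limit property discussed above might be set. It assumes the crm_attribute tool that ships with Pacemaker. The value 10 is purely illustrative (the thread does not suggest a number), and DRY_RUN is a hypothetical variable introduced here so that the script only prints the commands unless told otherwise.

```shell
#!/bin/sh
# Illustrative value only; the thread does not recommend a specific number.
BATCH_LIMIT=10

# Build the commands as strings so they can be reviewed before being run.
# batch-limit is a cluster-wide option, so it is stored in the crm_config section.
set_cmd="crm_attribute --type crm_config --name batch-limit --update ${BATCH_LIMIT}"
get_cmd="crm_attribute --type crm_config --name batch-limit --query"

# DRY_RUN (hypothetical, defaults to 1): print the commands instead of
# changing the cluster configuration.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$set_cmd"
    echo "$get_cmd"
else
    $set_cmd && $get_cmd
fi
```

With the crm shell, `crm configure property batch-limit=10` should have the same effect. Running the script with DRY_RUN=0 applies the change and then reads the value back.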
Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof and...@beekhof.net wrote:

On 08/05/2013, at 9:16 PM, pavan tc pavan...@gmail.com wrote:

Hi Andrew, Thanks much for looking into this. I have some queries inline.

Hi, I have a two-node cluster with STONITH disabled.

That's not a good idea.

Ok. I'll try and configure stonith.

I am still running with the pcmk plugin as opposed to the recommended CMAN plugin.

On rhel6?

Yes.

With 1.1.8, I see some messages (appended to this mail) once in a while. I do not understand some keywords here - There is a Leave action. I am not sure what that is.

It means the cluster is not going to change the state of the resource.

Why did the cluster execute the Leave action at this point?

Is there some other error that triggers this? Or is it a benign message? And, there is a CIB update failure that leads to a RECOVER action. There is a message that says the RECOVER action is not supported. Finally this leads to a stop and start of my resource.

Well, and also Pacemaker's crmd process. My guess... the node is overloaded, which is causing the cib queries to time out.

Is there a cib query timeout value that I can set? I was earlier getting the TOTEM timeout. So, I set the token to a larger value (5 seconds) in corosync.conf and things were much better. But now, I have started hitting this problem.

Thanks,
Pavan

I can copy the crm configure show output, but nothing special there. Thanks much. Pavan

PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The underlying device that represents this resource has been removed. However, the resource is still part of the CIB. All errors related to that resource can be ignored.

But can this cause a node to be stopped/fenced?

Not if fencing is disabled.
Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
Is there a cib query timeout value that I can set? I was earlier getting the TOTEM timeout. So, I set the token to a larger value (5 seconds) in corosync.conf and things were much better. But now, I have started hitting this problem.

I'll experiment with the cibadmin -t (--timeout) option to see if it helps. As far as I can see from the code, the default seems to be 30 ms. Is there a widely used default for systems with a high load, or is it found out the hard way for each setup?

Pavan
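For readers unfamiliar with the token change mentioned above, a minimal sketch of what it might look like in corosync.conf. The token directive lives in the totem section and is given in milliseconds, so 5 seconds is 5000. Everything else in the totem section is omitted here and would stay as it is in the existing file.

```
totem {
    version: 2
    # token timeout in milliseconds; 5000 ms = the 5 seconds mentioned above
    token: 5000
}
```

Corosync typically needs a restart on each node for a change like this to take effect.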
Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
On 10/05/2013, at 1:44 PM, pavan tc pavan...@gmail.com wrote:

On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof and...@beekhof.net wrote:

On 08/05/2013, at 9:16 PM, pavan tc pavan...@gmail.com wrote:

Hi Andrew, Thanks much for looking into this. I have some queries inline.

Hi, I have a two-node cluster with STONITH disabled.

That's not a good idea.

Ok. I'll try and configure stonith.

I am still running with the pcmk plugin as opposed to the recommended CMAN plugin.

On rhel6?

Yes.

With 1.1.8, I see some messages (appended to this mail) once in a while. I do not understand some keywords here - There is a Leave action. I am not sure what that is.

It means the cluster is not going to change the state of the resource.

Why did the cluster execute the Leave action at this point?

There is no Leave action being executed. We are simply logging that nothing is going to happen to that resource - it is in the state that we expect/want.

Is there some other error that triggers this? Or is it a benign message? And, there is a CIB update failure that leads to a RECOVER action. There is a message that says the RECOVER action is not supported. Finally this leads to a stop and start of my resource.

Well, and also Pacemaker's crmd process. My guess... the node is overloaded, which is causing the cib queries to time out.

Is there a cib query timeout value that I can set?

No. You can set the batch-limit property though; this reduces the rate at which CIB operations are attempted.

I was earlier getting the TOTEM timeout. So, I set the token to a larger value (5 seconds) in corosync.conf and things were much better. But now, I have started hitting this problem.

Thanks,
Pavan

I can copy the crm configure show output, but nothing special there. Thanks much. Pavan

PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The underlying device that represents this resource has been removed. However, the resource is still part of the CIB. All errors related to that resource can be ignored.

But can this cause a node to be stopped/fenced?

Not if fencing is disabled.
Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
I'll experiment with the cibadmin -t (--timeout) option to see if it helps. As far as I can see from the code, the default seems to be 30 ms. Is there a widely used default for systems with a high load, or is it found out the hard way for each setup?

Easier said than done. Can someone help with how to use the --timeout option in cibadmin?

Pavan
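A minimal sketch of passing a timeout to cibadmin, in answer to the question above. It assumes that -t/--timeout takes a value in seconds, which is worth confirming against `cibadmin --help` on the installed version. The value 60 is illustrative, and DRY_RUN is a hypothetical variable introduced here so the script only prints the command by default. Note that this option should only govern the cibadmin invocation itself; as stated elsewhere in this thread, there is no setting for the timeout of crmd's own CIB queries.

```shell
#!/bin/sh
# Assumption: the timeout is given in seconds; confirm with `cibadmin --help`.
TIMEOUT=60   # illustrative value, not a recommendation from the thread

# Query the whole CIB, waiting up to TIMEOUT before the call is declared failed.
query_cmd="cibadmin --query --timeout ${TIMEOUT}"

# DRY_RUN (hypothetical, defaults to 1): only print the command.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$query_cmd"
else
    # Measure how long the query takes, which can help when picking a timeout
    # for a loaded node.
    start=$(date +%s)
    $query_cmd > /dev/null
    status=$?
    end=$(date +%s)
    echo "cibadmin exit status ${status} after $((end - start))s"
fi
```

Running it with DRY_RUN=0 on a loaded node gives a rough idea of how long a full CIB query takes there, which is probably the only reliable way to choose a value for a given setup.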