Re: [Pacemaker] crm_resource -L not trustable right after restart
On 22 Jan 2014, at 10:54 am, Brian J. Murrell (brian) wrote:

> On Thu, 2014-01-16 at 14:49 +1100, Andrew Beekhof wrote:
>>
>> What crm_mon are you looking at?
>> I see stuff like:
>>
>> virt-fencing   (stonith:fence_xvm):    Started rhos4-node3
>>  Resource Group: mysql-group
>>      mysql-vip  (ocf::heartbeat:IPaddr2):       Started rhos4-node3
>>      mysql-fs   (ocf::heartbeat:Filesystem):    Started rhos4-node3
>>      mysql-db   (ocf::heartbeat:mysql):         Started rhos4-node3
>
> Yes, you are right.  I couldn't see the forest for the trees.
>
> I initially was optimistic about crm_mon being more truthful than
> crm_resource but it turns out it is not.

It can't be: they're both obtaining their data from the same place (the cib).

> Take for example these commands to set a constraint and start a resource
> (which has already been defined at this point):
>
> [21/Jan/2014:13:46:40] cibadmin -o constraints -C -X '<rsc_location id="res1-primary" node="node5" rsc="res1" score="20"/>'
> [21/Jan/2014:13:46:41] cibadmin -o constraints -C -X '<rsc_location id="res1-secondary" node="node6" rsc="res1" score="10"/>'
> [21/Jan/2014:13:46:42] crm_resource -r 'res1' -p target-role -m -v 'Started'
>
> and then these repeated calls to crm_mon -1 on node5:
>
> [21/Jan/2014:13:46:42] crm_mon -1
> Last updated: Tue Jan 21 13:46:42 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
>
> Online: [ node5 node6 ]
>
> st-fencing   (stonith:fence_product):   Started node5
> res1         (ocf::product:Target):     Started node6
>
> [21/Jan/2014:13:46:42] crm_mon -1
> Last updated: Tue Jan 21 13:46:42 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
>
> Online: [ node5 node6 ]
>
> st-fencing   (stonith:fence_product):   Started node5
> res1         (ocf::product:Target):     Started node6
>
> [21/Jan/2014:13:46:49] crm_mon -1 -r
> Last updated: Tue Jan 21 13:46:49 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
>
> Online: [ node5 node6 ]
>
> Full list of resources:
>
> st-fencing   (stonith:fence_product):   Started node5
> res1         (ocf::product:Target):     Started node5
>
> The first two are not correct, showing the resource started on node6
> when it was actually started on node5.

Was it running there to begin with?

Answering my own question... yes, it was:

> Jan 21 13:46:41 node5 crmd[8695]: warning: status_from_rc: Action 6
> (res1_monitor_0) on node6 failed (target: 7 vs. rc: 0): Error

and then we try to stop it:

> Jan 21 13:46:41 node5 crmd[8695]: notice: te_rsc_command: Initiating action
> 7: stop res1_stop_0 on node6

So you are correct that something is wrong, but it isn't pacemaker.

> Finally, 7 seconds later, it is
> reporting correctly.  The logs on node{5,6} bear this out.  The resource
> was actually only ever started on node5 and never on node6.

Wrong.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
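For anyone scripting around this, pulling a resource's reported node out of `crm_mon -1` text is straightforward. A minimal sketch, assuming the one-line-per-resource layout shown above; the `resource_node` helper name is illustrative, not a real tool:

```shell
#!/bin/sh
# Sketch: print the node a resource is reported "Started" on, given the
# plain-text output of `crm_mon -1`.  The output is passed in as an
# argument so the logic can be exercised without a running cluster.
resource_node() {
    # $1 = resource id, $2 = captured crm_mon -1 output
    printf '%s\n' "$2" | awk -v r="$1" '$1 == r && /Started/ { print $NF; exit }'
}

# In practice: resource_node res1 "$(crm_mon -1)"
```

Note this inherits exactly the problem discussed in the thread: it is only as truthful as the cib that crm_mon read from.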
Re: [Pacemaker] crm_resource -L not trustable right after restart
On Thu, 2014-01-16 at 14:49 +1100, Andrew Beekhof wrote:
>
> What crm_mon are you looking at?
> I see stuff like:
>
> virt-fencing   (stonith:fence_xvm):    Started rhos4-node3
>  Resource Group: mysql-group
>      mysql-vip  (ocf::heartbeat:IPaddr2):       Started rhos4-node3
>      mysql-fs   (ocf::heartbeat:Filesystem):    Started rhos4-node3
>      mysql-db   (ocf::heartbeat:mysql):         Started rhos4-node3

Yes, you are right.  I couldn't see the forest for the trees.

I initially was optimistic about crm_mon being more truthful than
crm_resource but it turns out it is not.

Take for example these commands to set a constraint and start a resource
(which has already been defined at this point):

[21/Jan/2014:13:46:40] cibadmin -o constraints -C -X '<rsc_location id="res1-primary" node="node5" rsc="res1" score="20"/>'
[21/Jan/2014:13:46:41] cibadmin -o constraints -C -X '<rsc_location id="res1-secondary" node="node6" rsc="res1" score="10"/>'
[21/Jan/2014:13:46:42] crm_resource -r 'res1' -p target-role -m -v 'Started'

and then these repeated calls to crm_mon -1 on node5:

[21/Jan/2014:13:46:42] crm_mon -1
Last updated: Tue Jan 21 13:46:42 2014
Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
Stack: openais
Current DC: node5 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured
2 Resources configured

Online: [ node5 node6 ]

st-fencing   (stonith:fence_product):   Started node5
res1         (ocf::product:Target):     Started node6

[21/Jan/2014:13:46:42] crm_mon -1
Last updated: Tue Jan 21 13:46:42 2014
Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
Stack: openais
Current DC: node5 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured
2 Resources configured

Online: [ node5 node6 ]

st-fencing   (stonith:fence_product):   Started node5
res1         (ocf::product:Target):     Started node6

[21/Jan/2014:13:46:49] crm_mon -1 -r
Last updated: Tue Jan 21 13:46:49 2014
Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
Stack: openais
Current DC: node5 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured
2 Resources configured

Online: [ node5 node6 ]

Full list of resources:

st-fencing   (stonith:fence_product):   Started node5
res1         (ocf::product:Target):     Started node5

The first two are not correct, showing the resource started on node6
when it was actually started on node5.  Finally, 7 seconds later, it is
reporting correctly.  The logs on node{5,6} bear this out.  The resource
was actually only ever started on node5 and never on node6.

Here's the log for node5:

Jan 21 13:42:00 node5 pacemaker: Starting Pacemaker Cluster Manager
Jan 21 13:42:00 node5 pacemakerd[8684]: notice: main: Starting Pacemaker 1.1.10-14.el6_5.1 (Build: 368c726): generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc nagios corosync-plugin cman
Jan 21 13:42:00 node5 pacemakerd[8684]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 stonith-ng[8691]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Jan 21 13:42:00 node5 cib[8690]: notice: main: Using new config location: /var/lib/pacemaker/cib
Jan 21 13:42:00 node5 cib[8690]: warning: retrieveCib: Cluster configuration not found: /var/lib/pacemaker/cib/cib.xml
Jan 21 13:42:00 node5 cib[8690]: warning: readCibXmlFile: Primary configuration corrupt or unusable, trying backups
Jan 21 13:42:00 node5 cib[8690]: warning: readCibXmlFile: Continuing with an empty configuration.
Jan 21 13:42:00 node5 attrd[8693]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Jan 21 13:42:00 node5 crmd[8695]: notice: main: CRM Git Version: 368c726
Jan 21 13:42:00 node5 attrd[8693]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 corosync[8646]: [pcmk ] info: pcmk_ipc: Recorded connection 0x1cbc3c0 for attrd/0
Jan 21 13:42:00 node5 stonith-ng[8691]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 corosync[8646]: [pcmk ] info: pcmk_ipc: Recorded connection 0x1cb8040 for stonith-ng/0
Jan 21 13:42:00 node5 attrd[8693]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 stonith-ng[8691]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 attrd[8693]: notice: main: Starting mainloop...
Jan 21 13:42:00 node5 cib[8690]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Jan 21 13:42:00 node5 cib[8690]: notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 corosync[8646]: [pcmk ] info: pcmk_ipc: Recorded connection 0x1cc0740 for cib/0
Jan 21 13:42:00 node5
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 16 Jan 2014, at 1:13 pm, Brian J. Murrell (brian) wrote:

> On Thu, 2014-01-16 at 08:35 +1100, Andrew Beekhof wrote:
>>
>> I know, I was giving you another example of when the cib is not completely
>> up-to-date with reality.
>
> Yeah, I understood that.  I was just countering with why that example is
> actually more acceptable.
>
>> It may very well be partially started.
>
> Sure.
>
>> It's almost certainly not stopped, which is what is being reported.
>
> Right.  But until it is completely started (and ready to do whatever
> it's supposed to do), it might as well be considered stopped.  If you
> have to make a binary state out of stopped, starting, started, I think
> most people will agree that the states are stopped and started, and
> stopped is anything < started, since most things are not useful until
> they are fully started.
>
>> You're not using the output to decide whether to perform some logic?
>
> Nope.  Just reporting the state.  But that's difficult when you have two
> participants making positive assertions about state when one is not
> really in a position to do so.
>
>> Because crm_mon is the more usual command to run right after startup
>
> The problem with crm_mon is that it doesn't tell you where a resource is
> running.

What crm_mon are you looking at?
I see stuff like:

virt-fencing   (stonith:fence_xvm):    Started rhos4-node3
 Resource Group: mysql-group
     mysql-vip  (ocf::heartbeat:IPaddr2):       Started rhos4-node3
     mysql-fs   (ocf::heartbeat:Filesystem):    Started rhos4-node3
     mysql-db   (ocf::heartbeat:mysql):         Started rhos4-node3

>> (which would give you enough context to know things are still syncing).
>
> That's interesting.  Would polling crm_mon be more efficient than
> polling the remote CIB with cibadmin -Q?

crm_mon in interactive mode subscribes to updates from the cib, which is
more efficient than repeatedly calling cibadmin or crm_mon.

>> DC election happens at the crmd.
>
> So would it be fair to say then that I should not trust the local CIB
> until DC election has finished, or could there be latency between that
> completing and the CIB being refreshed?

After the join completes (which happens after the election, or when a new
node is found), then it is safe.  You can tell this by running:

  crmadmin -S -H `uname -n`

and looking for S_IDLE, S_POLICY_ENGINE or S_TRANSITION_ENGINE, iirc.

> If DC election completion is accurate, what's the best way to determine
> that has completed?

Ideally it doesn't happen when a node joins an existing cluster.
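The crmadmin check Andrew describes can be wrapped in a small poll loop. A hedged sketch; the helper name is illustrative, and the safe-state list is the one given above (offered "iirc"):

```shell
#!/bin/sh
# Sketch: wait until the join has completed before trusting the local CIB.
# The state string normally comes from `crmadmin -S -H $(uname -n)`;
# cib_safe_state takes it as an argument so the decision logic can be
# tested without a running cluster.
cib_safe_state() {
    case "$1" in
        *S_IDLE*|*S_POLICY_ENGINE*|*S_TRANSITION_ENGINE*) return 0 ;;
        *) return 1 ;;
    esac
}

# In practice, something like:
#   until cib_safe_state "$(crmadmin -S -H "$(uname -n)")"; do sleep 1; done
#   crm_resource --locate -r res1
```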
Re: [Pacemaker] crm_resource -L not trustable right after restart
On Thu, 2014-01-16 at 08:35 +1100, Andrew Beekhof wrote:
>
> I know, I was giving you another example of when the cib is not completely
> up-to-date with reality.

Yeah, I understood that.  I was just countering with why that example is
actually more acceptable.

> It may very well be partially started.

Sure.

> It's almost certainly not stopped, which is what is being reported.

Right.  But until it is completely started (and ready to do whatever
it's supposed to do), it might as well be considered stopped.  If you
have to make a binary state out of stopped, starting, started, I think
most people will agree that the states are stopped and started, and
stopped is anything < started, since most things are not useful until
they are fully started.

> You're not using the output to decide whether to perform some logic?

Nope.  Just reporting the state.  But that's difficult when you have two
participants making positive assertions about state when one is not
really in a position to do so.

> Because crm_mon is the more usual command to run right after startup

The problem with crm_mon is that it doesn't tell you where a resource is
running.

> (which would give you enough context to know things are still syncing).

That's interesting.  Would polling crm_mon be more efficient than
polling the remote CIB with cibadmin -Q?

> DC election happens at the crmd.

So would it be fair to say then that I should not trust the local CIB
until DC election has finished, or could there be latency between that
completing and the CIB being refreshed?

If DC election completion is accurate, what's the best way to determine
that has completed?

b.
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 16 Jan 2014, at 6:53 am, Brian J. Murrell (brian) wrote:

> On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
>>
>> Consider any long running action, such as starting a database.
>> We do not update the CIB until after actions have completed, so there can
>> and will be times when the status section is out of date to one degree or
>> another.
>
> But that is the opposite of what I am reporting

I know, I was giving you another example of when the cib is not completely
up-to-date with reality.

> and is acceptable.  It's
> acceptable for a resource that is in the process of starting being
> reported as stopped, because it's not yet started.

It may very well be partially started.  It's almost certainly not stopped,
which is what is being reported.

> What I am seeing is resources being reported as stopped when they are in
> fact started/running and have been for a long time.
>
>> At node startup is another point at which the status could potentially be
>> behind.
>
> Right.  Which is the case I am talking about.
>
>> It sounds to me like you're trying to second guess the cluster, which is a
>> dangerous path.
>
> No, not trying to second guess at all.

You're not using the output to decide whether to perform some logic?
Because crm_mon is the more usual command to run right after startup
(which would give you enough context to know things are still syncing).

> I'm just trying to ask the
> cluster what the state is and not getting the truth.  I am willing to
> believe whatever state the cluster says it's in, as long as what I am
> getting is the truth.
>
>> What if it's the first node to start up?
>
> I'd think a timeout comes in to play here.
>
>> There'd be no fresh copy to arrive in that case.
>
> I can't say that I know how the CIB works internally/entirely, but I'd
> imagine that when a cluster node starts up it tries to see if there is a
> fresher CIB out there in the cluster.

Nope.

> Maybe this is part of the
> process of choosing/discovering a DC.

DC election happens at the crmd.  The cib is a dumb repository of
name/value pairs.  It doesn't even understand new vs. old - only
different.

> But ultimately if the node is the
> first one up, it will eventually figure that out so that it can nominate
> itself as the DC.  Or it finds out that there is a DC already (and gets
> a fresh CIB from it?).  It's during that window that I propose that
> crm_resource should not be asserting anything and should just admit that
> it does not (yet) know.
>
>> If it had enough information to know it was out of date, it wouldn't be out
>> of date.
>
> But surely it understands whether it is in the process of joining a cluster
> or not, and therefore does know enough to know that it doesn't know if
> it's out of date or not.

And if it has a newer config compared to the existing nodes?

> But that it could be.
>
>> As above, there are situations when you'd never get an answer.
>
> I should have added to my proposal "or has determined that there is
> nothing to refresh its CIB from" and that its local copy is
> authoritative for the whole cluster.
>
> b.
Re: [Pacemaker] crm_resource -L not trustable right after restart
On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
>
> Consider any long running action, such as starting a database.
> We do not update the CIB until after actions have completed, so there can
> and will be times when the status section is out of date to one degree or
> another.

But that is the opposite of what I am reporting and is acceptable.  It's
acceptable for a resource that is in the process of starting being
reported as stopped, because it's not yet started.

What I am seeing is resources being reported as stopped when they are in
fact started/running and have been for a long time.

> At node startup is another point at which the status could potentially be
> behind.

Right.  Which is the case I am talking about.

> It sounds to me like you're trying to second guess the cluster, which is a
> dangerous path.

No, not trying to second guess at all.  I'm just trying to ask the
cluster what the state is and not getting the truth.  I am willing to
believe whatever state the cluster says it's in, as long as what I am
getting is the truth.

> What if it's the first node to start up?

I'd think a timeout comes in to play here.

> There'd be no fresh copy to arrive in that case.

I can't say that I know how the CIB works internally/entirely, but I'd
imagine that when a cluster node starts up it tries to see if there is a
fresher CIB out there in the cluster.  Maybe this is part of the
process of choosing/discovering a DC.  But ultimately if the node is the
first one up, it will eventually figure that out so that it can nominate
itself as the DC.  Or it finds out that there is a DC already (and gets
a fresh CIB from it?).  It's during that window that I propose that
crm_resource should not be asserting anything and should just admit that
it does not (yet) know.

> If it had enough information to know it was out of date, it wouldn't be out
> of date.

But surely it understands whether it is in the process of joining a cluster
or not, and therefore does know enough to know that it doesn't know if
it's out of date or not.  But that it could be.

> As above, there are situations when you'd never get an answer.

I should have added to my proposal "or has determined that there is
nothing to refresh its CIB from" and that its local copy is
authoritative for the whole cluster.

b.
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 14 Jan 2014, at 11:50 pm, Brian J. Murrell (brian) wrote:

> On Tue, 2014-01-14 at 16:01 +1100, Andrew Beekhof wrote:
>>
>>> On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
>>>>
>>>> The local cib hasn't caught up yet by the looks of it.
>
> I should have asked in my previous message: is this entirely an artifact
> of having just restarted, or are there any other times where the local
> CIB can in fact be out of date (and thus crm_resource is inaccurate), if
> even for a brief period of time?  I just want to completely understand
> the nature of this situation.

Consider any long running action, such as starting a database.
We do not update the CIB until after actions have completed, so there can
and will be times when the status section is out of date to one degree or
another.

At node startup is another point at which the status could potentially be
behind.

It sounds to me like you're trying to second guess the cluster, which is
a dangerous path.

>> It doesn't know that it doesn't know.
>
> But it (pacemaker at least) does know that it's just started up, and
> should also know whether it's gotten a fresh copy of the CIB since
> starting up, right?

What if it's the first node to start up?  There'd be no fresh copy to
arrive in that case.

Many things are obvious to external observers that are not at all obvious
to the cluster.  If it had enough information to know it was out of date,
it wouldn't be out of date.

> I think I'd consider it required behaviour that
> pacemaker not consider itself authoritative enough to provide answers
> like "location" until it has gotten a fresh copy of the CIB.
>
>> Does it show anything as running?  Any nodes as online?
>> I'd not expect that it stays in that situation for more than a second or
>> two...
>
> You are probably right about that.  But unfortunately that second or two
> provides a large enough window to provide mis-information.
>
>> We could add an option to force crm_resource to use the master instance
>> instead of the local one I guess.
>
> Or, depending on the answers to above (like can this local-is-not-true
> situation ever manifest itself at times other than "just started")
> perhaps just don't allow crm_resource (or any other tool) to provide
> information from the local CIB until it's been refreshed at least once
> since a startup.

As above, there are situations when you'd never get an answer.

> I would much rather crm_resource experience some latency in being able
> to provide answers than provide wrong ones.  Perhaps there needs to be a
> switch to indicate if it should block waiting for the local CIB to be
> up-to-date or should return immediately with an "unknown" type response
> if the local CIB has not yet been updated since a start.
>
> Cheers,
> b.
Re: [Pacemaker] crm_resource -L not trustable right after restart
On Tue, 2014-01-14 at 16:01 +1100, Andrew Beekhof wrote:
>
>> On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
>>>
>>> The local cib hasn't caught up yet by the looks of it.

I should have asked in my previous message: is this entirely an artifact
of having just restarted, or are there any other times where the local
CIB can in fact be out of date (and thus crm_resource is inaccurate), if
even for a brief period of time?  I just want to completely understand
the nature of this situation.

> It doesn't know that it doesn't know.

But it (pacemaker at least) does know that it's just started up, and
should also know whether it's gotten a fresh copy of the CIB since
starting up, right?  I think I'd consider it required behaviour that
pacemaker not consider itself authoritative enough to provide answers
like "location" until it has gotten a fresh copy of the CIB.

> Does it show anything as running?  Any nodes as online?
> I'd not expect that it stays in that situation for more than a second or
> two...

You are probably right about that.  But unfortunately that second or two
provides a large enough window to provide mis-information.

> We could add an option to force crm_resource to use the master instance
> instead of the local one I guess.

Or, depending on the answers to above (like can this local-is-not-true
situation ever manifest itself at times other than "just started")
perhaps just don't allow crm_resource (or any other tool) to provide
information from the local CIB until it's been refreshed at least once
since a startup.

I would much rather crm_resource experience some latency in being able
to provide answers than provide wrong ones.  Perhaps there needs to be a
switch to indicate if it should block waiting for the local CIB to be
up-to-date or should return immediately with an "unknown" type response
if the local CIB has not yet been updated since a start.

Cheers,
b.
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 14 Jan 2014, at 3:41 pm, Brian J. Murrell (brian) wrote:

> On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
>>
>> The local cib hasn't caught up yet by the looks of it.
>
> Should crm_resource actually be [mis-]reporting as if it were
> knowledgeable when it's not though?  IOW is this expected behaviour or
> should it be considered a bug?  Should I open a ticket?

It doesn't know that it doesn't know.

Does it show anything as running?  Any nodes as online?
I'd not expect that it stays in that situation for more than a second or
two...

>> You could compare 'cibadmin -Ql' with 'cibadmin -Q'
>
> Is there no other way to force crm_resource to be truthful/accurate, or
> silent if it cannot be truthful/accurate?  Having to run this kind of
> pre-check before every crm_resource --locate seems like it's going to
> drive overhead up quite a bit.

True.

> Maybe I am using the wrong tool for the job.  Is there a better tool
> than crm_resource to ascertain, with full truthfulness (or silence if
> truthfulness is not possible), where resources are running?

We could add an option to force crm_resource to use the master instance
instead of the local one, I guess.
Re: [Pacemaker] crm_resource -L not trustable right after restart
On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
>
> The local cib hasn't caught up yet by the looks of it.

Should crm_resource actually be [mis-]reporting as if it were
knowledgeable when it's not though?  IOW is this expected behaviour or
should it be considered a bug?  Should I open a ticket?

> You could compare 'cibadmin -Ql' with 'cibadmin -Q'

Is there no other way to force crm_resource to be truthful/accurate, or
silent if it cannot be truthful/accurate?  Having to run this kind of
pre-check before every crm_resource --locate seems like it's going to
drive overhead up quite a bit.

Maybe I am using the wrong tool for the job.  Is there a better tool
than crm_resource to ascertain, with full truthfulness (or silence if
truthfulness is not possible), where resources are running?

Cheers,
b.
Re: [Pacemaker] crm_resource -L not trustable right after restart
On 14 Jan 2014, at 5:13 am, Brian J. Murrell (brian) wrote:

> Hi,
>
> I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output
> of "crm_resource -L" is not trust-able shortly after a node is booted.
>
> Here is the output from crm_resource -L on one of the nodes in a two
> node cluster (the one that was not rebooted):
>
> st-fencing   (stonith:fence_foo):   Started
> res1         (ocf::foo:Target):     Started
> res2         (ocf::foo:Target):     Started
>
> Here is the output from the same command on the other node in the two
> node cluster right after it was rebooted:
>
> st-fencing   (stonith:fence_foo):   Stopped
> res1         (ocf::foo:Target):     Stopped
> res2         (ocf::foo:Target):     Stopped
>
> These were collected at the same time (within the same second) on the
> two nodes.
>
> Clearly the rebooted node is not telling the truth.  Perhaps the truth
> for it is "I don't know", which would be fair enough, but that's not what
> pacemaker is asserting there.
>
> So, how do I know (i.e. programmatically -- what command can I issue to
> know) if and when crm_resource can be trusted to be truthful?

The local cib hasn't caught up yet by the looks of it.

You could compare 'cibadmin -Ql' with 'cibadmin -Q'.
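One way to script the 'cibadmin -Ql' vs 'cibadmin -Q' comparison suggested above is to compare the version attributes on the `<cib>` header rather than diffing the whole document. A hedged sketch: the `cib_version` helper name is mine, and it assumes the version fields appear as `admin_epoch`, `epoch` and `num_updates` attributes on that header; it takes XML text as an argument so it can be exercised without a cluster.

```shell
#!/bin/sh
# Sketch: decide whether the local CIB has caught up with the master copy
# by comparing admin_epoch/epoch/num_updates from the <cib ...> header.
# In practice the inputs would be "$(cibadmin -Ql)" (local) and
# "$(cibadmin -Q)" (master).
cib_version() {
    # Print admin_epoch.epoch.num_updates from a <cib ...> header.
    # The leading space in the sed pattern keeps "epoch" from matching
    # inside "admin_epoch".
    v=""
    for attr in admin_epoch epoch num_updates; do
        n=$(printf '%s\n' "$1" | sed -n "s/.* ${attr}=\"\([0-9]*\)\".*/\1/p" | head -n1)
        v="$v.$n"
    done
    printf '%s\n' "${v#.}"
}

# In practice, something like:
#   [ "$(cib_version "$(cibadmin -Ql)")" = "$(cib_version "$(cibadmin -Q)")" ] \
#       && echo "local CIB is in sync"
```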
[Pacemaker] crm_resource -L not trustable right after restart
Hi,

I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output
of "crm_resource -L" is not trust-able shortly after a node is booted.

Here is the output from crm_resource -L on one of the nodes in a two
node cluster (the one that was not rebooted):

st-fencing   (stonith:fence_foo):   Started
res1         (ocf::foo:Target):     Started
res2         (ocf::foo:Target):     Started

Here is the output from the same command on the other node in the two
node cluster right after it was rebooted:

st-fencing   (stonith:fence_foo):   Stopped
res1         (ocf::foo:Target):     Stopped
res2         (ocf::foo:Target):     Stopped

These were collected at the same time (within the same second) on the
two nodes.

Clearly the rebooted node is not telling the truth.  Perhaps the truth
for it is "I don't know", which would be fair enough, but that's not what
pacemaker is asserting there.

So, how do I know (i.e. programmatically -- what command can I issue to
know) if and when crm_resource can be trusted to be truthful?

b.