Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-02-17 Thread Andrew Beekhof

On 22 Jan 2014, at 10:54 am, Brian J. Murrell (brian)  
wrote:

> On Thu, 2014-01-16 at 14:49 +1100, Andrew Beekhof wrote:
>> 
>> What crm_mon are you looking at?
>> I see stuff like:
>> 
>> virt-fencing (stonith:fence_xvm):Started rhos4-node3 
>> Resource Group: mysql-group
>> mysql-vip(ocf::heartbeat:IPaddr2):   Started rhos4-node3 
>> mysql-fs (ocf::heartbeat:Filesystem):Started rhos4-node3 
>> mysql-db (ocf::heartbeat:mysql): Started rhos4-node3 
> 
> Yes, you are right.  I couldn't see the forest for the trees.
> 
> I initially was optimistic about crm_mon being more truthful than
> crm_resource but it turns out it is not.

It can't be, they're both obtaining their data from the same place (the cib).

> 
> Take for example these commands to set a constraint and start a resource
> (which has already been defined at this point):
> 
> [21/Jan/2014:13:46:40] cibadmin -o constraints -C -X '<rsc_location id="res1-primary" node="node5" rsc="res1" score="20"/>'
> [21/Jan/2014:13:46:41] cibadmin -o constraints -C -X '<rsc_location id="res1-secondary" node="node6" rsc="res1" score="10"/>'
> [21/Jan/2014:13:46:42] crm_resource -r 'res1' -p target-role -m -v 'Started'
> 
> and then these repeated calls to crm_mon -1 on node5:
> 
> [21/Jan/2014:13:46:42] crm_mon -1
> Last updated: Tue Jan 21 13:46:42 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
> 
> 
> Online: [ node5 node6 ]
> 
> st-fencing(stonith:fence_product):Started node5 
> res1  (ocf::product:Target):  Started node6 
> 
> [21/Jan/2014:13:46:42] crm_mon -1
> Last updated: Tue Jan 21 13:46:42 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
> 
> 
> Online: [ node5 node6 ]
> 
> st-fencing(stonith:fence_product):Started node5 
> res1  (ocf::product:Target):  Started node6 
> 
> [21/Jan/2014:13:46:49] crm_mon -1 -r
> Last updated: Tue Jan 21 13:46:49 2014
> Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
> Stack: openais
> Current DC: node5 - partition with quorum
> Version: 1.1.10-14.el6_5.1-368c726
> 2 Nodes configured
> 2 Resources configured
> 
> 
> Online: [ node5 node6 ]
> 
> Full list of resources:
> 
> st-fencing(stonith:fence_product):Started node5 
> res1  (ocf::product:Target):  Started node5 
> 
> The first two are not correct, showing the resource started on node6
> when it was actually started on node5.

Was it running there to begin with?
Answering my own question... yes. It was:

> Jan 21 13:46:41 node5 crmd[8695]:  warning: status_from_rc: Action 6 
> (res1_monitor_0) on node6 failed (target: 7 vs. rc: 0): Error

and then we try to stop it:

> Jan 21 13:46:41 node5 crmd[8695]:   notice: te_rsc_command: Initiating action 
> 7: stop res1_stop_0 on node6


So you are correct that something is wrong, but it isn't pacemaker.


>  Finally, 7 seconds later, it is
> reporting correctly.  The logs on node{5,6} bear this out.  The resource
> was actually only ever started on node5 and never on node6.

Wrong.



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-21 Thread Brian J. Murrell (brian)
On Thu, 2014-01-16 at 14:49 +1100, Andrew Beekhof wrote:
> 
> What crm_mon are you looking at?
> I see stuff like:
> 
>  virt-fencing (stonith:fence_xvm):Started rhos4-node3 
>  Resource Group: mysql-group
>  mysql-vip(ocf::heartbeat:IPaddr2):   Started rhos4-node3 
>  mysql-fs (ocf::heartbeat:Filesystem):Started rhos4-node3 
>  mysql-db (ocf::heartbeat:mysql): Started rhos4-node3 

Yes, you are right.  I couldn't see the forest for the trees.

I initially was optimistic about crm_mon being more truthful than
crm_resource but it turns out it is not.

Take for example these commands to set a constraint and start a resource
(which has already been defined at this point):

[21/Jan/2014:13:46:40] cibadmin -o constraints -C -X '<rsc_location id="res1-primary" node="node5" rsc="res1" score="20"/>'
[21/Jan/2014:13:46:41] cibadmin -o constraints -C -X '<rsc_location id="res1-secondary" node="node6" rsc="res1" score="10"/>'
[21/Jan/2014:13:46:42] crm_resource -r 'res1' -p target-role -m -v 'Started'

and then these repeated calls to crm_mon -1 on node5:

[21/Jan/2014:13:46:42] crm_mon -1
Last updated: Tue Jan 21 13:46:42 2014
Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
Stack: openais
Current DC: node5 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured
2 Resources configured


Online: [ node5 node6 ]

 st-fencing (stonith:fence_product):Started node5 
 res1   (ocf::product:Target):  Started node6 

[21/Jan/2014:13:46:42] crm_mon -1
Last updated: Tue Jan 21 13:46:42 2014
Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
Stack: openais
Current DC: node5 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured
2 Resources configured


Online: [ node5 node6 ]

 st-fencing (stonith:fence_product):Started node5 
 res1   (ocf::product:Target):  Started node6 

[21/Jan/2014:13:46:49] crm_mon -1 -r
Last updated: Tue Jan 21 13:46:49 2014
Last change: Tue Jan 21 13:46:42 2014 via crm_resource on node5
Stack: openais
Current DC: node5 - partition with quorum
Version: 1.1.10-14.el6_5.1-368c726
2 Nodes configured
2 Resources configured


Online: [ node5 node6 ]

Full list of resources:

 st-fencing (stonith:fence_product):Started node5 
 res1   (ocf::product:Target):  Started node5 

The first two are not correct, showing the resource started on node6
when it was actually started on node5.  Finally, 7 seconds later, it is
reporting correctly.  The logs on node{5,6} bear this out.  The resource
was actually only ever started on node5 and never on node6.

Here's the log for node5:

Jan 21 13:42:00 node5 pacemaker: Starting Pacemaker Cluster Manager
Jan 21 13:42:00 node5 pacemakerd[8684]:   notice: main: Starting Pacemaker 
1.1.10-14.el6_5.1 (Build: 368c726):  generated-manpages agent-manpages 
ascii-docs publican-docs ncurses libqb-logging libqb-ipc nagios  
corosync-plugin cman
Jan 21 13:42:00 node5 pacemakerd[8684]:   notice: get_node_name: Defaulting to 
uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 stonith-ng[8691]:   notice: crm_cluster_connect: 
Connecting to cluster infrastructure: classic openais (with plugin)
Jan 21 13:42:00 node5 cib[8690]:   notice: main: Using new config location: 
/var/lib/pacemaker/cib
Jan 21 13:42:00 node5 cib[8690]:  warning: retrieveCib: Cluster configuration 
not found: /var/lib/pacemaker/cib/cib.xml
Jan 21 13:42:00 node5 cib[8690]:  warning: readCibXmlFile: Primary 
configuration corrupt or unusable, trying backups
Jan 21 13:42:00 node5 cib[8690]:  warning: readCibXmlFile: Continuing with an 
empty configuration.
Jan 21 13:42:00 node5 attrd[8693]:   notice: crm_cluster_connect: Connecting to 
cluster infrastructure: classic openais (with plugin)
Jan 21 13:42:00 node5 crmd[8695]:   notice: main: CRM Git Version: 368c726
Jan 21 13:42:00 node5 attrd[8693]:   notice: get_node_name: Defaulting to uname 
-n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 corosync[8646]:   [pcmk  ] info: pcmk_ipc: Recorded 
connection 0x1cbc3c0 for attrd/0
Jan 21 13:42:00 node5 stonith-ng[8691]:   notice: get_node_name: Defaulting to 
uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 corosync[8646]:   [pcmk  ] info: pcmk_ipc: Recorded 
connection 0x1cb8040 for stonith-ng/0
Jan 21 13:42:00 node5 attrd[8693]:   notice: get_node_name: Defaulting to uname 
-n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 stonith-ng[8691]:   notice: get_node_name: Defaulting to 
uname -n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 attrd[8693]:   notice: main: Starting mainloop...
Jan 21 13:42:00 node5 cib[8690]:   notice: crm_cluster_connect: Connecting to 
cluster infrastructure: classic openais (with plugin)
Jan 21 13:42:00 node5 cib[8690]:   notice: get_node_name: Defaulting to uname 
-n for the local classic openais (with plugin) node name
Jan 21 13:42:00 node5 corosync[8646]:   [pcmk  ] info: pcmk_ipc: Recorded 
connection 0x1cc0740 for cib/0
Jan 21 13:42:00 node5

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-15 Thread Andrew Beekhof

On 16 Jan 2014, at 1:13 pm, Brian J. Murrell (brian)  
wrote:

> On Thu, 2014-01-16 at 08:35 +1100, Andrew Beekhof wrote:
>> 
>> I know, I was giving you another example of when the cib is not completely 
>> up-to-date with reality.
> 
> Yeah, I understood that.  I was just countering with why that example is
> actually more acceptable.
> 
>> It may very well be partially started.
> 
> Sure.
> 
>> It's almost certainly not stopped, which is what is being reported.
> 
> Right.  But until it is completely started (and ready to do whatever
> it's supposed to do), it might as well be considered stopped.  If you
> have to make a binary state out of stopped, starting and started, I
> think most people will agree the states should be stopped and started,
> with anything short of started counting as stopped, since most things
> are not useful until they are fully started.
> 
>> You're not using the output to decide whether to perform some logic?
> 
> Nope.  Just reporting the state.  But that's difficult when you have two
> participants making positive assertions about state when one is not
> really in a position to do so.
> 
>> Because crm_mon is the more usual command to run right after startup
> 
> The problem with crm_mon is that it doesn't tell you where a resource is
> running.

What crm_mon are you looking at?
I see stuff like:

 virt-fencing   (stonith:fence_xvm):Started rhos4-node3 
 Resource Group: mysql-group
 mysql-vip  (ocf::heartbeat:IPaddr2):   Started rhos4-node3 
 mysql-fs   (ocf::heartbeat:Filesystem):Started rhos4-node3 
 mysql-db   (ocf::heartbeat:mysql): Started rhos4-node3 


> 
>> (which would give you enough context to know things are still syncing).
> 
> That's interesting.  Would polling crm_mon be more efficient than
> polling the remote CIB with cibadmin -Q?

crm_mon in interactive mode subscribes to updates from the cib, which would be 
more efficient than repeatedly calling cibadmin or crm_mon.

> 
>> DC election happens at the crmd.
> 
> So would it be fair to say then that I should not trust the local CIB
> until DC election has finished, or could there be latency between that
> completing and the CIB being refreshed?

After the join completes (which happens after the election, or when a new node 
is found), it is safe.
You can tell by running crmadmin -S -H `uname -n` and looking for S_IDLE, 
S_POLICY_ENGINE or S_TRANSITION_ENGINE, iirc.
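A rough shell sketch of that check (the state names come from the advice above; the helper names, timeout and one-second polling interval are my own, illustrative choices):

```shell
# is_stable_state: succeed if crmadmin's output names a crmd state in
# which the local CIB can be trusted (S_IDLE, S_POLICY_ENGINE or
# S_TRANSITION_ENGINE, per the reply above).
is_stable_state() {
    case "$1" in
        *S_IDLE*|*S_POLICY_ENGINE*|*S_TRANSITION_ENGINE*) return 0 ;;
        *) return 1 ;;
    esac
}

# wait_for_join: poll crmadmin until the local crmd reports a stable
# state, or give up after $1 seconds (default 60).
wait_for_join() {
    timeout=${1:-60}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if is_stable_state "$(crmadmin -S -H "$(uname -n)" 2>/dev/null)"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

Scripts would call wait_for_join before trusting crm_resource -L or crm_mon -1 on a freshly booted node.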

> 
> If DC election completion is accurate, what's the best way to determine
> that has completed?

Ideally it doesn't happen when a node joins an existing cluster.





Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-15 Thread Brian J. Murrell (brian)
On Thu, 2014-01-16 at 08:35 +1100, Andrew Beekhof wrote:
> 
> I know, I was giving you another example of when the cib is not completely 
> up-to-date with reality.

Yeah, I understood that.  I was just countering with why that example is
actually more acceptable.

> It may very well be partially started.

Sure.

> It's almost certainly not stopped, which is what is being reported.

Right.  But until it is completely started (and ready to do whatever
it's supposed to do), it might as well be considered stopped.  If you
have to make a binary state out of stopped, starting and started, I
think most people will agree the states should be stopped and started,
with anything short of started counting as stopped, since most things
are not useful until they are fully started.

> You're not using the output to decide whether to perform some logic?

Nope.  Just reporting the state.  But that's difficult when you have two
participants making positive assertions about state when one is not
really in a position to do so.

> Because crm_mon is the more usual command to run right after startup

The problem with crm_mon is that it doesn't tell you where a resource is
running.

>  (which would give you enough context to know things are still syncing).

That's interesting.  Would polling crm_mon be more efficient than
polling the remote CIB with cibadmin -Q?

> DC election happens at the crmd.

So would it be fair to say then that I should not trust the local CIB
until DC election has finished, or could there be latency between that
completing and the CIB being refreshed?

If DC election completion is accurate, what's the best way to determine
that has completed?

b.






Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-15 Thread Andrew Beekhof

On 16 Jan 2014, at 6:53 am, Brian J. Murrell (brian)  
wrote:

> On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
>> 
>> Consider any long running action, such as starting a database.
>> We do not update the CIB until after actions have completed, so there can 
>> and will be times when the status section is out of date to one degree or 
>> another.
> 
> But that is the opposite of what I am reporting

I know, I was giving you another example of when the cib is not completely 
up-to-date with reality.

> and is acceptable.  It's
> acceptable for a resource that is in the process of starting to be
> reported as stopped, because it's not yet started.

It may very well be partially started.  It's almost certainly not stopped, 
which is what is being reported.

> 
> What I am seeing is resources being reported as stopped when they are in
> fact started/running and have been for a long time.
> 
>> At node startup is another point at which the status could potentially be 
>> behind.
> 
> Right.  Which is the case I am talking about.
> 
>> It sounds to me like you're trying to second guess the cluster, which is a 
>> dangerous path.
> 
> No, not trying to second guess at all.

You're not using the output to decide whether to perform some logic?
Because crm_mon is the more usual command to run right after startup (which 
would give you enough context to know things are still syncing).

>  I'm just trying to ask the
> cluster what the state is and not getting the truth.  I am willing to
> believe whatever state the cluster says it's in as long as what I am
> getting is the truth.
> 
>> What if it's the first node to start up?
> 
> I'd think a timeout comes in to play here.
> 
>> There'd be no fresh copy to arrive in that case.
> 
> I can't say that I know how the CIB works internally/entirely, but I'd
> imagine that when a cluster node starts up it tries to see if there is a
> fresher CIB out there in the cluster.

Nope.

>  Maybe this is part of the
> process of choosing/discovering a DC.

DC election happens at the crmd.  The cib is a dumb repository of name/value 
pairs.
It doesn't even understand new vs. old - only different. 

>  But ultimately if the node is the
> first one up, it will eventually figure that out so that it can nominate
> itself as the DC.  Or it finds out that there is a DC already (and gets
> a fresh CIB from it?).  It's during that window that I propose that
> crm_resource should not be asserting anything and should just admit that
> it does not (yet) know.
> 
>> If it had enough information to know it was out of date, it wouldn't be out 
>> of date.
> 
> But surely it understands if it is in the process of joining a cluster
> or not, and therefore does know enough to know that it doesn't know if
> it's out of date or not.

And if it has a newer config compared to the existing nodes?

>  But that it could be.
> 
>> As above, there are situations when you'd never get an answer.
> 
> I should have added to my proposal "or has determined that there is
> nothing to refresh its CIB from" and that its local copy is
> authoritative for the whole cluster.
> 
> b.
> 
> 
> 
> 





Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-15 Thread Brian J. Murrell (brian)
On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
> 
> Consider any long running action, such as starting a database.
> We do not update the CIB until after actions have completed, so there can and 
> will be times when the status section is out of date to one degree or another.

But that is the opposite of what I am reporting and is acceptable.  It's
acceptable for a resource that is in the process of starting to be
reported as stopped, because it's not yet started.

What I am seeing is resources being reported as stopped when they are in
fact started/running and have been for a long time.

> At node startup is another point at which the status could potentially be 
> behind.

Right.  Which is the case I am talking about.

> It sounds to me like you're trying to second guess the cluster, which is a 
> dangerous path.

No, not trying to second guess at all.  I'm just trying to ask the
cluster what the state is and not getting the truth.  I am willing to
believe whatever state the cluster says it's in as long as what I am
getting is the truth.

> What if it's the first node to start up?

I'd think a timeout comes in to play here.

> There'd be no fresh copy to arrive in that case.

I can't say that I know how the CIB works internally/entirely, but I'd
imagine that when a cluster node starts up it tries to see if there is a
fresher CIB out there in the cluster.  Maybe this is part of the
process of choosing/discovering a DC.  But ultimately if the node is the
first one up, it will eventually figure that out so that it can nominate
itself as the DC.  Or it finds out that there is a DC already (and gets
a fresh CIB from it?).  It's during that window that I propose that
crm_resource should not be asserting anything and should just admit that
it does not (yet) know.

> If it had enough information to know it was out of date, it wouldn't be out 
> of date.

But surely it understands if it is in the process of joining a cluster
or not, and therefore does know enough to know that it doesn't know if
it's out of date or not.  But that it could be.

> As above, there are situations when you'd never get an answer.

I should have added to my proposal "or has determined that there is
nothing to refresh its CIB from" and that its local copy is
authoritative for the whole cluster.

b.






Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-14 Thread Andrew Beekhof

On 14 Jan 2014, at 11:50 pm, Brian J. Murrell (brian)  
wrote:

> On Tue, 2014-01-14 at 16:01 +1100, Andrew Beekhof wrote:
>> 
>>> On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
 
 The local cib hasn't caught up yet by the looks of it.
> 
> I should have asked in my previous message: is this entirely an artifact
> of having just restarted or are there any other times where the local
> CIB can in fact be out of date (and thus crm_resource is inaccurate), if
> even for a brief period of time?  I just want to completely understand
> the nature of this situation.

Consider any long running action, such as starting a database.
We do not update the CIB until after actions have completed, so there can and 
will be times when the status section is out of date to one degree or another.
At node startup is another point at which the status could potentially be 
behind.

It sounds to me like you're trying to second guess the cluster, which is a 
dangerous path.

> 
>> It doesn't know that it doesn't know.
> 
> But it (pacemaker at least) does know that it's just started up, and
> should also know whether it's gotten a fresh copy of the CIB since
> starting up, right?  

What if it's the first node to start up?  There'd be no fresh copy to arrive in 
that case.
Many things are obvious to external observers that are not at all obvious to 
the cluster.

If it had enough information to know it was out of date, it wouldn't be out of 
date.

> I think I'd consider it required behaviour that
> pacemaker not consider itself authoritative enough to provide answers
> like "location" until it has gotten a fresh copy of the CIB.
> 
>> Does it show anything as running?  Any nodes as online?
> 
> 
>> I'd not expect that it stays in that situation for more than a second or 
>> two...
> 
> You are probably right about that.  But unfortunately that second or two
provides a large enough window to provide misinformation.
> 
>> We could add an option to force crm_resource to use the master instance 
>> instead of the local one I guess.
> 
> Or, depending on the answers to above (like can this local-is-not-true
> situation ever manifest itself at times other than "just started")
> perhaps just don't allow crm_resource (or any other tool) to provide
> information from the local CIB until it's been refreshed at least once
> since a startup.

As above, there are situations when you'd never get an answer.

> 
> I would much rather crm_resource experience some latency in being able
> to provide answers than provide wrong ones.  Perhaps there needs to be a
> switch to indicate if it should block waiting for the local CIB to be
> up-to-date or should return immediately with an "unknown" type response
> if the local CIB has not yet been updated since a start.
> 
> Cheers,
> b.
> 
> 
> 
> 





Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-14 Thread Brian J. Murrell (brian)
On Tue, 2014-01-14 at 16:01 +1100, Andrew Beekhof wrote:
> 
> > On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
> >> 
> >> The local cib hasn't caught up yet by the looks of it.

I should have asked in my previous message: is this entirely an artifact
of having just restarted or are there any other times where the local
CIB can in fact be out of date (and thus crm_resource is inaccurate), if
even for a brief period of time?  I just want to completely understand
the nature of this situation.

> It doesn't know that it doesn't know.

But it (pacemaker at least) does know that it's just started up, and
should also know whether it's gotten a fresh copy of the CIB since
starting up, right?  I think I'd consider it required behaviour that
pacemaker not consider itself authoritative enough to provide answers
like "location" until it has gotten a fresh copy of the CIB.

> Does it show anything as running?  Any nodes as online?


> I'd not expect that it stays in that situation for more than a second or 
> two...

You are probably right about that.  But unfortunately that second or two
provides a large enough window to provide misinformation.

> We could add an option to force crm_resource to use the master instance 
> instead of the local one I guess.

Or, depending on the answers to above (like can this local-is-not-true
situation ever manifest itself at times other than "just started")
perhaps just don't allow crm_resource (or any other tool) to provide
information from the local CIB until it's been refreshed at least once
since a startup.

I would much rather crm_resource experience some latency in being able
to provide answers than provide wrong ones.  Perhaps there needs to be a
switch to indicate if it should block waiting for the local CIB to be
up-to-date or should return immediately with an "unknown" type response
if the local CIB has not yet been updated since a start.
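A sketch of what such a blocking wrapper could look like today, from outside the tools (a hypothetical helper of mine, not a crm_resource feature; it leans on the cibadmin -Ql / cibadmin -Q comparison suggested earlier in the thread, and the retry count is arbitrary):

```shell
# safe_locate: block until the local CIB dump matches the master's,
# then ask crm_resource where the resource is.  Comparing full dumps
# is crude (formatting differences would cause false mismatches) but
# illustrates the "wait until current, then answer" idea.
safe_locate() {
    rsc=$1
    tries=0
    while [ "$(cibadmin -Ql 2>/dev/null)" != "$(cibadmin -Q 2>/dev/null)" ]; do
        tries=$((tries + 1))
        if [ "$tries" -ge 30 ]; then
            return 1    # give up after ~30s; caller treats as "unknown"
        fi
        sleep 1
    done
    crm_resource --locate -r "$rsc"
}
```

The timeout branch is what gives the "unknown" response proposed above instead of a possibly wrong answer.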

Cheers,
b.






Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 3:41 pm, Brian J. Murrell (brian)  
wrote:

> On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
>> 
>> The local cib hasn't caught up yet by the looks of it.
> 
> Should crm_resource actually be [mis-]reporting as if it were
> knowledgeable when it's not though?  IOW is this expected behaviour or
> should it be considered a bug?  Should I open a ticket?

It doesn't know that it doesn't know.
Does it show anything as running?  Any nodes as online?

I'd not expect that it stays in that situation for more than a second or two...

> 
>> You could compare 'cibadmin -Ql' with 'cibadmin -Q'
> 
> Is there no other way to force crm_resource to be truthful/accurate or
> silent if it cannot be truthful/accurate?  Having to run this kind of
> pre-check before every crm_resource --locate seems like it's going to
> drive overhead up quite a bit.

True.

> 
> Maybe I am using the wrong tool for the job.  Is there a better tool
> than crm_resource to ascertain, with full truthfulness (or silence if
> truthfulness is not possible), where resources are running?

We could add an option to force crm_resource to use the master instance instead 
of the local one I guess.


signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Brian J. Murrell (brian)
On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
> 
> The local cib hasn't caught up yet by the looks of it.

Should crm_resource actually be [mis-]reporting as if it were
knowledgeable when it's not though?  IOW is this expected behaviour or
should it be considered a bug?  Should I open a ticket?

> You could compare 'cibadmin -Ql' with 'cibadmin -Q'

Is there no other way to force crm_resource to be truthful/accurate or
silent if it cannot be truthful/accurate?  Having to run this kind of
pre-check before every crm_resource --locate seems like it's going to
drive overhead up quite a bit.

Maybe I am using the wrong tool for the job.  Is there a better tool
than crm_resource to ascertain, with full truthfulness (or silence if
truthfulness is not possible), where resources are running?

Cheers,
b.






Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 5:13 am, Brian J. Murrell (brian)  
wrote:

> Hi,
> 
> I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output
> of "crm_resource -L" is not trustable shortly after a node is booted.
> 
> Here is the output from crm_resource -L on one of the nodes in a two
> node cluster (the one that was not rebooted):
> 
> st-fencing(stonith:fence_foo):Started 
> res1  (ocf::foo:Target):  Started 
> res2  (ocf::foo:Target):  Started 
> 
> Here is the output from the same command on the other node in the two
> node cluster right after it was rebooted:
> 
> st-fencing(stonith:fence_foo):Stopped 
> res1  (ocf::foo:Target):  Stopped 
> res2  (ocf::foo:Target):  Stopped 
> 
> These were collected at the same time (within the same second) on the
> two nodes.
> 
> Clearly the rebooted node is not telling the truth.  Perhaps the truth
> for it is "I don't know", which would be fair enough but that's not what
> pacemaker is asserting there.
> 
> So, how do I know (i.e. programmatically -- what command can I issue to
> know) if and when crm_resource can be trusted to be truthful?

The local cib hasn't caught up yet by the looks of it.
You could compare 'cibadmin -Ql' with 'cibadmin -Q'
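One way to script that comparison is to extract the version counters from each copy's <cib> header rather than diffing the whole document; the helper name and the sed expression below are my own assumptions about the header layout:

```shell
# cib_version: read a CIB document on stdin and print "epoch.num_updates"
# taken from the attributes of its <cib ...> header line.
cib_version() {
    sed -n 's/.*epoch="\([0-9]*\)".*num_updates="\([0-9]*\)".*/\1.\2/p' | head -n 1
}

# On a live node, the local copy is current when the two match:
#   [ "$(cibadmin -Ql | cib_version)" = "$(cibadmin -Q | cib_version)" ]
```

This stays cheap because only the counters are compared, though it still costs one query to the master CIB per check.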

> 
> b.
> 
> 
> 
> 





[Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Brian J. Murrell (brian)
Hi,

I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output
of "crm_resource -L" is not trustable shortly after a node is booted.

Here is the output from crm_resource -L on one of the nodes in a two
node cluster (the one that was not rebooted):

 st-fencing (stonith:fence_foo):Started 
 res1   (ocf::foo:Target):  Started 
 res2   (ocf::foo:Target):  Started 

Here is the output from the same command on the other node in the two
node cluster right after it was rebooted:

 st-fencing (stonith:fence_foo):Stopped 
 res1   (ocf::foo:Target):  Stopped 
 res2   (ocf::foo:Target):  Stopped 

These were collected at the same time (within the same second) on the
two nodes.

Clearly the rebooted node is not telling the truth.  Perhaps the truth
for it is "I don't know", which would be fair enough but that's not what
pacemaker is asserting there.

So, how do I know (i.e. programmatically -- what command can I issue to
know) if and when crm_resource can be trusted to be truthful?

b.



