Re: [Linux-HA] sometimes crm_resource -F fails

Andrew Beekhof Wed, 25 Jun 2008 10:34:24 -0700


On Jun 25, 2008, at 5:23 PM, Serge Dubrouski wrote:

On Wed, Jun 25, 2008 at 8:56 AM, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
On Wed, Jun 25, 2008 at 14:57, Serge Dubrouski <[EMAIL PROTECTED]> wrote:
On Wed, Jun 25, 2008 at 6:15 AM, Serge Dubrouski <[EMAIL PROTECTED]> wrote:
On Wed, Jun 25, 2008 at 5:29 AM, Dominik Klein <[EMAIL PROTECTED]telegence.net> wrote:
Junko IKEDA wrote:
Unfortunately, the latest package produced the same results.
pgsql couldn't fail over using crm_resource -F.
I think you perhaps misunderstand what -F does... it is intended to
tell the cluster that the resource failed.
Although it may move as well (depending on how you set up the scores),
this is not the primary goal.
pgsql is configured to move to the other node if it fails.
If crm_resource -F is called, pgsql's fail-count would be increased from
0 to 1, so pgsql should move to the appropriate node.
But pgsql was just stopped, and not moved.
Other resources were still running.
Ah ok, sorry, just wanted to make sure the intended functionality was
clear.
I had a look at the report and analysis.txt highlights the problem quite
well:
pengine[20727]: 2008/06/23_11:02:40 ERROR: unpack_rsc_op: Hard error: prmApPostgreSQLDB_fail_60000 failed with rc=2.
pengine[20727]: 2008/06/23_11:02:40 ERROR: unpack_rsc_op: Preventing prmApPostgreSQLDB from re-starting anywhere in the cluster
It looks like the RA (incorrectly) returned 2 (invalid parameter),
instead of 3 (unimplemented function).
rc=2 tells the cluster that the configuration is invalid and not to
bother starting the resource elsewhere.
Does that mean there might be a problem in the pgsql RA?

Thanks,
Junko
http://hg.linux-ha.org/dev/file/42ce605e3da5/resources/OCF/pgsql

Look at the end of the script.
If it is invoked in any other way, it calls usage, which exits with OCF_ERR_ARGS
(i.e. 2). See how it was called. This should be the reason.
I wonder how this could pass ocf-tester. It does not support any of the
notify operations, nor validate-all, nor meta-data.

Or am I looking at the wrong file?
You are looking at the right file, and I submitted a patch for this
problem a couple of weeks ago.
And here is one more patch that fixes the problem. Also I have a
couple of questions:

1. What is the 'fail' operation supposed to do?
"fail" :-)
That is too broad an explanation :-)

I just wonder what would be the best implementation for the fail action
in the RA. In this "fixed" version pgsql just reports "NOT_IMPLEMENTED",
crm increases fail_count, and if the score still allows the resource to
stay on the current node, nothing else happens.


well it would also be restarted.
otherwise one could just as easily use crm_failcount.
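
To illustrate why a single failure may or may not move the resource, here is a simplified model of the score arithmetic. It assumes the Heartbeat 2.x scheme in which a node's score is its location preference, plus resource stickiness, plus fail-count times resource_failure_stickiness. The function name and all numbers are made up for illustration and are not taken from this thread; the real policy engine considers more than this.

```shell
# Simplified, assumed model (not the policy engine's actual code):
#   score = preference + stickiness + failcount * failure_stickiness
node_score() {
    pref=$1; stick=$2; fails=$3; fail_stick=$4
    echo $(( pref + stick + fails * fail_stick ))
}

# Hypothetical values: preference 100, stickiness 100,
# one recorded failure, failure stickiness -100.
current=$(node_score 100 100 1 -100)
other=50    # hypothetical score of the standby node

if [ "$current" -ge "$other" ]; then
    echo "stays (current=$current, other=$other)"
else
    echo "moves (current=$current, other=$other)"
fi
```

With these made-up numbers one failure leaves the current node ahead, so the resource is restarted in place; a second failure would drop its score below the standby node's.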

I suspect that one would
expect a resource to be moved from the current node when "crm_resource
-F" is called, but I don't know how to implement that correctly at the
RA level.


use crm_failcount to set a value of INFINITY


Maybe the best way would be if the CRM did not just increase the failcount
but set it to a value high enough to fail the resource over to another node?


This is not the purpose of crm_resource -F
If you want a resource to move, use -M


In this case the RA would just stop the resource when it is called with
the "fail" action.


no - it should say "I don't support this action".

and anyway you shouldn't rely on the RA being called at all... this was only a temporary fix and will be going away now that there is an LRM API call that the crm can use instead.
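
As a minimal sketch of the return-code behavior discussed in this thread (this is not the actual pgsql RA), an agent's action dispatcher can distinguish a malformed call from an unsupported action. The return code values are the standard OCF ones; the handler functions and the name ra_dispatch are hypothetical stand-ins.

```shell
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_ARGS=2            # invalid parameter / bad invocation
OCF_ERR_UNIMPLEMENTED=3   # action not supported by this agent

# Hypothetical stand-ins for the agent's real action handlers.
pgsql_start()   { return $OCF_SUCCESS; }
pgsql_stop()    { return $OCF_SUCCESS; }
pgsql_monitor() { return $OCF_SUCCESS; }

usage() {
    echo "usage: $0 {start|stop|monitor}" >&2
}

# Dispatch a single action and return an OCF code.
ra_dispatch() {
    # Wrong number of arguments is treated as an argument error.
    if [ $# -ne 1 ]; then
        usage
        return $OCF_ERR_ARGS
    fi
    case "$1" in
        start)   pgsql_start ;;
        stop)    pgsql_stop ;;
        monitor) pgsql_monitor ;;
        # Unknown actions such as "fail" are reported as unimplemented,
        # so the cluster does not treat the configuration as invalid.
        *)       usage; return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}
```

A real agent would end with a call such as ra_dispatch "$@" followed by exit $?. The point of the sketch is only that an unknown action maps to 3 rather than 2, so the cluster does not conclude that the resource is misconfigured everywhere.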

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
