On Jun 25, 2008, at 5:23 PM, Serge Dubrouski wrote:
On Wed, Jun 25, 2008 at 8:56 AM, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
On Wed, Jun 25, 2008 at 14:57, Serge Dubrouski <[EMAIL PROTECTED]> wrote:
On Wed, Jun 25, 2008 at 6:15 AM, Serge Dubrouski <[EMAIL PROTECTED]> wrote:
On Wed, Jun 25, 2008 at 5:29 AM, Dominik Klein <[EMAIL PROTECTED]> wrote:
Junko IKEDA wrote:
Unfortunately, the latest package produced the same results.
pgsql couldn't fail over using crm_resource -F.
I think you perhaps misunderstand what -F does... it is intended to
tell the cluster that the resource failed.
Although it may move as well (depending on how you set up the scores),
this is not the primary goal.
pgsql is configured to move to the other node if it fails.
If crm_resource -F is called, pgsql's fail-count would be increased from 0 to 1,
so pgsql should move to the appropriate node.
But pgsql was just stopped, and not moved.
Other resources were still running.
Ah ok, sorry, just wanted to make sure the intended functionality was clear.
I had a look at the report, and analysis.txt highlights the problem quite well:
pengine[20727]: 2008/06/23_11:02:40 ERROR: unpack_rsc_op: Hard error: prmApPostgreSQLDB_fail_60000 failed with rc=2.
pengine[20727]: 2008/06/23_11:02:40 ERROR: unpack_rsc_op: Preventing prmApPostgreSQLDB from re-starting anywhere in the cluster
It looks like the RA (incorrectly) returned 2 (invalid parameter),
instead of 3 (unimplemented function).
rc=2 tells the cluster that the configuration is invalid and not to
bother starting the resource elsewhere.
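For reference, the numeric codes discussed above map onto the standard OCF exit-code names. The helper below is only a minimal sketch: the function name ocf_rc_name is hypothetical and is not part of the pgsql RA or of any cluster tool.

```shell
#!/bin/sh
# ocf_rc_name: hypothetical helper that translates a numeric OCF exit code
# into its conventional symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;            # what the RA returned for "fail"
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;   # what it should have returned
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Show the two codes at issue in this thread.
ocf_rc_name 2
ocf_rc_name 3
```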
So that means there might be a problem in the pgsql RA?
Thanks,
Junko
http://hg.linux-ha.org/dev/file/42ce605e3da5/resources/OCF/pgsql
Look at the end of the script.
If it is invoked with any action it does not handle, it calls usage, which
exits with OCF_ERR_ARGS (i.e. 2). See how it was called. This should be the reason.
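The dispatch problem described above can be sketched roughly as follows. This is not the actual pgsql RA code and not the submitted patch: the function name ra_dispatch and the action list are hypothetical, and the real actions are replaced by an echo. The only point is that an unrecognised action such as "fail" falls through to OCF_ERR_UNIMPLEMENTED (3) rather than OCF_ERR_ARGS (2).

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_ARGS=2
OCF_ERR_UNIMPLEMENTED=3

usage() {
    echo "usage: $0 {start|stop|status|monitor}" >&2
}

# ra_dispatch: hypothetical stand-in for the case statement at the end of an RA.
ra_dispatch() {
    # A missing or extra argument really is an invocation error.
    if [ $# -ne 1 ]; then
        usage
        return $OCF_ERR_ARGS
    fi
    case "$1" in
        start|stop|status|monitor)
            # Placeholder for the real action handlers.
            echo "would run action: $1"
            return $OCF_SUCCESS
            ;;
        *)
            # Unknown action (for example "fail"): report it as unimplemented
            # so the cluster does not treat the configuration as invalid.
            usage
            return $OCF_ERR_UNIMPLEMENTED
            ;;
    esac
}
```

Calling ra_dispatch fail would then return 3, while calling it with no argument still returns 2.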
I wonder how this could pass ocf-tester. It does not support any of the
notify operations, nor validate-all, nor meta-data.
Or am I looking at the wrong file?
You are looking at the right file, and I submitted a patch for this
problem a couple of weeks ago.
And here is one more patch that fixes the problem. Also I have a
couple of questions:
1. What is the 'fail' operation supposed to do?
"fail" :-)
That is too broad an explanation :-)
I just wonder what the best implementation of the fail action in an RA
would be. In this "fixed" version pgsql just reports "NOT_IMPLEMENTED",
the CRM increases the fail-count, and if the score still allows the resource
to stay on the current node, nothing else happens.
well it would also be restarted.
otherwise one could just as easily use crm_failcount.
I suspect that one would expect a resource to be moved from the current
node when "crm_resource -F" is called, but I don't know how to correctly
implement that at the RA level.
use crm_failcount to set a value of INFINITY
Maybe the best way would be if the CRM did not just increase the fail-count
but set it to a value high enough to fail the resource over to another node?
This is not the purpose of crm_resource -F
If you want a resource to move, use -M
In this case the RA would just stop the resource when it's called with
the "fail" action.
no - it should say "i dont support this action".
and anyway you shouldn't rely on the RA being called at all... this
was only a temporary fix and will be going away now that there is an
LRM API call that the crm can use instead.
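The three approaches mentioned in this thread can be lined up side by side. The sketch below only prints the commands, so it is safe to run on a machine without a cluster. The resource id comes from the log excerpt earlier in the thread; the node name node1 and the function name print_cluster_commands are hypothetical. The -r, -H, -U and -v options are written from memory of the Heartbeat 2.1-era tools and should be treated as an assumption; check crm_resource --help and crm_failcount --help on your version.

```shell
#!/bin/sh
RSC=prmApPostgreSQLDB   # resource id taken from the log excerpt in this thread
NODE=node1              # hypothetical node name

# print_cluster_commands: hypothetical helper that echoes the commands
# instead of executing them.
print_cluster_commands() {
    # Tell the cluster the resource has failed on NODE (it may or may not move).
    echo "crm_resource -F -r $RSC -H $NODE"
    # Explicitly move the resource away from its current node.
    echo "crm_resource -M -r $RSC"
    # Force a failover through the fail-count by setting it to INFINITY.
    echo "crm_failcount -r $RSC -U $NODE -v INFINITY"
}

print_cluster_commands
```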
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems