On 05/20/2016 06:20 AM, Felix Zachlod (Lists) wrote:
> version 1.1.13-10.el7_2.2-44eb2dd
>
> Hello!
>
> I am currently developing a master/slave resource agent. So far it is working
> just fine, but this resource agent implements reload() and this does not work
> as expected when running as Master:
> The reload action is invoked and it succeeds returning 0. The resource is
> still Master and monitor will return $OCF_RUNNING_MASTER.
>
> But Pacemaker considers the instance to be slave afterwards. Actually only
> reload is invoked: no monitor, no demote, etc.
>
> I first thought that reload should possibly return $OCF_RUNNING_MASTER too
> but this leads to the resource failing on reload. It seems 0 is the only
> valid return code.
>
> I can recover the cluster state running resource $resourcename promote, which
> will call
>
> notify
> promote
> notify
>
> Afterwards my resource is considered Master again. After the PEngine Recheck
> Timer (I_PE_CALC) pops (90ms), the cluster manager will promote
> the resource itself.
> But this can lead to unexpected results: it could promote the resource on the
> wrong node, so that both sides are actually running as master, and the cluster
> will not even notice, since it does not call monitor either.
>
> Is this a bug?
>
> regards, Felix
I think it depends on your point of view :)
Reload is implemented as an alternative to stop-then-start. For m/s
clones, start leaves the resource in slave state.
So on the one hand, it makes sense that Pacemaker would expect a m/s
reload to end up in slave state, regardless of the initial state, since
it should be equivalent to stop-then-start.
On the other hand, you could argue that a reload for a master should
logically be an alternative to demote-stop-start-promote.
On the third hand ;) you could argue that reload is ambiguous for master
resources and thus shouldn't be supported at all.
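To make the ambiguity concrete, here is a minimal sketch of the relevant
actions in a master/slave RA (all names are hypothetical, and the exit codes
are defined inline only to keep the example self-contained; a real agent
sources them from ocf-shellfuncs and checks the actual service state):

```shell
#!/bin/sh
# Hypothetical excerpt from a master/slave OCF resource agent.
# These codes normally come from ocf-shellfuncs; defined here so the
# example stands alone.
OCF_SUCCESS=0
OCF_RUNNING_MASTER=8

# Stand-in for the real service state a monitor would probe.
MYRA_STATE=slave

myra_monitor() {
    # monitor can distinguish master from slave via its return code...
    if [ "$MYRA_STATE" = master ]; then
        return $OCF_RUNNING_MASTER
    fi
    return $OCF_SUCCESS
}

myra_reload() {
    # ...but reload may only return 0 on success, so even if the
    # instance is still master after the reload, the agent has no
    # way to say so.
    return $OCF_SUCCESS
}
```

This is why returning $OCF_RUNNING_MASTER from reload fails, as described
above: 0 is the only success code Pacemaker accepts for that action.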
Feel free to open a feature request at http://bugs.clusterlabs.org/ to
say how you think it should work.
As an aside, I think the current implementation of reload in pacemaker
is unsatisfactory for two reasons:
* Using the "unique" attribute to determine whether a parameter is
reloadable was a bad idea. For example, the location of a daemon binary
is generally set to unique=0, which is sensible in that multiple RA
instances can use the same binary, but a reload cannot actually apply a
change to that parameter. This only avoids causing trouble in practice
because no one ever changes it.
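A hypothetical metadata fragment (agent and parameter names made up for the
example) showing the mismatch: "binary" is correctly marked unique="0"
because many instances can share one daemon binary, yet under the
unique-based scheme Pacemaker would therefore treat a change to it as
reloadable, which a reload cannot honor:

```shell
#!/bin/sh
# Hypothetical meta_data action from an OCF resource agent.
meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<resource-agent name="myra">
  <parameters>
    <!-- unique="0" is correct here (instances may share the binary),
         but it says nothing about whether a change is reloadable. -->
    <parameter name="binary" unique="0" required="0">
      <longdesc lang="en">Path to the daemon binary</longdesc>
      <shortdesc lang="en">daemon path</shortdesc>
      <content type="string" default="/usr/sbin/mydaemon"/>
    </parameter>
  </parameters>
</resource-agent>
EOF
}
```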
* There is a fundamental misunderstanding between pacemaker and most RA
developers as to what reload means. Pacemaker uses the reload action to
make parameter changes in the resource's *pacemaker* configuration take
effect, but RA developers tend to use it to reload the service's own
configuration files (a more natural interpretation, but completely
different from how pacemaker uses it).
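A sketch of the two interpretations side by side (all names hypothetical;
this is not code from any real agent):

```shell
#!/bin/sh
# What pacemaker means by "reload": make changed *instance parameters*
# from the CIB (the OCF_RESKEY_* environment variables) take effect
# without a full stop/start.
reload_pacemaker_params() {
    # e.g. pick up a changed OCF_RESKEY_loglevel and apply it
    echo "applying loglevel=${OCF_RESKEY_loglevel:-info}"
}

# What most RA authors implement instead: ask the managed daemon to
# re-read *its own* configuration file, typically via SIGHUP.
reload_service_config() {
    pidfile="$1"
    if [ -f "$pidfile" ]; then
        kill -HUP "$(cat "$pidfile")"
    fi
}
```

The first reacts to a change in the cluster configuration; the second to a
change in a file Pacemaker knows nothing about, which is why the two
meanings diverge.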
> trace May 20 12:58:31 cib_create_op(609):0: Sending call options: 0010,
> 1048576
> trace May 20 12:58:31 cib_native_perform_op_delegate(384):0: Sending
> cib_modify message to CIB service (timeout=120s)
> trace May 20 12:58:31 crm_ipc_send(1175):0: Sending from client: cib_shm
> request id: 745 bytes: 1070 timeout:12 msg...
> trace May 20 12:58:31 crm_ipc_send(1188):0: Message sent, not waiting for
> reply to 745 from cib_shm to 1070 bytes...
> trace May 20 12:58:31 cib_native_perform_op_delegate(395):0: Reply: No data
> to dump as XML
> trace May 20 12:58:31 cib_native_perform_op_delegate(398):0: Async call,
> returning 268
> trace May 20 12:58:31 do_update_resource(2274):0: Sent resource state
> update message: 268 for reload=0 on scst_dg_ssd
> trace May 20 12:58:31 cib_client_register_callback_full(606):0: Adding
> callback cib_rsc_callback for call 268
> trace May 20 12:58:31 process_lrm_event(2374):0: Op scst_dg_ssd_reload_0
> (call=449, stop-id=scst_dg_ssd:449, remaining=3): Confirmed
> notice May 20 12:58:31 process_lrm_event(2392):0: Operation
> scst_dg_ssd_reload_0: ok (node=alpha, call=449, rc=0, cib-update=268,
> confirmed=true)
> debug May 20 12:58:31 update_history_cache(196):0: Updating history for
> 'scst_dg_ssd' with reload op
> trace May 20 12:58:31 crm_ipc_read(992):0: No message from lrmd received:
> Resource temporarily unavailable
> trace May 20 12:58:31 mainloop_gio_callback(654):0: Message acquisition
> from lrmd[0x22b0ec0] failed: No message of desired type (-42)
> trace May 20 12:58:31 crm_fsa_trigger(293):0: Invoked (queue len: 0)
> trace May 20 12:58:31 s_crmd_fsa(159):0: FSA invoked with Cause:
> C_FSA_INTERNAL State: S_NOT_DC
> trace May 20 12:58:31 s_crmd_fsa(246):0: Exiting the FSA
> trace May 20 12:58:31 crm_fsa_trigger(295):0: Exited (queue len: 0)
> trace May 20 12:58:31 crm_ipc_read(989):0: Received cib_shm event 2108,
> size=183, rc=183, text: cib_callid="268" cib_clientid="60010689-7350-4916-a7bd-bd85ff
> trace May 20 12:58:31 mainloop_gio_callback(659):0: New message from
> cib_shm[0x23b7ab0]