Re: [ClusterLabs] Approach to validate on stop op (Was Re: crmsh configure delete for constraints)

2016-03-29 Thread Vladislav Bogdanov

29.03.2016 15:28, Vladislav Bogdanov wrote:
[...]

    *) # monitor | notify | reload | etc
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            if ocf_is_probe ; then
                exit $OCF_NOT_RUNNING
            fi
            exit $?

Of course it is exit ${ret}

        fi
        ;;
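
The reason the correction matters can be shown in isolation. When the
inner 'if ocf_is_probe' runs no branch, the 'if' itself completes with
status 0, so $? no longer holds the result of validate. The sketch
below is only an illustration: validate and ocf_is_probe are
hypothetical stand-ins, and 'return' replaces 'exit' so both variants
can run in one shell.

```sh
# Hypothetical stand-ins; a real RA gets ocf_is_probe from ocf-shellfuncs.
validate() { return 5; }        # pretend validation failed (5 = OCF_ERR_INSTALLED)
ocf_is_probe() { return 1; }    # pretend this is a regular monitor, not a probe

buggy() {
    validate
    ret=$?
    if [ ${ret} -ne 0 ] ; then
        if ocf_is_probe ; then
            return 7            # 7 = OCF_NOT_RUNNING
        fi
        return $?               # status of the 'if' above: 0, since no branch ran
    fi
}

fixed() {
    validate
    ret=$?
    if [ ${ret} -ne 0 ] ; then
        if ocf_is_probe ; then
            return 7
        fi
        return ${ret}           # the saved validation status
    fi
}

buggy_rc=0; buggy || buggy_rc=$?
fixed_rc=0; fixed || fixed_rc=$?
echo "buggy=${buggy_rc} fixed=${fixed_rc}"    # prints: buggy=0 fixed=5
```

With 'exit $?' a failed validation on a regular monitor would thus be
reported as success.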





___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Approach to validate on stop op (Was Re: crmsh configure delete for constraints)

2016-03-29 Thread Vladislav Bogdanov

10.02.2016 12:31, Vladislav Bogdanov wrote:

10.02.2016 11:38, Ulrich Windl wrote:

Vladislav Bogdanov wrote on 10.02.2016 at 05:39 in
message <6e479808-6362-4932-b2c6-348c7efc4...@hoster-ok.com>:

[...]

Well, I'd reword: generally, an RA should not exit with an error if
validation fails on stop.
Is that better?

[...]

As we have different error codes, what type of error?


Any that makes Pacemaker think the resource stop op failed,
OCF_ERR_* in particular.

If Pacemaker gets an error on start, it will run stop with the same
set of parameters anyway. It will then get an error again if the first
one came from validation and the RA does not differentiate between
validation for start and for stop. And then circular fencing over the
whole cluster is triggered for no reason.
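
As a rough illustration of that failure mode, here is a minimal
sketch of an RA that gates every action on the same validation. The
names validate and naive_ra are hypothetical, and 5 stands for
OCF_ERR_INSTALLED.

```sh
# Hypothetical stand-in: validation fails, e.g. because a binary is missing.
validate() { return 5; }        # 5 = OCF_ERR_INSTALLED

# Naive dispatch: start and stop share one unconditional validation step.
naive_ra() {
    validate || return $?
    case "$1" in
        start|stop) return 0 ;;
    esac
}

start_rc=0; naive_ra start || start_rc=$?
stop_rc=0;  naive_ra stop  || stop_rc=$?
echo "start=${start_rc} stop=${stop_rc}"    # prints: start=5 stop=5
```

The stop that follows the failed start fails for exactly the same
reason, which is what escalates to fencing.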

Of course, for safety, the RA could save its state if start was
successful and skip validation on stop only if that state is not found.
Otherwise a removed binary or config file would result in the resource
running on several nodes.
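
One possible reading of this idea, as a sketch only: a marker file is
written on successful start, and a validation failure on stop is
forgiven only when that marker is absent. All names here (STATE_DIR,
STARTED_FLAG, VALID, ra_start, ra_stop) are hypothetical; STATE_DIR
stands in for a directory such as ${HA_RSCTMP}.

```sh
STATE_DIR=$(mktemp -d)
STARTED_FLAG="${STATE_DIR}/my_resource.started"
VALID=yes                       # toggled below to simulate a broken environment
validate() { [ "${VALID}" = yes ] || return 5; }    # 5 = OCF_ERR_INSTALLED

ra_start() {
    validate || return $?
    touch "${STARTED_FLAG}"     # remember that this node really started the resource
}

ra_stop() {
    rc=0; validate || rc=$?
    if [ ${rc} -ne 0 ] && [ ! -f "${STARTED_FLAG}" ] ; then
        return 0                # never started here, so there is nothing to stop
    fi
    [ ${rc} -eq 0 ] || return ${rc}    # started here, cannot validate: real failure
    rm -f "${STARTED_FLAG}"
}

VALID=no
a=0; ra_start || a=$?           # validation fails, nothing recorded
b=0; ra_stop  || b=$?           # forgiven: no start marker
VALID=yes
c=0; ra_start || c=$?           # succeeds, marker written
VALID=no
d=0; ra_stop  || d=$?           # environment broke after start: error
echo "${a} ${b} ${c} ${d}"      # prints: 5 0 0 5
rm -rf "${STATE_DIR}"
```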

Well, this all seems too complicated to turn into a general
algorithm ;)


Well, after some thinking, I've got an approach which sounds both
elegant and safe enough to me and my colleagues. Please look at the
following excerpt (part of a hypothetical RA, placed before the main
'case'):


-
VALIDATION_FAILURE_FLAG="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.invalid"

case "${__OCF_ACTION}" in
    meta-data)
        meta_data
        exit $OCF_SUCCESS
        ;;
    usage|help)
        usage
        exit $OCF_SUCCESS
        ;;
    start)
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            # Remember that start never got past validation
            touch "${VALIDATION_FAILURE_FLAG}"
            exit ${ret}
        fi
        ;;
    stop)
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            if [ -f "${VALIDATION_FAILURE_FLAG}" ] ; then
                # Start failed validation too, so nothing is running here
                rm -f "${VALIDATION_FAILURE_FLAG}"
                exit $OCF_SUCCESS
            else
                exit ${ret}
            fi
        fi
        ;;
    *) # monitor | notify | reload | etc
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            if ocf_is_probe ; then
                exit $OCF_NOT_RUNNING
            fi
            # not $?, which is 0 here after the 'if' above
            exit ${ret}
        fi
        ;;
esac
-

The above assumes that the validation function does not call exit
(and thus uses have_binary instead of check_binary, etc.) but returns
an error code instead.
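
A validation function of that shape might look roughly like the
sketch below. The parameters OCF_RESKEY_binary and OCF_RESKEY_config
belong to an imaginary resource and are assumptions, as are the
fallback definitions, which exist only so the sketch runs outside a
cluster node; a real RA would source ocf-shellfuncs for have_binary
and the OCF_* codes.

```sh
# Fallbacks for running outside a cluster node (assumption, see above).
: "${OCF_SUCCESS:=0}" "${OCF_ERR_INSTALLED:=5}" "${OCF_ERR_CONFIGURED:=6}"
command -v have_binary >/dev/null 2>&1 ||
    have_binary() { command -v "$1" >/dev/null 2>&1; }

# Hypothetical parameters of an imaginary resource.
OCF_RESKEY_binary=sh
OCF_RESKEY_config=$(mktemp)

# Returns an OCF code instead of exiting, so the caller decides what to do.
validate() {
    have_binary "${OCF_RESKEY_binary}" || return $OCF_ERR_INSTALLED
    [ -r "${OCF_RESKEY_config}" ] || return $OCF_ERR_CONFIGURED
    return $OCF_SUCCESS
}

ok_rc=0; validate || ok_rc=$?
OCF_RESKEY_binary=no-such-binary-for-this-sketch
bad_rc=0; validate || bad_rc=$?
echo "ok=${ok_rc} missing_binary=${bad_rc}"
rm -f "${OCF_RESKEY_config}"
```

With check_binary in place of have_binary, the second call would
terminate the script and the stop branch would never get to inspect
the flag.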


The main difference from the current ocf_rarun implementation is that
changes to the machine environment (deleted binaries, configs, etc.)
still result in a stop failure (and thus fencing) if those changes were
made after a successful validation on resource start.
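
The two cases can be walked through with a small harness. This is a
sketch under stated assumptions, not the RA itself: FLAG_DIR stands
in for ${HA_RSCTMP}, "demo" for ${OCF_RESOURCE_INSTANCE}, VALID and
pre_check are hypothetical names, and literal numbers replace the
OCF_* codes (0 = OCF_SUCCESS, 5 = OCF_ERR_INSTALLED).

```sh
FLAG_DIR=$(mktemp -d)
VALIDATION_FAILURE_FLAG="${FLAG_DIR}/demo.invalid"
VALID=yes                       # toggled below to simulate the environment
validate() { [ "${VALID}" = yes ] || return 5; }

# start/stop logic of the excerpt; the body is a subshell, so 'exit'
# ends only the simulated action.
pre_check() (
    case "$1" in
    start)
        validate
        ret=$?
        if [ ${ret} -ne 0 ] ; then
            touch "${VALIDATION_FAILURE_FLAG}"
            exit ${ret}
        fi
        ;;
    stop)
        validate
        ret=$?
        if [ ${ret} -ne 0 ] ; then
            if [ -f "${VALIDATION_FAILURE_FLAG}" ] ; then
                rm -f "${VALIDATION_FAILURE_FLAG}"
                exit 0
            else
                exit ${ret}
            fi
        fi
        ;;
    esac
)

# Case 1: environment is broken before start.
VALID=no
s1=0; pre_check start || s1=$?
t1=0; pre_check stop  || t1=$?
# Case 2: environment breaks after a successful start.
VALID=yes
s2=0; pre_check start || s2=$?
VALID=no
t2=0; pre_check stop  || t2=$?
echo "${s1} ${t1} ${s2} ${t2}"  # prints: 5 0 0 5
rm -rf "${FLAG_DIR}"
```

In case 1 the stop is reported as successful and no fencing follows;
in case 2 the stop still fails, as described above.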


I plan to test this approach extensively in my RAs shortly.

Comments are welcome.

Best,
Vladislav







Regards,
Ulrich


