On Wed, Nov 10, 2010 at 01:53:58PM +0800, Chen, Yanfei (NSN - CN/Cheng Du) 
wrote:
> Hi
> 
> We use redhat cluster + drbd architecture. Oracle use resource
> res_drbd_oracle
> The drbd version is 8.3.2   cluster version 2.0.46
> 
> We get the below error:
> 
> Nov  4 11:43:49 clnode1 xinetd[4432]: EXIT: http status=0 pid=12414 duration=0(sec)
> Nov  6 15:25:30 clnode1 clurgmgrd[4691]: <notice> status on drbd "res_drbd_oracle" returned 20 (unspecified)
> Nov  6 15:25:30 clnode1 clurgmgrd[4691]: <notice> Stopping service service:Oracle
> Nov  6 15:25:47 clnode1 kernel: block drbd0: role( Primary -> Secondary )
> 
> 
> The Red Hat cluster calls the function drbd_status in drbd.sh, which
> comes from drbd, to monitor status:
> 
> drbd_status() {
>     role=$(drbdadm role $OCF_RESKEY_resource)
>     case $role in
>         Primary/*)
>             return $OCF_RUNNING
>             ;;
>         Secondary/*)
>             return $OCF_NOT_RUNNING
>             ;;
> 
>     esac
>     return $OCF_ERR_GENERIC
> }

If that is indeed the script in use,
exit code 20 is "impossible":
the exit code will be either $OCF_ERR_GENERIC (which is 1),
$OCF_NOT_RUNNING (which is 7), or
$OCF_RUNNING (which is ... wait... WTF!)

OCF_RUNNING does not exist. As it is unset, it expands to nothing,
the statement becomes a bare "return", and return without an argument
is equivalent to "return $?", so it returns the exit status of the
last command, which was "drbdadm role".
Usually, if drbdadm role is able to determine the role, it exits 0;
and if role was assigned Primary/..., drbdadm clearly was able to
determine the role. So this usually just worked "by accident".
Still, it should have been $OCF_SUCCESS there.
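
To see the effect in isolation, here is a minimal sketch. fake_role and
status_demo are made-up names; fake_role only stands in for a
"drbdadm role" that prints a role but exits non-zero, which is an
assumption for illustration, not something I have seen drbdadm do.

```shell
#!/bin/sh
# Hypothetical stand-in for "drbdadm role": prints a role, exits with 20.
fake_role() {
    echo "Primary/Secondary"
    return 20
}

status_demo() {
    role=$(fake_role)       # $? is now 20, the status of fake_role
    case $role in
        Primary/*)
            # OCF_RUNNING is unset, so this expands to a bare "return",
            # which hands back $? of the last command executed.
            return $OCF_RUNNING
            ;;
    esac
    return 1
}

unset OCF_RUNNING
status_demo
echo "status_demo returned $?"
```

This should report 20, not 0 and not 1: the status comes from the
command substitution, not from anything in the case branch.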

Why drbdadm role would exit with 20
while still printing Primary to stdout is beyond me for now.

But that is the only way I can see that the above shell code would return 20.

Unless, of course, the other $OCF_* variables are not defined either, in
which case "return $OCF_ERR_GENERIC" would also be a bare return, and thus
equivalent to "return $?" as well.  If that were the case, though, it would
be a better fit: if drbdadm could not determine the role for whatever
reason, role would be empty, and drbdadm probably exits with 20.
But OCF_ERR_GENERIC being empty would mean that ocf-shellfuncs could not
be sourced, which I find a bit unlikely.
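
A quick way to rule that out would be something like the sketch below.
check_ocf_vars is a made-up helper name, not part of drbd.sh or
ocf-shellfuncs; it only reports which of the three variables are unset
or empty in the current shell.

```shell
#!/bin/sh
# Hypothetical helper: print the names of OCF return-code variables
# that are unset or empty (prints nothing if all three are set).
check_ocf_vars() {
    missing=""
    for name in OCF_SUCCESS OCF_ERR_GENERIC OCF_NOT_RUNNING; do
        # Indirect lookup of the variable called $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    echo "${missing# }"
}
```

Call it from drbd.sh after the point where ocf-shellfuncs is sourced;
any name it prints would turn the matching "return $OCF_..." into a
bare return.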


Please try this patch, and try to reproduce.
Once you have a reproducer, it will be easy to fix.

--- a/drbd.sh
+++ b/drbd.sh
@@ -68,7 +68,7 @@ drbd_status() {
     role=$(drbdadm role $OCF_RESKEY_resource)
     case $role in
        Primary/*)
-           return $OCF_RUNNING
+           return $OCF_SUCCESS
            ;;
        Secondary/*)
            return $OCF_NOT_RUNNING

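While hunting for a reproducer, it may also help to record what drbdadm
actually returned. The following is only a sketch, not the shipped
drbd.sh: drbd_status_debug and the "drbd.sh" log tag are made-up names,
and the numeric fallbacks (0, 7, 1) are the usual OCF values, used here
only if the variables turn out to be empty.

```shell
#!/bin/sh
# Hypothetical debugging variant of drbd_status: keeps the exit status
# of "drbdadm role" and logs it whenever it is non-zero.
drbd_status_debug() {
    role=$(drbdadm role "$OCF_RESKEY_resource")
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # Goes to syslog, next to the clurgmgrd messages.
        logger -t drbd.sh "drbdadm role $OCF_RESKEY_resource: exit=$rc role='$role'"
    fi
    case $role in
        Primary/*)
            return "${OCF_SUCCESS:-0}"
            ;;
        Secondary/*)
            return "${OCF_NOT_RUNNING:-7}"
            ;;
    esac
    return "${OCF_ERR_GENERIC:-1}"
}
```

With that in place, the next occurrence should leave a syslog line
showing both the exit code and whatever was printed as the role.
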

> This problem happened twice and caused the Oracle service to restart.
> We would appreciate your help in understanding what error 20 means. How
> could it happen?

Do you have any further logs, kernel or other, from the time period in question?
Or sysstat like info about the general workload at that time?
Was the system particularly busy at the times when this happened?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
