On Wed, Nov 10, 2010 at 01:53:58PM +0800, Chen, Yanfei (NSN - CN/Cheng Du) wrote:
> Hi
>
> We use redhat cluster + drbd architecture. Oracle use resource
> res_drbd_oracle
> The drbd version is 8.3.2 cluster version 2.0.46
>
> We get the below error:
>
> Nov 4 11:43:49 clnode1 xinetd[4432]: EXIT: http status=0 pid=12414
> duration=0(sec)
> Nov 6 15:25:30 clnode1 clurgmgrd[4691]: <notice> status on drbd
> "res_drbd_oracle" returned 20 (unspecified)
> Nov 6 15:25:30 clnode1 clurgmgrd[4691]: <notice> Stopping service
> service:Oracle
> Nov 6 15:25:47 clnode1 kernel: block drbd0: role( Primary -> Secondary
> )
>
>
> The redhat cluster call the function drbd_status in drbd.sh to moniter
> status, which is from drbd
>
> drbd_status() {
>         role=$(drbdadm role $OCF_RESKEY_resource)
>         case $role in
>         Primary/*)
>                 return $OCF_RUNNING
>                 ;;
>         Secondary/*)
>                 return $OCF_NOT_RUNNING
>                 ;;
>
>         esac
>         return $OCF_ERR_GENERIC
> }

If that is indeed the script that is used, exit code 20 is "impossible":
the exit code will be either $OCF_ERR_GENERIC (which is 1),
$OCF_NOT_RUNNING (which is 7), or $OCF_RUNNING (which is ... wait... WTF!)

OCF_RUNNING is non-existent.
As it is empty, it expands to nothing, the statement becomes a bare
"return", and return without an argument is equivalent to "return $?",
so it returns the exit status of the last command, which was
"drbdadm role".

Usually, if drbdadm role is able to determine the role, it exits 0;
and if role was assigned Primary/..., drbdadm clearly was able to
determine the role. So this usually just worked "by accident".
Still, it should have been $OCF_SUCCESS there.

Why drbdadm role would have an exit code of 20 while still printing
Primary to stdout is beyond me for now.
But that is the only way I can see that the above shell code would
return 20.

Unless, of course, the other $OCF_* are not defined either,
in which case "return $OCF_ERR_GENERIC" would have been a bare return,
and thus equivalent to "return $?" as well.
If that was the case, it would be a better fit:
if drbdadm could not determine the role for whatever reason,
role will be empty, and drbdadm probably exits with 20.

But OCF_ERR_GENERIC being empty would mean that ocf-shellfuncs could not
be sourced, which I find a bit unlikely.

Please try this, and try to reproduce.
Once you have a reproducer, it will be easy to fix.

--- a/drbd.sh
+++ b/drbd.sh
@@ -68,7 +68,7 @@ drbd_status() {
         role=$(drbdadm role $OCF_RESKEY_resource)
         case $role in
         Primary/*)
-                return $OCF_RUNNING
+                return $OCF_SUCCESS
                 ;;
         Secondary/*)
                 return $OCF_NOT_RUNNING

> This problem happened two times and lead oracle service restarted.
> Appricated you help us to understand what's the error 20 meaning? How
> could it happen?

Do you have any further logs, kernel or other, from the time period in
question? Or sysstat like info about the general workload at that time?
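
For anyone who wants to see the empty-variable effect without a cluster
at hand, here is a minimal sketch. It assumes nothing about drbdadm
itself: prints_role_exits_20 and status_demo are made-up names, and the
first one merely stands in for a command that prints a role yet exits
non-zero.

```shell
#!/bin/sh
# Hypothetical stand-in for a command that prints a role but exits 20.
# This is NOT drbdadm; it only mimics the suspected behaviour.
prints_role_exits_20() {
    echo "Primary/Secondary"
    return 20
}

# Same shape as drbd_status, with the stand-in command substituted.
status_demo() {
    role=$(prints_role_exits_20)    # $? is now 20
    case $role in
    Primary/*)
        # If OCF_RUNNING is unset this is a bare "return",
        # which hands back the previous exit status.
        return $OCF_RUNNING
        ;;
    esac
    return 1
}

unset OCF_RUNNING
rc=0; status_demo || rc=$?
echo "OCF_RUNNING unset: status_demo returned $rc"

OCF_RUNNING=0    # a defined value, as \$OCF_SUCCESS would be
rc=0; status_demo || rc=$?
echo "OCF_RUNNING=0:     status_demo returned $rc"
```

With the variable unset the function should hand back 20, the status of
the command substitution; with a defined value it returns that value
regardless of what the command exited with.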

Was the system particularly busy at the times when this happened?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user