I had another "incorrect" failover this morning. The secondary server was responding and processing user requests fine (as far as customers and monitoring reported) but there is a nearly 2 minute gap in the sql log-where of course no updates are made to the servgrp_board table. According to what you say here, that would cause the failover alright...but what would cause the "hang-up" like this? There are actually no logs at all for this 2 minute time period....That is obviously a different issue.
BUT the other intervals it appears to be finding delinquency, I don't see a gap like that. All of the updates take place (see example below). Does it matter what order they happen in? What's the logic of figuring out delinquency? >From SQL log on ncc392: 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9944 */SELECT checkInterval FROM servgrp_config FOR UPDATE 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9956 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9957 */UPDATE servgrp_board SET intervalCount = intervalCount 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9964 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9964 */SELECT serverName,serverPort,intervalCount,pendingSigna 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9976 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9981 */COMMIT WORK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0016 */SELECT checkInterval FROM servgrp_config FOR UPDATE 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0248 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0249 */UPDATE servgrp_board SET intervalCount = intervalCount 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0255 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0256 */SELECT serverName,serverPort,intervalCount,pendingSigna 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0267 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0272 */COMMIT WORK But in the server group log on ncc391: <SGRP> /* Wed Jun 22 2011 09:59:31.0224 */ Server ncc392.its.state.nc.us determined to be delinquent with 4 operation(s) owned <SGRP> /* Wed Jun 22 2011 10:00:01.0659 */ Server ncc392.its.state.nc.us determined to be operational In the sql log on ncc391: Wed Jun 22 2011 09:59:00.9937 */SELECT checkInterval FROM servgrp_config FOR UPDATE Wed Jun 22 2011 09:59:01.0138 */OK Wed Jun 22 2011 09:59:01.0139 */UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName = <servername> Wed Jun 22 2011 09:59:01.0145 */OK Wed Jun 22 2011 09:59:01.0145 */SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 Wed Jun 22 2011 09:59:01.0150 */OK Wed Jun 22 2011 09:59:01.0160 */COMMIT WORK Wed Jun 22 2011 09:59:31.0188 */SELECT checkInterval FROM servgrp_config FOR UPDATE Wed Jun 22 2011 09:59:31.0194 */OK Wed Jun 22 2011 09:59:31.0200 */OK Wed Jun 22 2011 09:59:31.0200 */UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName = <servername> Wed Jun 22 2011 09:59:31.0202 */SELECT C18058 FROM T68 WHERE C1 = '000000000000015' Wed Jun 22 2011 09:59:31.0207 */OK Wed Jun 22 2011 09:59:31.0207 */SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 Wed Jun 22 2011 09:59:31.0209 */OK Wed Jun 22 2011 09:59:31.0212 */UPDATE T69 SET C18099=1,C18103=NULL,C5='AREMAILUSER_PQHg0JAAPwAA',C6=1308751171 WHERE C1 = '000000010478546' Wed Jun 22 2011 09:59:31.0215 */OK Wed Jun 22 2011 09:59:31.0220 */OK Wed Jun 22 2011 09:59:31.0220 */UPDATE T69 SET C18087 = NULL WHERE C1 IN ('000000010478546') Wed Jun 22 2011 09:59:31.0226 */COMMIT WORK Wed Jun 22 2011 09:59:31.0226 */OK Wed Jun 22 2011 09:59:31.0227 */COMMIT WORK Anne Ramey *********************************** E-mail correspondence to and from this address may be subject to the North Carolina Public Records Law and may be disclosed to third parties only by an authorized State Official. From: Action Request System discussion list(ARSList) [mailto:arslist@ARSLIST.ORG] On Behalf Of Walters, Mark Sent: Monday, May 23, 2011 10:51 AM To: arslist@ARSLIST.ORG Subject: Re: server group delinquency, inaccurate ** The signalling options should not be relevant for delinquency checks - this is done via updates to the database. In the SERVGRP_BOARD table there is a record for each server in your group. Every check interval a server will increment its own INTERVALCOUNT value and note the values of the other servers. At the next check interval the stored values are compared to those in the database and, if any of them have not changed, that server is considered delinquent. If this happens X successive times, where X is the delinquency threshold, then the server is considered to have failed and a failover of the services to the next highest ranked server takes place. Although there's a dedicated server group thread that handles this process it's not uncommon to see an occasional log entry reporting a delinquent server. This happens due to slight timing variations of when the intervalcount update process is performed on each server. Each server has an exclusive lock on the SERVGRP tables during the update process and this can lead to one server running 2 check interval updates before another server has been able to perform its update. However, as you had a failover, something must have prevented the intervalcount update on the server called secondary. You could try increasing the delinquency threshold to 3 to give each server more time for the update to run or enabling SQL logging and check the updates to the SERVGRP_BOARD table. You should see something like the following on each server every checkinterval; SELECT checkInterval FROM servgrp_config FOR UPDATE OK UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName = 'myservername' OK SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 OK COMMIT WORK The SELECT FOR UPDATE is used to serialise access to the tables and provide exclusive access for the server during the update. If the clocks on the servers are sync'd you should be able to explain most of the delinquency reports in the server group log and perhaps understand why one was missed. Mark I work for BMC, I don't speak for them. From: Action Request System discussion list(ARSList) [mailto:arslist@ARSLIST.ORG] On Behalf Of Ramey, Anne Sent: 21 May 2011 02:14 To: arslist@ARSLIST.ORG Subject: server group delinquency, inaccurate ** Since we upgraded from ARS 7.1 to ARS 7.5 (patch 7), we've seen a lot of "incorrect" delinquency messages in the server group logs-each server detecting the other as delinquent quite frequently, even when they are responding just fine. It happens at all hours, even after backups are finished and no one is in the office for the morning yet. The systems are NOT busy, but they are being "delinquent". We are using the same settings of a 30 second interval with 2 intervals before failover that we used before the upgrade. Server-Group-Signal-Option is already false. We don't want to go all the way to Disable-ARSignals. It got bad enough today to happen for two intervals and cause a failover during the work day. This, of course, caused the system to hang for a moment, and it managed to kill the email engine-which didn't fail back over properly. Has anyone else with ARS 7.5 p 7 seen something like this? Primary: <SGRP> /* Fri May 20 2011 11:57:37.5506 */ Server <secondary> determined to be operational <SGRP> /* Fri May 20 2011 12:00:07.6808 */ Server <secondary> determined to be delinquent with 4 operation(s) owned <SGRP> /* Fri May 20 2011 12:00:37.7094 */ Server <secondary> determined to be operational <SGRP> /* Fri May 20 2011 16:49:52.4853 */ Server <secondary> determined to be delinquent with 4 operation(s) owned <SGRP> /* Fri May 20 2011 16:50:22.4912 */ Server <secondary> delinquent for 2 intervals with 4 operation(s) owned <SGRP> /* Fri May 20 2011 16:50:22.4913 */ Transferring pending full text indexing tasks from server <secondary> to server <primary> <SGRP> /* Fri May 20 2011 16:50:22.4939 */ Transferred 0 pending full text indexing tasks from server <secondary> to server <primary> <SGRP> /* Fri May 20 2011 16:50:22.4939 */ Failing over E-Mail Engine operation from server <secondary> to server <primary> <SGRP> /* Fri May 20 2011 16:50:22.4956 */ Resuming E-Mail Engine operation <SGRP> /* Fri May 20 2011 16:50:22.4993 */ Failing over Flashboards operation from server <secondary> to server <primary> <SGRP> /* Fri May 20 2011 16:50:22.5011 */ Resuming Flashboards operation <SGRP> /* Fri May 20 2011 16:50:22.5038 */ Failing over Assignment Engine operation from server <secondary> to server <primary> <SGRP> /* Fri May 20 2011 16:50:22.5055 */ Resuming Assignment Engine operation <SGRP> /* Fri May 20 2011 16:50:22.5096 */ Failing over Reconciliation Engine operation from server <secondary> to server <primary> <SGRP> /* Fri May 20 2011 16:50:22.5113 */ Resuming Reconciliation Engine operation <SGRP> /* Fri May 20 2011 16:54:22.6434 */ Server <secondary> determined to be operational <SGRP> /* Fri May 20 2011 16:54:22.6435 */ Server <primary> relinquishing E-Mail Engine operation to server <secondary> <SGRP> /* Fri May 20 2011 16:54:22.6435 */ Suspending E-Mail Engine operation <SGRP> /* Fri May 20 2011 16:54:22.6460 */ Server <primary> relinquishing Flashboards operation to server <secondary> <SGRP> /* Fri May 20 2011 16:54:22.6461 */ Suspending Flashboards operation <SGRP> /* Fri May 20 2011 16:54:22.6544 */ Server <primary> relinquishing Assignment Engine operation to server <secondary> <SGRP> /* Fri May 20 2011 16:54:22.6544 */ Suspending Assignment Engine operation <SGRP> /* Fri May 20 2011 16:54:22.6615 */ Server <primary> relinquishing Reconciliation Engine operation to server <secondary> <SGRP> /* Fri May 20 2011 16:54:22.6616 */ Suspending Reconciliation Engine operation <SGRP> /* Fri May 20 2011 18:44:11.7233 */ Server <secondary> determined to be delinquent with 4 operation(s) owned <SGRP> /* Fri May 20 2011 18:44:41.7522 */ Server <secondary> determined to be operational <SGRP> /* Fri May 20 2011 18:56:42.0804 */ Server <secondary> determined to be delinquent with 4 operation(s) owned <SGRP> /* Fri May 20 2011 18:57:12.0894 */ Server <secondary> determined to be operational <SGRP> /* Fri May 20 2011 18:58:42.1558 */ Server <secondary> determined to be delinquent with 4 operation(s) owned <SGRP> /* Fri May 20 2011 18:59:12.2134 */ Server <secondary> determined to be operational Secondary: <SGRP> /* Fri May 20 2011 11:56:37.4981 */ Server <primary> determined to be operational <SGRP> /* Fri May 20 2011 11:59:37.6687 */ Server <primary> determined to be delinquent with 8 operation(s) owned <SGRP> /* Fri May 20 2011 12:00:07.9562 */ Server <primary> determined to be operational <SGRP> /* Fri May 20 2011 16:54:02.0895 */ Ownership of the E-Mail Engine operation removed unexpectedly from server <secondary> <SGRP> /* Fri May 20 2011 16:54:02.0896 */ Suspending E-Mail Engine operation <SGRP> /* Fri May 20 2011 16:54:02.0932 */ Ownership of the Flashboards operation removed unexpectedly from server <secondary> <SGRP> /* Fri May 20 2011 16:54:02.0933 */ Suspending Flashboards operation <SGRP> /* Fri May 20 2011 16:54:02.0963 */ Ownership of the Assignment Engine operation removed unexpectedly from server <secondary> <SGRP> /* Fri May 20 2011 16:54:02.0963 */ Suspending Assignment Engine operation <SGRP> /* Fri May 20 2011 16:54:02.1026 */ Ownership of the Reconciliation Engine operation removed unexpectedly from server <secondary> <SGRP> /* Fri May 20 2011 16:54:02.1027 */ Suspending Reconciliation Engine operation <SGRP> /* Fri May 20 2011 16:54:32.1604 */ Server <secondary> assuming responsibility for unowned E-Mail Engine operation <SGRP> /* Fri May 20 2011 16:54:32.1605 */ Resuming E-Mail Engine operation <SGRP> /* Fri May 20 2011 16:54:32.1634 */ Server <secondary> assuming responsibility for unowned Flashboards operation <SGRP> /* Fri May 20 2011 16:54:32.1635 */ Resuming Flashboards operation <SGRP> /* Fri May 20 2011 16:54:32.1683 */ Server <secondary> assuming responsibility for unowned Assignment Engine operation <SGRP> /* Fri May 20 2011 16:54:32.1684 */ Resuming Assignment Engine operation <SGRP> /* Fri May 20 2011 16:54:32.1719 */ Server <secondary> assuming responsibility for unowned Reconciliation Engine operation <SGRP> /* Fri May 20 2011 16:54:32.1719 */ Resuming Reconciliation Engine operation <SGRP> /* Fri May 20 2011 18:51:41.9330 */ Server <primary> determined to be delinquent with 8 operation(s) owned <SGRP> /* Fri May 20 2011 18:52:11.9556 */ Server <primary> determined to be operational <SGRP> /* Fri May 20 2011 18:57:42.1122 */ Server <primary> determined to be delinquent with 8 operation(s) owned <SGRP> /* Fri May 20 2011 18:58:12.1257 */ Server <primary> determined to be operational Anne Ramey *********************************** E-mail correspondence to and from this address may be subject to the North Carolina Public Records Law and may be disclosed to third parties only by an authorized State Official. _attend WWRUG11 www.wwrug.com<http://www.wwrug.com> ARSlist: "Where the Answers Are"_ _attend WWRUG11 www.wwrug.com ARSlist: "Where the Answers Are"_ _______________________________________________________________________________ UNSUBSCRIBE or access ARSlist Archives at www.arslist.org attend wwrug11 www.wwrug.com ARSList: "Where the Answers Are"