The lack of any logging for 2 minutes suggests the server hung for some reason and, as you say, that's a different issue to the phantom failovers.
Looking at the log lines below; But in the server group log on ncc391: <SGRP> /* Wed Jun 22 2011 09:59:31.0224 */ Server ncc392.its.state.nc.us determined to be delinquent with 4 operation(s) owned <SGRP> /* Wed Jun 22 2011 10:00:01.0659 */ Server ncc392.its.state.nc.us determined to be operational This indicates that when ncc391 read the intervalCount from the database for ncc392 it was the same value as in the previous check 30 seconds earlier. Every checkInterval each server increments its own intervalCount by 1 and reads the intervalCount values for the other server(s). We can see ncc391 reading the intervalCount; Wed Jun 22 2011 09:59:01.0145 */SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 Wed Jun 22 2011 09:59:31.0207 */SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 And ncc392 updating its value; 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9957 */UPDATE servgrp_board SET intervalCount = intervalCount 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9964 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9964 */SELECT serverName,serverPort,intervalCount,pendingSigna 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9976 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9981 */COMMIT WORK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0249 */UPDATE servgrp_board SET intervalCount = intervalCount 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0255 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0256 */SELECT serverName,serverPort,intervalCount,pendingSigna 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0267 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0272 */COMMIT WORK Assuming the clocks on the two systems are in sync; ncc392 updates intervalCount to X Wed Jun 22 2011 09:59:00.9981 */COMMIT WORK (I'm using the COMMIT time as other clients would not see the update in the database until it had been committed) ncc391 reads the intervalCount as X Wed Jun 22 2011 09:59:01.0145 */SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 ncc391 reads the intervalCount as X Wed Jun 22 2011 09:59:31.0207 */SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 ncc392 updates intervalCount to X+1 Wed Jun 22 2011 09:59:31.0272 */COMMIT WORK What has happened is that, as the two servers are running their server group checks at almost exactly the same time, even very small variations in the 30 second interval can lead to server A reading the intervalCount for server B twice before it has been updated. This leads to the delinquent entry in the server group log. In the case above ncc392 updated the intervalCount to X+1 at 09:59:31.0272 and ncc391 will have read this at the 10:00:01 check run - the intervalCount will have increased so the delinquency count would be reset. A failover should only happen when N checks have been run without the intervalCount increasing (where N is your delinquency threshold from the Server Group Operation Ranking form). Note that a log entry reporting a delinquent server does not indicate a failover - only that a server has not incremented its intervalCount as shown in the example above. An actual failover looks like this; <SGRP> /* Thu Jun 23 2011 08:57:14.8475 */ Server moetz02 delinquent for 2 intervals with 1 operation(s) owned <SGRP> /* Thu Jun 23 2011 08:57:14.8478 */ Failing over Escalation operation from server moetz02 to server moetz01 Hope that helps Mark I work for BMC, I don't speak for them. From: Action Request System discussion list(ARSList) [mailto:arslist@ARSLIST.ORG] On Behalf Of Ramey, Anne Sent: 22 June 2011 18:01 To: arslist@ARSLIST.ORG Subject: Re: server group delinquency, inaccurate ** I had another "incorrect" failover this morning. The secondary server was responding and processing user requests fine (as far as customers and monitoring reported) but there is a nearly 2 minute gap in the sql log-where of course no updates are made to the servgrp_board table. According to what you say here, that would cause the failover alright...but what would cause the "hang-up" like this? There are actually no logs at all for this 2 minute time period....That is obviously a different issue. BUT the other intervals it appears to be finding delinquency, I don't see a gap like that. All of the updates take place (see example below). Does it matter what order they happen in? What's the logic of figuring out delinquency? >From SQL log on ncc392: 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9944 */SELECT checkInterval FROM servgrp_config FOR UPDATE 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9956 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9957 */UPDATE servgrp_board SET intervalCount = intervalCount 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9964 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9964 */SELECT serverName,serverPort,intervalCount,pendingSigna 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9976 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:00.9981 */COMMIT WORK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0016 */SELECT checkInterval FROM servgrp_config FOR UPDATE 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0248 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0249 */UPDATE servgrp_board SET intervalCount = intervalCount 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0255 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0256 */SELECT serverName,serverPort,intervalCount,pendingSigna 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0267 */OK 000000 > <USER: > /* Wed Jun 22 2011 09:59:31.0272 */COMMIT WORK But in the server group log on ncc391: <SGRP> /* Wed Jun 22 2011 09:59:31.0224 */ Server ncc392.its.state.nc.us determined to be delinquent with 4 operation(s) owned <SGRP> /* Wed Jun 22 2011 10:00:01.0659 */ Server ncc392.its.state.nc.us determined to be operational In the sql log on ncc391: Wed Jun 22 2011 09:59:00.9937 */SELECT checkInterval FROM servgrp_config FOR UPDATE Wed Jun 22 2011 09:59:01.0138 */OK Wed Jun 22 2011 09:59:01.0139 */UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName = <servername> Wed Jun 22 2011 09:59:01.0145 */OK Wed Jun 22 2011 09:59:01.0145 */SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 Wed Jun 22 2011 09:59:01.0150 */OK Wed Jun 22 2011 09:59:01.0160 */COMMIT WORK Wed Jun 22 2011 09:59:31.0188 */SELECT checkInterval FROM servgrp_config FOR UPDATE Wed Jun 22 2011 09:59:31.0194 */OK Wed Jun 22 2011 09:59:31.0200 */OK Wed Jun 22 2011 09:59:31.0200 */UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName = <servername> Wed Jun 22 2011 09:59:31.0202 */SELECT C18058 FROM T68 WHERE C1 = '000000000000015' Wed Jun 22 2011 09:59:31.0207 */OK Wed Jun 22 2011 09:59:31.0207 */SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 Wed Jun 22 2011 09:59:31.0209 */OK Wed Jun 22 2011 09:59:31.0212 */UPDATE T69 SET C18099=1,C18103=NULL,C5='AREMAILUSER_PQHg0JAAPwAA',C6=1308751171 WHERE C1 = '000000010478546' Wed Jun 22 2011 09:59:31.0215 */OK Wed Jun 22 2011 09:59:31.0220 */OK Wed Jun 22 2011 09:59:31.0220 */UPDATE T69 SET C18087 = NULL WHERE C1 IN ('000000010478546') Wed Jun 22 2011 09:59:31.0226 */COMMIT WORK Wed Jun 22 2011 09:59:31.0226 */OK Wed Jun 22 2011 09:59:31.0227 */COMMIT WORK Anne Ramey *********************************** E-mail correspondence to and from this address may be subject to the North Carolina Public Records Law and may be disclosed to third parties only by an authorized State Official. From: Action Request System discussion list(ARSList) [mailto:arslist@ARSLIST.ORG] On Behalf Of Walters, Mark Sent: Monday, May 23, 2011 10:51 AM To: arslist@ARSLIST.ORG Subject: Re: server group delinquency, inaccurate ** The signalling options should not be relevant for delinquency checks - this is done via updates to the database. In the SERVGRP_BOARD table there is a record for each server in your group. Every check interval a server will increment its own INTERVALCOUNT value and note the values of the other servers. At the next check interval the stored values are compared to those in the database and, if any of them have not changed, that server is considered delinquent. If this happens X successive times, where X is the delinquency threshold, then the server is considered to have failed and a failover of the services to the next highest ranked server takes place. Although there's a dedicated server group thread that handles this process it's not uncommon to see an occasional log entry reporting a delinquent server. This happens due to slight timing variations of when the intervalcount update process is performed on each server. Each server has an exclusive lock on the SERVGRP tables during the update process and this can lead to one server running 2 check interval updates before another server has been able to perform its update. However, as you had a failover, something must have prevented the intervalcount update on the server called secondary. You could try increasing the delinquency threshold to 3 to give each server more time for the update to run or enabling SQL logging and check the updates to the SERVGRP_BOARD table. You should see something like the following on each server every checkinterval; SELECT checkInterval FROM servgrp_config FOR UPDATE OK UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName = 'myservername' OK SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board ORDER BY 1 OK COMMIT WORK The SELECT FOR UPDATE is used to serialise access to the tables and provide exclusive access for the server during the update. If the clocks on the servers are sync'd you should be able to explain most of the delinquency reports in the server group log and perhaps understand why one was missed. Mark I work for BMC, I don't speak for them. _______________________________________________________________________________ UNSUBSCRIBE or access ARSlist Archives at www.arslist.org attend wwrug11 www.wwrug.com ARSList: "Where the Answers Are"