The lack of any logging for 2 minutes suggests the server hung for some reason 
and, as you say, that's a different issue to the phantom failovers.

Looking at the log lines below;

But in the server group log on ncc391:
<SGRP> /* Wed Jun 22 2011 09:59:31.0224 */ Server ncc392.its.state.nc.us 
determined to be delinquent with 4 operation(s) owned
<SGRP> /* Wed Jun 22 2011 10:00:01.0659 */ Server ncc392.its.state.nc.us 
determined to be operational

This indicates that when ncc391 read the intervalCount from the database for 
ncc392 it was the same value as in the previous check 30 seconds earlier.   
Every checkInterval each server increments its own intervalCount by 1 and reads 
the intervalCount values for the other server(s).  We can see ncc391 reading 
the intervalCount;

Wed Jun 22 2011 09:59:01.0145 */SELECT 
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board 
ORDER BY 1
Wed Jun 22 2011 09:59:31.0207 */SELECT 
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board 
ORDER BY 1

And ncc392 updating its value;

000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9957 */UPDATE servgrp_board SET intervalCount = intervalCount
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9964 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9964 */SELECT serverName,serverPort,intervalCount,pendingSigna
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9976 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9981 */COMMIT WORK

000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0249 */UPDATE servgrp_board SET intervalCount = intervalCount
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0255 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0256 */SELECT serverName,serverPort,intervalCount,pendingSigna
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0267 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0272 */COMMIT WORK

Assuming the clocks on the two systems are in sync;

ncc392 updates intervalCount to X                           Wed Jun 22 2011 
09:59:00.9981 */COMMIT WORK
(I'm using the COMMIT time as other clients would not see the update in the 
database until it had been committed)
ncc391 reads the intervalCount as X                        Wed Jun 22 2011 
09:59:01.0145 */SELECT 
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board 
ORDER BY 1
ncc391 reads the intervalCount as X                        Wed Jun 22 2011 
09:59:31.0207 */SELECT 
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board 
ORDER BY 1
ncc392 updates intervalCount to X+1                      Wed Jun 22 2011 
09:59:31.0272 */COMMIT WORK

What has happened is that, as the two servers are running their server group 
checks at almost exactly the same time, even very small variations in the 30 
second interval can lead to server A reading the intervalCount for server B 
twice before it has been updated.  This leads to the delinquent entry in the 
server group log.   In the case above ncc392 updated the intervalCount to X+1 
at 09:59:31.0272 and ncc391 will have read this at the 10:00:01 check run - the 
intervalCount will have increased so the delinquency count would be reset.  A 
failover should only happen when N checks have been run without the 
intervalCount increasing (where N is your delinquency threshold from the Server 
Group Operation Ranking form).

Note that a log entry reporting a delinquent server does not indicate a 
failover - only that a server has not incremented its intervalCount as shown in 
the example above.  An actual failover looks like this;

<SGRP> /* Thu Jun 23 2011 08:57:14.8475 */ Server moetz02 delinquent for 2 
intervals with 1 operation(s) owned
<SGRP> /* Thu Jun 23 2011 08:57:14.8478 */ Failing over Escalation operation 
from server moetz02 to server moetz01

Hope that helps

Mark

I work for BMC, I don't speak for them.


From: Action Request System discussion list(ARSList) 
[mailto:arslist@ARSLIST.ORG] On Behalf Of Ramey, Anne
Sent: 22 June 2011 18:01
To: arslist@ARSLIST.ORG
Subject: Re: server group delinquency, inaccurate

**
I had another "incorrect" failover this morning.  The secondary server was 
responding and processing user requests fine (as far as customers and 
monitoring reported) but there is a nearly 2 minute gap in the sql log-where of 
course no updates are made to the servgrp_board table.  According to what you 
say here, that would cause the failover alright...but what would cause the 
"hang-up" like this?  There are actually no logs at all for this 2 minute time 
period....That is obviously a different issue.

BUT the other intervals it appears to be finding delinquency, I don't see a gap 
like that.  All of the updates take place (see example below).  Does it matter 
what order they happen in?  What's the logic of figuring out delinquency?

>From SQL log on ncc392:
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9944 */SELECT checkInterval FROM servgrp_config FOR UPDATE
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9956 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9957 */UPDATE servgrp_board SET intervalCount = intervalCount
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9964 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9964 */SELECT serverName,serverPort,intervalCount,pendingSigna
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9976 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9981 */COMMIT WORK

000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0016 */SELECT checkInterval FROM servgrp_config FOR UPDATE
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0248 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0249 */UPDATE servgrp_board SET intervalCount = intervalCount
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0255 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0256 */SELECT serverName,serverPort,intervalCount,pendingSigna
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0267 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0272 */COMMIT WORK

But in the server group log on ncc391:
<SGRP> /* Wed Jun 22 2011 09:59:31.0224 */ Server ncc392.its.state.nc.us 
determined to be delinquent with 4 operation(s) owned
<SGRP> /* Wed Jun 22 2011 10:00:01.0659 */ Server ncc392.its.state.nc.us 
determined to be operational

In the sql log on ncc391:
Wed Jun 22 2011 09:59:00.9937 */SELECT checkInterval FROM servgrp_config FOR 
UPDATE
Wed Jun 22 2011 09:59:01.0138 */OK
Wed Jun 22 2011 09:59:01.0139 */UPDATE servgrp_board SET intervalCount = 
intervalCount + 1 WHERE serverName = <servername>
Wed Jun 22 2011 09:59:01.0145 */OK
Wed Jun 22 2011 09:59:01.0145 */SELECT 
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board 
ORDER BY 1
Wed Jun 22 2011 09:59:01.0150 */OK
Wed Jun 22 2011 09:59:01.0160 */COMMIT WORK

Wed Jun 22 2011 09:59:31.0188 */SELECT checkInterval FROM servgrp_config FOR 
UPDATE
Wed Jun 22 2011 09:59:31.0194 */OK
Wed Jun 22 2011 09:59:31.0200 */OK
Wed Jun 22 2011 09:59:31.0200 */UPDATE servgrp_board SET intervalCount = 
intervalCount + 1 WHERE serverName = <servername>
Wed Jun 22 2011 09:59:31.0202 */SELECT C18058 FROM T68 WHERE C1 = 
'000000000000015'
Wed Jun 22 2011 09:59:31.0207 */OK
Wed Jun 22 2011 09:59:31.0207 */SELECT 
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board 
ORDER BY 1
Wed Jun 22 2011 09:59:31.0209 */OK
Wed Jun 22 2011 09:59:31.0212 */UPDATE T69 SET 
C18099=1,C18103=NULL,C5='AREMAILUSER_PQHg0JAAPwAA',C6=1308751171 WHERE C1 = 
'000000010478546'
Wed Jun 22 2011 09:59:31.0215 */OK
Wed Jun 22 2011 09:59:31.0220 */OK
Wed Jun 22 2011 09:59:31.0220 */UPDATE T69 SET C18087 = NULL WHERE C1 IN 
('000000010478546')
Wed Jun 22 2011 09:59:31.0226 */COMMIT WORK
Wed Jun 22 2011 09:59:31.0226 */OK
Wed Jun 22 2011 09:59:31.0227 */COMMIT WORK


Anne Ramey
***********************************
E-mail correspondence to and from this address may be subject to the North 
Carolina Public Records Law and may be disclosed to third parties only by an 
authorized State Official.

From: Action Request System discussion list(ARSList) 
[mailto:arslist@ARSLIST.ORG] On Behalf Of Walters, Mark
Sent: Monday, May 23, 2011 10:51 AM
To: arslist@ARSLIST.ORG
Subject: Re: server group delinquency, inaccurate

**
The signalling options should not be relevant for delinquency checks - this is 
done via updates to the database.  In the SERVGRP_BOARD table there is a record 
for each server in your group.  Every check interval a server will increment 
its own INTERVALCOUNT value and note the values of the other servers.  At the 
next check interval the stored values are compared to those in the database 
and, if any of them have not changed, that server is considered delinquent.  If 
this happens X successive times, where X is the delinquency threshold, then the 
server is considered to have failed and a failover of the services to the next 
highest ranked server takes place.

Although there's a dedicated server group thread that handles this process it's 
not uncommon to see an occasional log entry reporting a delinquent server.  
This happens due to slight timing variations of when the intervalcount update 
process is performed on each server.   Each server has an exclusive lock on the 
SERVGRP tables during the update process and this can lead to one server 
running 2 check interval updates before another server has been able to perform 
its update.

However, as you had a failover, something must have prevented the intervalcount 
update on the server called secondary.  You could try increasing the 
delinquency threshold to 3 to give each server more time for the update to run 
or enabling SQL logging and check the updates to the SERVGRP_BOARD table.  You 
should see something like the following on each server every checkinterval;

SELECT checkInterval FROM servgrp_config FOR UPDATE
OK
UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName = 
'myservername'
OK
SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM 
servgrp_board ORDER BY 1
OK
COMMIT WORK

The SELECT  FOR UPDATE is used to serialise access to the tables and provide 
exclusive access for the server during the update.  If the clocks on the 
servers are sync'd you should be able to explain most of the delinquency 
reports in the server group log and perhaps understand why one was missed.

Mark

I work for BMC, I don't speak for them.



_______________________________________________________________________________
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
attend wwrug11 www.wwrug.com ARSList: "Where the Answers Are"

Reply via email to