VSWITCH Controller failover issue on z/VM 5.2

2006-04-24 Thread Brian Nielsen
This weekend the LAN team upgraded the CISCO router connected to the OSA
card.  The VSWITCH controller console shows the message:

   DTCOSD309W Received adapter-initiated Stop Lan

after which it tries to fail over from the primary OSA to the backup OSA.
Eventually I see:

   DTCOSD306I Received adapter-initiated Start Lan

and the backup OSA addresses get started fine and assign all the IP 
addresses.

There are several more iterations of Stop Lan and Start Lan, but it is
always on the backup OSA addresses, never on the primary OSA addresses.

So the question is: why didn't it ever try to restart on the primary OSA
when the backup OSA received the Stop Lan?  If both OSA cards were the
same it might not matter that much, but the primary is a Gig-OSA and the
backup is a Fast-OSA.

-

Oh, and just to make it more fun, lastly there is a Stop Lan, the various
normal messages about stopping, attempting to restart, and:

   AMPX036I ASSERTION FAILURE CHECKING ERROR
   TRACE BACK OF CALLED ROUTINES
   ROUTINE                     STMT  AT ADDRESS  IN MODULE
   SPSM_BLKALLOCATE              39  00D429D0    TCFPSM_FPSM
   INITDCB                       48  00DA079A    TCTOOSD_TOOSD
   TOOSDINIT                     33  00DA2DF4    TCTOOSD_TOOSD
   CALLINITRTN                   30  00CB9506    TCPARSE_PARSETCP
   PROCESSSTARTSTOPSTATEMENT     33  00CAE37E    TCPARSE_PARSETCP
   PARSEOPTION                  292  00CB3C00    TCPARSE_PARSETCP
   RECEIVECONTROLLERMSGFROMCP    25  00D9F010    TCTOOSD_TOOSD
   TOIUCV                       180  00D3D580    TCTOIUC_TOIUCV
   Schedule                    1670  00CFA118
   MAIN-PROGRAM                  14  00C441FE    TCPIP
   VSPASCAL                          00E4D74A

   DTCOSD100E Insufficient Fixed Page Storage Pool storage

after which it tries to shut down and then goes into repeated:

  AMPX015I ADDRESSING EXCEPTION

abends.  Needless to say there were a lot of dump files in spool space.

So, IBM will get a call on the abends.  I could add more virtual storage
to the controllers, but that will presumably only add cushion for more
stops and restarts before it runs out of storage again.

Brian Nielsen


Re: VSWITCH Controller failover issue on z/VM 5.2

2006-04-24 Thread David Kreuter
After the successful fail over from primary to backup RDEVs for the OSAs, what
state was reported for the primary RDEVs?
David


Re: VSWITCH Controller failover issue on z/VM 5.2

2006-04-24 Thread Alan Altmark
On Monday, 04/24/2006 at 12:39 EST, Brian Nielsen 
[EMAIL PROTECTED] wrote:
 
> There are several more iterations of Stop Lan and Start Lan, but it is
> always on the backup OSA addresses, never on the primary OSA addresses.
>
> So the question is: why didn't it ever try to restart on the primary OSA
> when the backup OSA received the Stop Lan?  If both OSA cards were the
> same it might not matter that much, but the primary is a Gig-OSA and the
> backup is a Fast-OSA.

There is no permanent association of primary and backup in the 
VSWITCH.  There is just a list of available OSAs.  The VSWITCH starts with 
the first one and fails over until it finds a working one.  Whichever one 
is active is the primary.  The others are backups.  (You can see that 
in QUERY VSWITCH.)

So, I'm not 100% sure I'm reading your post correctly, but I would have 
expected identical behavior vis a vis the decision to fail over, without 
regard to which OSA was active and which were backups.  A StopLAN on the 
currently active OSA should have caused a failover.  If none of the 
backups worked (i.e. all OSAs unplugged at the same time), I would have 
expected CP to come back to the OSA that is currently 'active' and sit and 
wait for StartLAN on the active OSA or one of the backups.  At that point, 
whichever one comes up first would become the active OSA.
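
A minimal sketch of the selection behavior described above, for illustration only. It is a toy model, not CP's implementation: the function name, the device labels "GIG" and "FAST", and the wrap-around search order after the active device are all assumptions.

```python
from typing import Optional, Sequence, Set


def next_active(osas: Sequence[str], up: Set[str],
                current: Optional[str]) -> Optional[str]:
    """Return the OSA that should be active, given which devices are up.

    osas    -- the VSWITCH's ordered list of OSAs (labels are hypothetical)
    up      -- devices that currently have a working LAN connection
    current -- the active device, or None before the first connect
    """
    if current in up:
        # The active device is healthy, so there is no reason to move.
        # Note that there is no automatic return to the first OSA in the list.
        return current
    if not osas:
        return None
    # Assumption: search begins after the current device and wraps around.
    start = osas.index(current) + 1 if current in osas else 0
    for i in range(len(osas)):
        candidate = osas[(start + i) % len(osas)]
        if candidate in up:
            return candidate
    # Nothing works: stay on the current device and wait for a StartLAN.
    return current


osas = ["GIG", "FAST"]
active = next_active(osas, {"GIG", "FAST"}, None)   # initial connect
active = next_active(osas, {"FAST"}, active)        # StopLAN on GIG
active = next_active(osas, {"GIG", "FAST"}, active) # GIG returns later
print(active)
```

In this model the switch stays on the second device even after the first one comes back, which matches the "whichever one is active is the primary" description.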

Of course, with the abend in the controller, bizarre things were obviously 
going on.  The Support Center definitely needs to look at it.

FWIW, if you want it to go back to the first OSA in the list, set up some 
system automation to do a SET VSWITCH DISCONNECT followed by a SET VSWITCH 
CONNECT when you get an adapter-initiated StartLAN on the desired device 
(as seen on the controller's console).  CP will go back to the beginning 
of the OSA list and start looking for a working device.
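
The automation suggested above could be sketched roughly as follows. This only builds the command strings; how console lines are captured and how commands are issued depends on the automation product and is not shown. The switch name "VSW1", the device label "GIGOSA", and the assumption that the console line identifies the device are all hypothetical, since the message text quoted in this thread does not show a device name.

```python
from typing import List

START_LAN_MSG = "DTCOSD306I"  # message ID quoted earlier in this thread


def reconnect_commands(console_line: str, switch: str,
                       preferred_device: str) -> List[str]:
    """Return the CP commands to issue for one controller console line.

    An empty list means "do nothing". Matching the device by substring is
    an assumption about the console line format, not documented behavior.
    """
    if START_LAN_MSG not in console_line:
        return []
    if preferred_device.upper() not in console_line.upper():
        return []
    # DISCONNECT followed by CONNECT makes CP restart from the top of the list.
    return [
        f"SET VSWITCH {switch} DISCONNECT",
        f"SET VSWITCH {switch} CONNECT",
    ]


# Made-up console line for illustration; the real text may differ.
line = "DTCOSD306I Received adapter-initiated Start Lan (device GIGOSA)"
for command in reconnect_commands(line, "VSW1", "GIGOSA"):
    print(command)
```

Keeping the decision in a pure function makes it easy to test against captured console logs before wiring it to anything that actually issues commands.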

Alan Altmark
z/VM Development
IBM Endicott


Re: VSWITCH Controller failover issue on z/VM 5.2

2006-04-24 Thread Brian Nielsen
The primaries were listed as BACKUP.

Brian Nielsen


On Mon, 24 Apr 2006 14:21:11 -0400, David Kreuter [EMAIL PROTECTED] wrote:

> After the successful fail over from primary to backup RDEVs for the OSAs,
> what state was reported for the primary RDEVs?
> David


Re: VSWITCH Controller failover issue on z/VM 5.2

2006-04-24 Thread Brian Nielsen
Thanks.  So whichever OSA does a Start Lan first becomes the current
primary.  Perhaps someone knows if there's a way to force the CISCO router
to start the preferred gig-OSA interface before the fast-OSA interface.

I accomplished going back to the desired OSA by autologging DTCVSW1 (which
is normally the active one, but had abended and was no longer logged on),
then forcing and autologging DTCVSW2 (which had taken over after DTCVSW1
abended).  A side benefit was a clean virtual address space for each.

Brian Nielsen
