On Tue, 10 Mar 2009, Peter Rathlev wrote:

On Tue, 2009-03-10 at 11:32 +0100, Andrew Yourtchenko wrote:
if it is merely a new standby that is coming up, the active should not
stop forwarding the traffic.

That's what I would've assumed too. :-) I do seem to remember that we've
seen this before though, when we had random reboots (CSCse15099 if I
remember correctly) of secondary units resulting in downtime.

ahha. interesting - so there's probably also something specific to the setup that might be contributing to seeing this.


I'd watch out for "logging standby" - I vaguely remember there were
some issues where the newly coming up box would try to send the
traffic with the wrong IP/MAC and/or send the gratuitous arp with the
wrong info in there.

This may especially be true if you are bringing up the primary as standby - at
that moment the secondary/active unit is forwarding the traffic using the
primary's MAC addresses.
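
For reference, this is roughly how one could check it per context (the
context name "sample" below is only an example):

changeto context sample
show logging
! "show logging" should report whether standby logging is enabled;
! if it is, it can be turned off with:
configure terminal
no logging standby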

Standby logging is disabled on all contexts, so this shouldn't be the
issue. Of course, if the module coming up sends out gratuitous ARPs, this
could break things with the way PIX/FWSM/ASA does HA, using the same
MAC address.

I had thought that the unit coming up would start by looking at the
failover interface(s) to see if there is already an active unit, and
then start acting as active/standby depending on what it hears.


That's precisely how it is supposed to work :-)

Otherwise a crashing/rebooting primary unit would always introduce
downtime when coming up again.
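
For reference, that negotiation happens over the failover LAN interface
defined in the system context. A minimal sketch of such a setup looks
roughly like this (interface names, VLANs and addresses here are made up,
not taken from any real config):

! system context of the configured primary
failover lan unit primary
failover lan interface faillink vlan 100
failover interface ip faillink 10.0.100.1 255.255.255.0 standby 10.0.100.2
failover link statelink vlan 101
failover interface ip statelink 10.0.101.1 255.255.255.0 standby 10.0.101.2
failover

The idea being that the unit coming up first listens on the failover
interface and only goes active if it hears no active peer there.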

Of course, it would be interesting to check whether this indeed happens on
all the contexts or only on some of them, etc.
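
A quick way to do that would be something along these lines (the context
name is again just an example):

! repeat for each context
changeto context sample
show failover
show interface
! and the overall picture from the system context
changeto system
show failover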

The log buffer (logging errors) on the contexts on the standby unit
(which is the configured primary) didn't say much.

From the sys context on the standby:

Mar 09 2009 16:41:04: %FWSM-4-411003: Interface statefullfailover, changed state to administratively up
Mar 09 2009 16:41:05: %FWSM-5-504001: Security context admin was added to the system
...
Mar 09 2009 16:41:07: %FWSM-5-504001: Security context sample was added to the system
Mar 09 2009 16:41:26: %FWSM-1-709006: (Primary) End Configuration Replication (STB)
Mar 09 2009 16:42:02: %FWSM-6-210022: LU missed 4837568 updates

From one of the contexts, still the standby unit:

Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface internet
Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface internet waiting
Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface aars_pro
Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface aars_pro waiting
Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface inside
Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface inside waiting
Mar 09 2009 16:41:35: %FWSM-1-105006: (Primary) Link status 'Up' on interface aars_interfw
Mar 09 2009 16:41:35: %FWSM-1-105003: (Primary) Monitoring on interface aars_interfw waiting
Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface internet normal
Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface aars_pro normal
Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface inside normal
Mar 09 2009 16:41:44: %FWSM-1-105004: (Primary) Monitoring on interface aars_interfw normal

So the contexts weren't really activated (with "Up" interfaces) during
most of the downtime, only at the end. To me that seems to suggest that
it's not simply the unit "stealing" the traffic for the interfaces.

Right. That was why I was wondering whether indeed "all" (as in "absolutely all, I swear" :) the traffic was affected, or only some part of it, and with what timing. Another interesting thing to check is whether the existing TCP connections continue to run OK (then the problem area could be narrowed down to the session path and up). Of course, in a real-world scenario there is frequently simply not enough time to look at those details, but they would definitely be helpful.
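
A rough way to keep an eye on that during the event (again just a sketch,
run inside the affected context on the currently active unit) would be:

changeto context sample
! total number of connections - a sudden drop would point at the session path
show conn count
! per-connection detail, including idle timers and byte counts
show conn detail

If the established connections stay in the table and their counters keep
moving while new connections fail, that would already narrow things down
quite a bit.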


It seems we have to try and replicate it in the lab to find out what
actually happened. :-)

Yes, that would be ideal - if this issue is reproducible in the lab, then opening a case to further nail it down would be the way to go.

cheers,
andrew
_______________________________________________
cisco-nsp mailing list  cisco-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/
