Jeff,

> I can fix them later, maybe even after I've had time to fully analyze the
> problem and get a software update from my vendor.

Well, that assumes you have even noticed the problem in the first place.

On the point of flapping - completely agree. But the knob (already
available in some implementations) that does not flap, but instead
keeps the session down until manual intervention, is a completely
different thing, and it is a completely safe solution from a
protocol-correctness point of view.
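
To make the distinction concrete, here is a toy Python sketch of the
three behaviours being discussed. It is only an illustration under
assumptions of my own: the Policy names, the Session class and the
replay() helper are hypothetical, the prefixes are documentation
ranges, and it assumes the peer re-sends the same malformed UPDATE
every time the session comes back. It is not how any particular
implementation works.

```python
from enum import Enum

GOOD = "192.0.2.0/24"      # prefix the peer still advertises
GONE = "198.51.100.0/24"   # prefix the peer tries to withdraw in the bad UPDATE


class Policy(Enum):
    RESET_AND_RETRY = "reset-and-retry"  # tear down, then re-establish (flaps)
    HOLD_DOWN = "hold-down"              # tear down, stay down until operator acts
    IGNORE_BAD = "ignore-bad-message"    # drop the malformed UPDATE, keep session


class Session:
    """Toy model of one BGP session (hypothetical, heavily simplified)."""

    def __init__(self, policy):
        self.policy = policy
        self.up = True
        self.rib = set()   # prefixes learned from this peer
        self.resets = 0

    def receive(self, announce=(), withdraw=(), malformed=False):
        if not self.up:
            return
        if malformed:
            if self.policy is Policy.IGNORE_BAD:
                return  # the withdrawals carried in this message are lost
            # Both other policies drop the session and all routes learned on it.
            self.up = False
            self.rib.clear()
            self.resets += 1
            return
        self.rib |= set(announce)
        self.rib -= set(withdraw)

    def retry_timer(self):
        """Automatic re-establishment; only RESET_AND_RETRY does this."""
        if not self.up and self.policy is Policy.RESET_AND_RETRY:
            self.up = True
            return True
        return False

    def operator_clear(self):
        """Manual intervention: the only way out of HOLD_DOWN."""
        self.up = True


def replay(policy, rounds=5):
    s = Session(policy)
    s.receive(announce=[GOOD, GONE])
    for _ in range(rounds):
        # Assumption: the same malformed UPDATE (withdrawing GONE) keeps arriving.
        s.receive(withdraw=[GONE], malformed=True)
        if s.retry_timer():
            s.receive(announce=[GOOD])  # peer re-advertises its valid state
    return s.resets, s.up, sorted(s.rib)


if __name__ == "__main__":
    for p in Policy:
        print(p.value, replay(p))
```

In this model reset-and-retry resets once per round, hold-down resets
once and then carries no routes from that peer until operator_clear()
is called, and ignore-bad-message never resets but keeps GONE in the
RIB even though the peer withdrew it.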

--

Yes, I understand your motivations, but the problems with BGP doing
things like treat-as-withdraw by default are really not the ones you
are describing.

Cheers,
R.




On Thu, Jan 3, 2013 at 10:19 PM, Jeff Wheeler <j...@inconcepts.biz> wrote:
> On Thu, Jan 3, 2013 at 3:18 PM, Robert Raszuk <rob...@raszuk.net> wrote:
>> How are you going to clean the NLRIs in your network (both transit or
>> stub) which were withdrawn in the messages your BGP implementation
>> declared "bad" and decided to ignore ?
>
> I can fix them later, maybe even after I've had time to fully analyze the
> problem and get a software update from my vendor.  Maybe I'll try a refresh
> or a session-reset, but I won't be at the mercy of repeatedly flapping
> session and phone ringing off the hook with angry customers!
>
> A lot of folks are thinking about this problem in the context of the big
> carrier who doesn't want a hard-to-diagnose problem of 1 RIB entry being
> wrong.  That's okay, it is one way to think about it.
>
> A second way to think of it is as a small/regional ISP.  If one or more of
> his transits are flapping because of a bad path on the DFZ, that is going to
> cost him money and customers.  If he has no way to mitigate it, he is at the
> mercy of external parties.  He could just use "ignore bad messages" and at
> least stop bleeding money.  He does not care if he can't reach  5 /24s at
> LANL, they are unimportant to him.  What is important is if he has any
> customers left next week.
>
> A third way is the small- or medium-datacenter network.  Imagine you are a
> typical small/medium shop and you have some Cisco/Juniper/Brocade stuff for
> your ASBRs and your core, but you bought a bunch of RainbowPoop Router Co
> switches for your racks, because they are inexpensive and they support EVPN,
> L3VPN, VPLS, or some other feature you want but Cisco/Juniper/Brocade don't
> put into their inexpensive product.
>
> So your network looks like this:
>
> ISP1    ISP2
>
>   CISCO  JUNIPER
>   |    \/
>   |    /\                \   |
>   |   /  \                \  |
>   TOR1    TOR2    ....    TOR99
>
> Now imagine your JUNIPER supports NewVpnThing and that's a feature you
> decided to use on the RainbowPoop TOR devices.  But TOR1 sends a bad BGP
> update.  JUNIPER knows about NewVpnThing and sees a bad BGP attribute (that
> it recognizes) so it does whatever the NewVpnThing spec says, and tears down
> the session to TOR1.
>
> CISCO on the other hand, does not know about NewVpnThing so this router
> doesn't even understand the update is bad.  It just passes it along to TOR2
> .. TOR99.  Now those boxes all tear down their session to the CISCO.  Then
> they re-establish.  Then they go down again.  They keep on doing this and
> the network is freaking out.
>
> By the time your in-house clue notices, your symptom is that 99 identical
> TORs are flapping their BGP to your CISCO.  You probably don't even notice
> the 1 TOR that is flapping to JUNIPER.  Maybe JUNIPER even logs something
> helpful but you may not investigate it for a while.
>
> So your CISCO which is following the base spec is carrying a buggy update to
> your 99 other RainbowPoop TORs and they are all failing.  Your JUNIPER which
> knows about the NewVpnThing is following its spec and protecting the other
> TORs from this problem, but it is probably not helpful since your network is
> in chaos from all the flapping.
>
> What do you do?  Call vendor support.  Probably for CISCO and RainbowPoop.
> Well, now you are expecting the TAC of Cisco and the TAC of RainbowPoop to
> cooperate, which they'll have trouble doing; and it may take ages before
> anyone identifies the root cause of the problem is really TOR1.
>
> There are going to be a lot of RainbowPoop routers in the future, and many
> of them may use BGP.  We should make BGP more robust.
>
>
> --
> Jeff S Wheeler <j...@inconcepts.biz>
> Sr Network Operator  /  Innovative Network Concepts
>
> _______________________________________________
> Idr mailing list
> i...@ietf.org
> https://www.ietf.org/mailman/listinfo/idr
>
_______________________________________________
GROW mailing list
GROW@ietf.org
https://www.ietf.org/mailman/listinfo/grow
