After all the messages recently about how to fix DNS, I was seriously
tempted to title this message "And now, for something completely
different", but "impossible circuit" is more descriptive.
Before you read further, I need everyone to put on their
thinking-way-outside-the-box hats. I've already heard from enough people
that I'm nuts and that what I'm seeing can't happen, so it must not be
happening...even though we see the results of it happening.
I've got this private line DS3. It connects cisco 7206 routers in
Orlando (at our data center) and in Ocala (a colo rack in the Embarq CO).
According to the DLR, it's a real circuit: various portions of it ride
varying-sized OC circuits, and it's handed off to us at each end the
usual way (copper/coax) and plugged into PA-2T3 cards.
Last Tuesday, at about 2:30PM, "something bad happened." We saw a serious
jump in traffic to Ocala, and in particular we noticed that one customer's
connection (a group of load-sharing T1s) was totally full. We quickly
assumed it was a DDoS aimed at that customer, but looking at the traffic,
we couldn't pinpoint anything beyond the expected flows.
Then we noticed the really weird stuff. Pings to anything in Ocala
responded with multiple dupes and TTL exceeded messages from a Level3 IP.
Traceroutes to certain IPs in Ocala would get as far as our Ocala router,
then inexplicably hop onto Sprintlink's network, come back to us over our
Level3 transit connection, get to Ocala, then hop over to Sprintlink
again, repeating that loop as many times as max TTL would permit. Pings
from router to router crossing just the DS3 would work, but we'd see 10
duplicate packets for every 1 expected packet. BTW, the cisco CLI hides
dupes unless you turn on ip icmp debugging.
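If you've never watched for dupes that way, it looks roughly like this
(the addresses are made up for illustration, and the exact debug text
varies a bit by IOS version):

  router# debug ip icmp
  router# ping 10.0.0.2 repeat 1
  ICMP: echo reply rcvd, src 10.0.0.2, dst 10.0.0.1
  ICMP: echo reply rcvd, src 10.0.0.2, dst 10.0.0.1   <-- dupe
  ICMP: echo reply rcvd, src 10.0.0.2, dst 10.0.0.1   <-- dupe
  router# undebug all

The ping command itself still reports 1 packet sent, 1 received; the
extra replies only show up in the debug output.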
I've seen somewhat similar things (though contained within an AS) with
MPLS and routing misconfigurations, but traffic jumping off our network
(onto a network to which we're not directly connected) was seemingly
impossible. We did all sorts of things to troubleshoot it (studied our
router configs in rancid, temporarily shut every interface on the Ocala
side other than the DS3, changed IOS versions, changed out the hardware,
opened a ticket with cisco TAC), but then it occurred to me that if
traffic was actually jumping off our network and coming back in via
Level3, I could see/block at least some of it using an ACL on our
interface to Level3. How do you explain it when you ping the remote end
of a DS3 interface with a single echo request packet and see 5 copies of
that echo request arrive at one of your transit provider's interfaces?
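As a rough sketch of what that ACL looked like (the addresses and
interface names below are made up, not our real ones), it's just a named
extended ACL that logs and drops ICMP between the DS3's endpoint
addresses when it shows up inbound from Level3:

  ip access-list extended FROM-LEVEL3
   deny   icmp host 10.0.0.1 host 10.0.0.2 echo log
   deny   icmp host 10.0.0.2 host 10.0.0.1 echo-reply log
   permit ip any any
  !
  interface GigabitEthernet0/1
   description transit to Level3
   ip access-group FROM-LEVEL3 in

"show access-lists FROM-LEVEL3" then shows the hit counters climbing every
time one of those "impossible" copies arrives from the transit side.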
Here's a typical traceroute with the first few hops (from my home internet
connection) removed. BTW, hop 9 is a customer router conveniently
configured with no ip unreachables.
 7  andc-br-3-f2-0.atlantic.net (209.208.9.138)  47.951 ms  56.096 ms  56.154 ms
 8  ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98)  56.199 ms  56.320 ms  56.196 ms
 9  * * *
10  sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174)  80.774 ms  81.030 ms  81.821 ms
11  sl-st20-ash-10-0.sprintlink.net (144.232.20.152)  75.731 ms  75.902 ms  77.128 ms
12  te-10-1-0.edge2.Washington4.level3.net (4.68.63.209)  46.548 ms  53.200 ms  45.736 ms
13  vlan69.csw1.Washington1.Level3.net (4.68.17.62)  42.918 ms
    vlan79.csw2.Washington1.Level3.net (4.68.17.126)  55.438 ms
    vlan69.csw1.Washington1.Level3.net (4.68.17.62)  42.693 ms
14  ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137)  48.935 ms
    ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129)  49.317 ms
    ae-91-91.ebr1.Washington1.Level3.net (4.69.134.141)  48.865 ms
15  ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85)  59.642 ms  56.278 ms  56.671 ms
16  ae-61-60.ebr1.Atlanta2.Level3.net (4.69.138.2)  47.401 ms  62.980 ms  62.640 ms
17  ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149)  40.300 ms  40.101 ms  42.690 ms
18  ae-6-6.car1.Orlando1.Level3.net (4.69.133.77)  40.959 ms  40.963 ms  41.016 ms
19  unknown.Level3.net (63.209.98.66)  246.744 ms  240.826 ms  239.758 ms
20  andc-br-3-f2-0.atlantic.net (209.208.9.138)  39.725 ms  37.751 ms  42.262 ms
21  ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98)  43.524 ms  45.844 ms  43.392 ms
22  * * *
23  sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174)  63.752 ms  61.648 ms  60.839 ms
24  sl-st20-ash-10-0.sprintlink.net (144.232.20.152)  66.923 ms  65.258 ms  70.609 ms
25  te-10-1-0.edge2.Washington4.level3.net (4.68.63.209)  67.106 ms  93.415 ms  73.932 ms
26  vlan99.csw4.Washington1.Level3.net (4.68.17.254)  88.919 ms  75.306 ms
    vlan79.csw2.Washington1.Level3.net (4.68.17.126)  75.048 ms
27  ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129)  69.508 ms  68.401 ms
    ae-71-71.ebr1.Washington1.Level3.net (4.69.134.133)  79.128 ms
28  ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85)  64.048 ms  67.764 ms  67.704 ms
29  ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18)  68.372 ms  67.025 ms  68.162 ms
30  ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149)  65.112 ms  65.584 ms  65.525 ms
Our circuit provider's support people have basically just maintained that
this behavior isn't possible and so there's nothing they can do about it;
i.e., the problem has to be something other than the circuit. I got tired
of talking to their brick wall, so I contacted Sprint and was able to
confirm with them that the traffic in question really was inexplicably
appearing on their network...and not terribly close geographically to the
Orlando/Ocala areas.
So, I have a circuit that's bleeding duplicate packets onto an unrelated
IP network, a circuit provider who's got their head in the sand and keeps
telling me "this can't happen, we can't help you", and customers who were
getting tired of receiving all their packets in triplicate (or more),
saturating their connections and confusing their applications. After a
while, I had to give up on finding the problem and focus on just making it
stop. After trying a couple of things, the solution I found was to change
the encapsulation we use at each end of the DS3. I haven't gotten
confirmation of this from Sprint, but I assume they're now seeing massive
input errors on the one or more circuits where our packets were/are
appearing. The important thing (for me) is that this makes the packets
invalid to Sprint's routers and so keeps them from forwarding the packets
to us. Cisco TAC finally got back to us the day after I "fixed" the
circuit...but since it was obviously not a problem with our cisco gear, I
haven't pursued it with them.
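For the curious, the encapsulation change is nothing exotic; it's a
one-line change on the serial interface at each end (the interface name
here is illustrative, and which direction you go, HDLC to PPP or the
reverse, shouldn't matter as long as both ends of the DS3 agree):

  ! on the 7206 at each end of the DS3
  interface Serial1/0
   description DS3 Orlando <-> Ocala
   encapsulation ppp
  ! (the default on these serial interfaces is "encapsulation hdlc")

Once the two ends are framing the link differently than before, the
copies leaking onto Sprint's circuit no longer parse as valid packets
there, which is what stops them from being forwarded back to us.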
The only things I can think of that might be the cause are a
misconfiguration in a DACS/mux somewhere along the circuit path or perhaps
a mishandled lawful intercept. I don't have enough experience with either,
or enough access to the systems that provide the circuit, to do any more
than speculate. Has anyone else ever seen anything like this?
If someone from Level3 transport can wrap their head around this, I'd love
to know what's really going on...but at least it's no longer an urgent
problem for me.
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________