A strange one indeed, especially if you have no connectivity to Sprint
there.
Since your fix was at layer 2, you might be onto something. And you have the
time it happened, and as we all know - somebody changed somethin', even
if they won't fess up.
I'm trying to think how you could cause something like that with a
conventional DACS, or with one of the newer packet-friendly types that
might be more prone to a layer 2 bug since the software is fairly new.
'Course it would make more sense if it was just crossed Ethernet rather
than DS3 frames, but who knows. There are plenty of carriers putting
them in (including us).
Shame it's not the kind of thing you can duplicate without being
service-affecting.
George
Jon Lewis wrote:
After all the messages recently about how to fix DNS, I was seriously
tempted to title this message "And now, for something completely
different", but "impossible circuit" is more descriptive.
Before you read further, I need everyone to put on their
thinking-WAY-outside-the-box hats. I've heard from enough people already
that I'm nuts and what I'm seeing can't happen, so it must not be
happening...even though we see the results of it happening.
I've got this private line DS3. It connects cisco 7206 routers in
Orlando (at our data center) and in Ocala (a colo rack in the Embarq CO).
According to the DLR, it's a real circuit: various portions of it ride
OC circuits of varying sizes, and then it's handed off to us at each end
the usual way (copper/coax) and plugged into PA-2T3 cards.
Last Tuesday, at about 2:30PM, "something bad happened." We saw a
serious jump in traffic to Ocala, and in particular we noticed one
customer's connection (a group of load sharing T1s) was just totally
full. We quickly assumed it was a DDoS aimed at that customer, but
looking at the traffic, we couldn't pinpoint anything that wasn't
expected flows.
Then we noticed the really weird stuff. Pings to anything in Ocala
responded with multiple dupes and ttl exceeded messages from a Level3
IP. Traceroutes to certain IPs in Ocala would get as far as our Ocala
router, then inexplicably hop onto Sprintlink's network, come back to us
over our Level3 transit connection, get to Ocala, then hop over to
Sprintlink again, repeating that loop as many times as max TTL would
permit. Pings from router to router crossing just the DS3 would work,
but we'd see 10 duplicate packets for every 1 expected packet. BTW, the
cisco CLI hides dupes unless you turn on ip icmp debugging.
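If you want to see the dupes for yourself, it's roughly this (the prompt
and address below are placeholders, not our actual gear):

    router# terminal monitor
    router# debug ip icmp
    router# ping 192.0.2.2 repeat 1
    ! ping still reports one request and one reply, but with icmp
    ! debugging on, every extra echo reply that reaches the router is
    ! logged, so you can actually count the duplicates
    router# undebug all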
I've seen somewhat similar things (though contained within an AS)
with MPLS and routing misconfigurations, but traffic jumping off our
network (to a network to which we're not directly connected) was
seemingly impossible. We did all sorts of things to troubleshoot it
(studied our router configs in rancid, temporarily shut every interface
on the Ocala side other than the DS3, changed IOS versions, changed out
the hardware, opened a ticket with cisco TAC), but then it occurred to
me that if traffic was actually jumping off our network and coming back
in via Level3, I could see/block at least some of that using an ACL on
our interface to Level3. How do you explain it, when you ping the
remote end of a DS3 interface with a single echo request packet and see
5 copies of that echo request arrive at one of your transit provider
interfaces?
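To make that concrete, the filter was along these lines; the interface
number and addresses are placeholders (RFC 5737 space), not our real
config:

    ! 192.0.2.1 and 192.0.2.2 stand in for the two ends of the DS3;
    ! our own point-to-point traffic should never arrive from Level3
    ip access-list extended CATCH-LEAK
     permit icmp host 192.0.2.1 host 192.0.2.2 echo
     permit ip any any
    !
    interface GigabitEthernet0/1
     description transit to Level3
     ip access-group CATCH-LEAK in

"show ip access-lists CATCH-LEAK" then shows the counter on the first
entry climbing every time a lone ping crosses the DS3, and flipping that
entry to deny is the blocking half of it.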
Here's a typical traceroute with the first few hops (from my home
internet connection) removed. BTW, hop 9 is a customer router
conveniently configured with no ip unreachables.
 7  andc-br-3-f2-0.atlantic.net (209.208.9.138)  47.951 ms  56.096 ms  56.154 ms
 8  ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98)  56.199 ms  56.320 ms  56.196 ms
 9  * * *
10  sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174)  80.774 ms  81.030 ms  81.821 ms
11  sl-st20-ash-10-0.sprintlink.net (144.232.20.152)  75.731 ms  75.902 ms  77.128 ms
12  te-10-1-0.edge2.Washington4.level3.net (4.68.63.209)  46.548 ms  53.200 ms  45.736 ms
13  vlan69.csw1.Washington1.Level3.net (4.68.17.62)  42.918 ms
    vlan79.csw2.Washington1.Level3.net (4.68.17.126)  55.438 ms
    vlan69.csw1.Washington1.Level3.net (4.68.17.62)  42.693 ms
14  ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137)  48.935 ms
    ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129)  49.317 ms
    ae-91-91.ebr1.Washington1.Level3.net (4.69.134.141)  48.865 ms
15  ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85)  59.642 ms  56.278 ms  56.671 ms
16  ae-61-60.ebr1.Atlanta2.Level3.net (4.69.138.2)  47.401 ms  62.980 ms  62.640 ms
17  ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149)  40.300 ms  40.101 ms  42.690 ms
18  ae-6-6.car1.Orlando1.Level3.net (4.69.133.77)  40.959 ms  40.963 ms  41.016 ms
19  unknown.Level3.net (63.209.98.66)  246.744 ms  240.826 ms  239.758 ms
20  andc-br-3-f2-0.atlantic.net (209.208.9.138)  39.725 ms  37.751 ms  42.262 ms
21  ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98)  43.524 ms  45.844 ms  43.392 ms
22  * * *
23  sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174)  63.752 ms  61.648 ms  60.839 ms
24  sl-st20-ash-10-0.sprintlink.net (144.232.20.152)  66.923 ms  65.258 ms  70.609 ms
25  te-10-1-0.edge2.Washington4.level3.net (4.68.63.209)  67.106 ms  93.415 ms  73.932 ms
26  vlan99.csw4.Washington1.Level3.net (4.68.17.254)  88.919 ms  75.306 ms
    vlan79.csw2.Washington1.Level3.net (4.68.17.126)  75.048 ms
27  ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129)  69.508 ms  68.401 ms
    ae-71-71.ebr1.Washington1.Level3.net (4.69.134.133)  79.128 ms
28  ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85)  64.048 ms  67.764 ms  67.704 ms
29  ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18)  68.372 ms  67.025 ms  68.162 ms
30  ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149)  65.112 ms  65.584 ms  65.525 ms
Our circuit provider's support people have basically just maintained
that this behavior isn't possible and so there's nothing they can do
about it, i.e., that the problem has to be something other than the circuit.
I got tired of talking to their brick wall, so I contacted Sprint and
was able to confirm with them that the traffic in question really was
inexplicably appearing on their network...and not terribly close
geographically to the Orlando/Ocala areas.
So, I have a circuit that's bleeding duplicate packets onto an unrelated
IP network, a circuit provider who's got their head in the sand and
keeps telling me "this can't happen, we can't help you", and customers
who were getting tired of receiving all their packets in triplicate (or
more) saturating their connections and confusing their applications.
After a while, I had to give up on finding the problem and focus on just
making it stop. After trying a couple of things, the solution I found
was to change the encapsulation we use at each end of the DS3. I
haven't gotten confirmation of this from Sprint, but I assume they're
now seeing massive input errors on the one or more circuits where our
packets were/are appearing. The important thing (for me) is that this
makes the packets invalid to Sprint's routers and so it keeps them from
forwarding the packets to us. Cisco TAC finally got back to us the day
after I "fixed" the circuit...but since it was obviously not a problem
with our cisco gear, I haven't pursued it with them.
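For the record, the change itself is a one-liner at each end; something
like this, assuming the serials were running the default HDLC and you
move both ends to PPP (interface numbers are placeholders):

    interface Serial1/0
     description DS3 to Ocala
     encapsulation ppp
    ! with both ends off the default HDLC, the two 7206s still understand
    ! each other, but any copies of our frames leaking onto Sprint's
    ! circuit(s) should no longer parse as valid packets there, so their
    ! routers have nothing to forward back to us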
The only things I can think of that might be the cause are
misconfiguration in a DACS/mux somewhere along the circuit path or
perhaps a mishandled lawful intercept. I don't have enough experience
with either or enough access to the systems that provide the circuit to
do any more than speculate. Has anyone else ever seen anything like this?
If someone from Level3 transport can wrap their head around this, I'd
love to know what's really going on...but at least it's no longer an
urgent problem for me.
----------------------------------------------------------------------
Jon Lewis | I route
Senior Network Engineer | therefore you are
Atlantic Net |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________