Hi all,

I've been heading up the marketing team here at Tucows for all of four weeks now, and I'd been looking for an opportunity to jump into the conversation on the discussion lists. I guess I've got my wish (and I'll be more careful what I wish for next time).
This post is a bit longer than you may be used to, and it's definitely more than I'd typically put in one post, but I know many of you are looking for details on what exactly happened to Tucows earlier this week, along with our plans for minimizing the chances of it happening again.

I want to share as much as we comfortably can without tipping our hand to the attackers. They might be able to figure out our response, but that doesn't mean we have to make it easier for them. This, then, is a GENERAL discussion of what happened, the timeline, and our planned response. We also share our experience at a DETAILED level directly with other sites, service providers and partners who are or may be impacted by the same type of attack. We've decided to do this NOC-to-NOC and exec-to-exec rather than publicly. If you have thoughts on how we can communicate and co-ordinate defences without telling attackers what counter-measures we're implementing, I'd be happy to hear them.

Hopefully these details will answer most of your questions. Feel free to connect with me or others at Tucows if I've missed something.

THE SITUATION
=======================

A site using our Managed DNS Service was the target of an aggressive attack. Tucows was not the registrar of the domain under attack.

We believe the attack was a SYN flood aimed at our DNS servers as machines, not specifically at the DNS service they run. We believe the attackers' intent was to make the target's site inaccessible by making its DNS unavailable. This is conjecture, as we have no knowledge of the attackers' true intent.
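Since both the situation above and the timeline below turn on the mechanics of a SYN flood, a quick illustration may help. A SYN flood abuses the TCP handshake: the attacker fires connection-open (SYN) packets, often from many or spoofed addresses, faster than the server can complete or time out the half-open connections. One textbook counter-measure - the general family that the filtering rule mentioned at 20:00 in the timeline below belongs to - is to rate-limit new connection attempts per source. The Python sketch below is purely illustrative, with made-up thresholds; it is emphatically NOT our production rule, for the reasons given above.

    import time
    from collections import defaultdict

    RATE = 10.0   # sustained SYNs/second allowed per source (illustrative value)
    BURST = 20.0  # short burst allowance per source (illustrative value)

    def _fresh_bucket():
        return {"tokens": BURST, "last": time.monotonic()}

    buckets = defaultdict(_fresh_bucket)

    def allow_syn(src_ip):
        """Token-bucket check: True if a SYN from src_ip is within budget."""
        b = buckets[src_ip]
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True   # within budget: let the packet through
        return False      # over budget: drop it

The obvious weakness is worth noting too: an attacker who spoofs many source addresses gets a fresh bucket for each one, which is part of why floods like this are hard to filter naively.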
TIMELINE
=======================

(All times are Eastern Daylight Time, UTC-4.)

Wednesday May 3, 2006

12:30 - Internal network issues escalate to the Network Group within Operations. Network Operations determines the issue is with our Colocation Provider's network. The Colocation Provider informs Tucows that other hosted customers are experiencing similar issues.

13:00 - The network status page is updated to show degraded performance for all Tucows services (excluding Hosted E-mail).

13:30 - The Colocation Provider forms a SWAT team involving their firewall department, their network group and one Upstream Provider to troubleshoot further. No ETA is provided.

13:30 - 16:00 - Tucows maintains close contact with the Colocation Provider for status updates.

16:00 - The Colocation Provider informs Operations that the network issue should be resolved, as two of their Upstream Network Providers have resolved separate network incidents.

16:00 - 16:30 - Operations begins validating that Tucows services have returned to normal and notes an inbound bandwidth increase on the Managed DNS servers during the Colocation Provider's network outage. Operations determines the system (specifically NS1 and NS2.mdnsservice.com) was under DDoS attack; the tertiary server was not under attack. Upstream Providers confirm this by escalating the incident.

16:30 - 18:00 - Operations attempts several techniques to limit the attack but is unsuccessful due to the sophistication of the attack.

18:00 - The Colocation Provider contacts Operations to say they are in the process of blocking all Tucows IPs because the DDoS traffic is flooding their network. Tucows moves to its highest escalation level (i.e. Elliot).

18:00 - 23:00 - Elliot and Operations join a conference call with the Colocation Provider to work towards limiting or removing the blacklisting of the Tucows IP range.

20:00 - Operations installs a filtering rule that succeeds in reducing the inbound traffic.

23:00 - Through negotiation with the Upstream Network Providers, and given the progress Operations has made, the Upstream Network Provider reduces the blacklist to cover only the IPs of the Managed DNS servers (NS1 and NS2). Performance on all Tucows services (with the exception of Managed DNS) improves, and Operations begins validating services and performing post-mortem recovery steps. During validation, some RAs still experience service degradation because network latency pushes them over the registrar's maximum connection rates; Operations works with the registrars to restore service to the RAs. Operations determines it will re-IP the two Managed DNS servers to recover from the attack and get them removed from the Colocation Provider's blacklist.

Thursday May 4, 2006

00:00 - All RAs operating normally.

01:00 - The status page is updated to reflect the restoration of service for everything except Managed DNS.

03:30 - Operations completes the re-IPing and glue record updates for NS1 and NS2. This restores service to the majority of Managed DNS customers.

03:30 - 05:00 - Operations performs post-mortem Managed DNS activities.

05:00 onward - Operations requests that the target's registrar move the target's records away from Tucows, and begins contacting major ISPs to update their records to reflect the changes made to NS1/NS2. Operations continues to monitor the network and services for further issues. Some services remain invisible to some end users while the DNS updates propagate at their ISPs.
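One note on that last point, since "waiting for propagation" confuses a lot of people: re-IPing a nameserver means changing its glue records at the registry - the A records that tell the world where NS1 and NS2 actually live - and any resolver that cached the old addresses keeps using them until its cache entry expires. If you want to check whether a particular resolver has picked up a change like this, something along the lines of the sketch below works. It assumes the third-party dnspython package, and the IP addresses in it are documentation placeholders I made up for illustration, not our real ones.

    import dns.resolver  # third-party "dnspython" package (assumed installed)

    # Placeholder addresses for illustration only -- NOT the real new IPs.
    EXPECTED = {
        "ns1.mdnsservice.com": "192.0.2.10",
        "ns2.mdnsservice.com": "192.0.2.11",
    }

    def check_resolver(resolver_ip):
        """Ask one recursive resolver what it currently returns for each host."""
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [resolver_ip]
        for name, expected in EXPECTED.items():
            answer = r.resolve(name, "A")  # plain A-record lookup
            seen = {rr.address for rr in answer}
            status = "updated" if expected in seen else "stale (old address cached)"
            print(f"{resolver_ip}: {name} -> {sorted(seen)} [{status}]")

    check_resolver("198.51.100.53")  # substitute the resolver you want to test

Run that against a handful of large ISPs' resolvers and you get a rough picture of how far an update has spread.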
IMMEDIATE ACTIONS
=======================

1. Before this attack we were already working on a redeployment of our Managed DNS solution with an end-of-May delivery target. The new Managed DNS solution deploys five servers in a load-balanced configuration in each of three separate locations (15 servers in total: five each in Toronto, Denver and London, England). These servers are being shipped today (on schedule) and should be live by the end of the month (probably sooner).

2. We also already had plans in progress to bring in alternate bandwidth to our Colocation Provider. This will probably be live by mid-June.

LONG-TERM ACTIONS
=======================

1. Our Network Engineers are looking at additional routing, network design and device solutions we can implement to help avert attacks in the future.

2. We (okay, I) will be doing a complete review of all customer-facing communications - both in "situation mode" and for regular communications. Our goal is to ensure that we are always communicating in a clear and timely manner in your preferred channel.

I'd like to wrap up by reiterating a big thank you for your understanding during this unusual occurrence. As someone new to Tucows, it was thrilling to see the entire team spring into action to get this fixed as quickly as possible, and equally heartening to see all the understanding and offers of support we received from customers and friends in the industry.

Once again, please let me know (off-list or in public, as you see fit) if you have suggestions on how we can learn from this situation to improve communication with you in the future.

Cheers,

Ken Schafer
VP, Marketing
Tucows Inc.

_______________________________________________
domains-gen mailing list
[email protected]
http://discuss.tucows.com/mailman/listinfo/domains-gen