Hi all,

I've been heading up the marketing team here at Tucows for all of four weeks now, and I'd been looking for an opportunity to jump into the conversation on the discussion lists. I guess I've got my wish (and I'll be more careful what I wish for next time).
This post is a bit longer than you may be used to, and it's definitely more than I'd typically put in one post, but I know many of you are looking for details on what exactly happened to Tucows earlier this week, along with our plans for minimizing the chances of it happening again.

I want to share as much as we comfortably can without tipping our hand to the attackers. They might be able to figure out our response, but that doesn't mean we have to make it easier for them. This, then, is a GENERAL discussion of what happened, the timeline, and our planned response. We also share our experience at a DETAILED level directly with other sites, service providers and partners who are or may be impacted by the same type of attack. We've decided to do this NOC-to-NOC and exec-to-exec rather than publicly. If you have thoughts on how we can communicate and co-ordinate defences without telling attackers what counter-measures we're implementing, I'd be happy to hear them.

Hopefully these details will answer most of your questions. Feel free to connect with me or others at Tucows if I've missed something.

THE SITUATION
=======================

A site using our Managed DNS Service was the target of an aggressive attack. Tucows was not the registrar of the domain under attack.

We believe the attack was a SYN flood aimed at our DNS servers as machines, not specifically at the DNS service they run. We believe the attackers' intent was to make the target's site inaccessible by making its DNS unavailable. This is conjecture, as we have no knowledge of the attackers' true intent.
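Since both the situation above and the timeline below turn on the mechanics of a SYN flood, a quick illustration may help. A SYN flood abuses the TCP handshake: the attacker fires connection-open (SYN) packets, often from many or spoofed addresses, faster than the server can complete or time out the half-open connections. One textbook counter-measure - the general family that the filtering rule mentioned at 20:00 in the timeline below belongs to - is to rate-limit new connection attempts per source. The Python sketch below is purely illustrative, with made-up thresholds; it is emphatically NOT our production rule, for the reasons given above.

    import time
    from collections import defaultdict

    RATE = 10.0   # sustained SYNs/second allowed per source (illustrative value)
    BURST = 20.0  # short burst allowance per source (illustrative value)

    def _fresh_bucket():
        return {"tokens": BURST, "last": time.monotonic()}

    buckets = defaultdict(_fresh_bucket)

    def allow_syn(src_ip):
        """Token-bucket check: True if a SYN from src_ip is within budget."""
        b = buckets[src_ip]
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True   # within budget: let the packet through
        return False      # over budget: drop it

The obvious weakness is worth noting too: an attacker who spoofs many source addresses gets a fresh bucket for each one, which is part of why floods like this are hard to filter naively.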
TIMELINE
=======================

(All times are Eastern Daylight Time, UTC-4.)

Wednesday May 3, 2006

12:30 - Internal network issues escalate to the Network Group within Operations. Network Operations determines the issue is with our Colocation Provider's network. The Colocation Provider informs Tucows that other hosted customers are experiencing similar issues.

13:00 - The network status page is updated to show degraded performance for all Tucows services (excluding Hosted E-mail).

13:30 - The Colocation Provider forms a SWAT team involving their firewall department, their network group and one Upstream Provider to troubleshoot further. No ETA is provided.

13:30 - 16:00 - Tucows maintains close contact with the Colocation Provider for status updates.

16:00 - The Colocation Provider informs Operations that the network issue should be resolved, as two of their Upstream Network Providers have resolved separate network incidents.

16:00 - 16:30 - Operations begins validating that Tucows services have returned to normal and notes an inbound bandwidth increase on the Managed DNS servers during the Colocation Provider's network outage. Operations determines the system (specifically NS1 and NS2.mdnsservice.com) was under DDoS attack; the tertiary server was not under attack. Upstream Providers confirm this by escalating the incident.

16:30 - 18:00 - Operations attempts several techniques to limit the attack but is unsuccessful due to the sophistication of the attack.

18:00 - The Colocation Provider contacts Operations to say they are in the process of blocking all Tucows IPs because the DDoS traffic is flooding their network. Tucows moves to its highest escalation level (i.e. Elliot).

18:00 - 23:00 - Elliot and Operations join a conference call with the Colocation Provider to work towards limiting or removing the blacklisting of the Tucows IP range.

20:00 - Operations installs a filtering rule that succeeds in reducing the inbound traffic.

23:00 - Through negotiation with the Upstream Network Providers, and given the progress Operations has made, the Upstream Network Provider reduces the blacklist to cover only the IPs of the Managed DNS servers (NS1 and NS2). Performance on all Tucows services (with the exception of Managed DNS) improves, and Operations begins validating services and performing post-mortem recovery steps. During validation, some RAs still experience service degradation because network latency pushes them over the registrar's maximum connection rates; Operations works with the registrars to restore service to the RAs. Operations determines it will re-IP the two Managed DNS servers to recover from the attack and get them removed from the Colocation Provider's blacklist.

Thursday May 4, 2006

00:00 - All RAs operating normally.

01:00 - The status page is updated to reflect the restoration of service for everything except Managed DNS.

03:30 - Operations completes the re-IPing and glue record updates for NS1 and NS2. This restores service to the majority of Managed DNS customers.

03:30 - 05:00 - Operations performs post-mortem Managed DNS activities.

05:00 onward - Operations requests that the target's registrar move the target's records away from Tucows, and begins contacting major ISPs to update their records to reflect the changes made to NS1/NS2. Operations continues to monitor the network and services for further issues. Some services remain invisible to some end users while the DNS updates propagate at their ISPs.
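One note on that last point, since "waiting for propagation" confuses a lot of people: re-IPing a nameserver means changing its glue records at the registry - the A records that tell the world where NS1 and NS2 actually live - and any resolver that cached the old addresses keeps using them until its cache entry expires. If you want to check whether a particular resolver has picked up a change like this, something along the lines of the sketch below works. It assumes the third-party dnspython package, and the IP addresses in it are documentation placeholders I made up for illustration, not our real ones.

    import dns.resolver  # third-party "dnspython" package (assumed installed)

    # Placeholder addresses for illustration only -- NOT the real new IPs.
    EXPECTED = {
        "ns1.mdnsservice.com": "192.0.2.10",
        "ns2.mdnsservice.com": "192.0.2.11",
    }

    def check_resolver(resolver_ip):
        """Ask one recursive resolver what it currently returns for each host."""
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [resolver_ip]
        for name, expected in EXPECTED.items():
            answer = r.resolve(name, "A")  # plain A-record lookup
            seen = {rr.address for rr in answer}
            status = "updated" if expected in seen else "stale (old address cached)"
            print(f"{resolver_ip}: {name} -> {sorted(seen)} [{status}]")

    check_resolver("198.51.100.53")  # substitute the resolver you want to test

Run that against a handful of large ISPs' resolvers and you get a rough picture of how far an update has spread.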
IMMEDIATE ACTIONS
=======================

1. Before this attack we were already working on a redeployment of our Managed DNS solution with an end-of-May delivery target. The new Managed DNS solution deploys five servers in a load-balanced configuration in each of three separate locations (15 servers in total: five each in Toronto, Denver and London, England). These servers are being shipped today (on schedule) and should be live by the end of the month (probably sooner).

2. We also already had plans in progress to bring in alternate bandwidth to our Colocation Provider. This will probably be live by mid-June.

LONG-TERM ACTIONS
=======================

1. Our Network Engineers are looking at additional routing, network design and device solutions we can implement to help avert attacks in the future.

2. We (okay, I) will be doing a complete review of all customer-facing communications - both in "situation mode" and for regular communications. Our goal is to ensure that we are always communicating in a clear and timely manner in your preferred channel.

I'd like to wrap up by reiterating a big thank you for your understanding during this unusual occurrence. As someone new to Tucows, it was thrilling to see the entire team spring into action to get this fixed as quickly as possible, and equally heartening to see all the understanding and offers of support we received from customers and friends in the industry.

Once again, please let me know (off-list or in public, as you see fit) if you have suggestions on how we can learn from this situation to improve communication with you in the future.

Cheers,

Ken Schafer
VP, Marketing
Tucows Inc.

_______________________________________________
domains-gen mailing list
[email protected]
http://discuss.tucows.com/mailman/listinfo/domains-gen