Something about Malaysia, first the airplanes... now BGP leaks?

On Fri, Jun 12, 2015 at 10:32 AM, Martin Millnert <milln...@gmail.com> wrote:
> Dear Level3,
>
> The Internet is a cooperative effort, and it works well only when its
> participants take constructive actions to address errors and remedy
> problems.
> Your position as a major Internet Carrier bestows upon you a certain
> degree of responsibility for the correct operation of the Internet all
> across (and beyond) the planet. You have many customers. Customers will
> always occasionally make mistakes. You as a major Internet Carrier have
> a responsibility to limit, not amplify, your customers' mistakes.
> Other major carriers implement technical measures that severely limit
> the damage from customer mistakes and keep it from having global
> impact.
> Other major carriers also implement operational procedures in addition
> to technical measures.
> In combination, these measures drastically reduce the outage-hours that
> result from customer configuration errors.
>
> At 08:44 UTC on Friday 12th of June, one of your transit customers,
> Telekom Malaysia (AS4788), began announcing the full Internet table back
> to you, which you accepted and propagated to your peers and customers,
> causing global outages for close to 3 hours.
> [ https://twitter.com/DynResearch/status/609340592036970496 ]
> During this 3-hour window, it appears (from your own service outage
> reports) that you did nothing to stop the global Internet outage, but
> that Telekom Malaysia themselves eventually resolved it. This lack of
> action on your end, and your disregard for the correct operation of the
> global Internet, is astonishing. These mistakes do not need to happen.
> AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
> Internet. You accepted several hundred thousand prefixes from them - a
> max prefix setting would have severely limited the damage. We expect
> that these are your practices as well, but they failed. When they do, it
> should not take ~3 hours to shut down the session(s).
>
> Many operators, in despair, turned down their peering sessions with you
> once it was clear you were causing the outages and no immediate fix was
> in sight. This improved the situation for some, but not for all. Had
> you deployed proper IRR-filtering to filter the bad announcements, the
> impact would've been far less critical.
>
> As a direct consequence of your ~3 hours of inaction, as a local
> example, Swedish payment terminals were experiencing problems all over
> the country. The Swedish economy was directly affected by your inaction.
> There were queues when I was buying lunch! Imagine the food rage. The
> situation was probably similar at other places around the globe where
> people were awake.
>
> Operators around the planet are curious:
> - Did Level3 not detect or understand that it was causing global
>   Internet outages for ~3 hours?
> - If Level3 did in fact detect or understand it was causing global
>   Internet outages, why did it not properly and immediately remedy the
>   situation?
> - What is Level3 going to do to address these questions and begin work
>   on restoring its credibility as a carrier?
>
> We all understand that mistakes do happen (in applying customer
> interface templates, etc.). However, the Internet is all too pervasive
> in everyday life today for anything but swift action by carriers to
> remedy breakage after the fact. It is absolutely not sufficient to let a
> customer spend 3 hours to detect and fix a situation like this one. It
> is unacceptable that no swift action was taken on your end to limit the
> global routing issues you caused.
>
> Sincerely,
> Martin Millnert
> Member of Internet Community - no carrier / ISP affiliation.
>
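
For anyone following along who has not configured these safeguards: the two measures named in the quoted message, a max prefix setting and IRR-based filtering, can be illustrated with a toy model. This is only a sketch in Python, not router configuration and not Level3's or anyone else's actual implementation. The class name CustomerSession, its fields, and the choice to count prefixes before filtering are all assumptions made for illustration; the prefixes in the demo are documentation ranges, not real announcements.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class CustomerSession:
    """Hypothetical model of one customer BGP session (not a real router API)."""
    asn: int
    max_prefixes: int                              # assumed per-session limit
    allowed: list = field(default_factory=list)    # assumed IRR-derived allow-list
    received: set = field(default_factory=set)     # every distinct prefix seen
    accepted: set = field(default_factory=set)     # prefixes that passed the filter
    established: bool = True

    def permits(self, prefix):
        """True if prefix equals, or is more specific than, an allow-list entry."""
        return any(
            prefix.version == net.version and prefix.subnet_of(net)
            for net in self.allowed
        )

    def receive(self, prefix_text):
        """Handle one announcement: 'accepted', 'filtered', or 'session-down'."""
        if not self.established:
            return "session-down"
        prefix = ipaddress.ip_network(prefix_text)
        self.received.add(prefix)
        # Max-prefix safeguard: counted before filtering here (an assumption).
        if len(self.received) > self.max_prefixes:
            self.established = False
            self.accepted.clear()      # withdraw everything learned on the session
            return "session-down"
        # IRR-style safeguard: drop anything the customer is not registered for.
        if not self.permits(prefix):
            return "filtered"
        self.accepted.add(prefix)
        return "accepted"


if __name__ == "__main__":
    # Illustrative values only: a tiny limit and one registered block.
    session = CustomerSession(
        asn=64496,
        max_prefixes=2,
        allowed=[ipaddress.ip_network("198.51.100.0/24")],
    )
    for text in ["198.51.100.0/25", "192.0.2.0/24", "203.0.113.0/24"]:
        print(text, session.receive(text))
```

In this model either safeguard alone limits a leak: the allow-list keeps unregistered routes from being accepted, and the limit tears the session down once the customer sends far more than expected. For a customer that normally announces ~1900 prefixes, the limit would sit somewhat above that figure; the exact margin is an operator's judgment call.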