Re: Better description of what happened

2021-10-06 Thread Hugo Slabbert
>
> Do we actually know this wrt the tools referred to in "the total loss of
> DNS broke many of the tools we’d normally use to investigate and resolve
> outages like this."?  Those tools aren't necessarily located in any of
> the remote data centers, and some of them might even refer to resources
> outside the facebook network.


Yea; that's kinda the thinking here.  Specifics are scarce, but there were
notes re: the OOB for instance also being unusable.  The questions are how
much that was due to dependence of the OOB network on the production side,
and how much DNS being notionally available might have supported getting
things back off the ground (if it would just provide mgt addresses for key
devices, or if perhaps there was an AAA dependency that also rode on DNS).
This isn't to say there aren't other design considerations in play to make
that fly (e.g. if DNS lives in edge POPs, and such an edge POP gets
isolated from the FB network but still has public Internet peering, how do
we ensure that edge POP does not continue exporting the DNS prefix into the
DFZ and serving stale records?), but that's perhaps also still solvable.
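
For the sake of discussion, a very rough sketch of what such a conditional
export could look like at a POP: keep announcing the anycast DNS prefix
while the backbone is visible, depref it while answers may be stale, and
only stop exporting after prolonged isolation.  The prefixes, canary
addresses and thresholds below are all made up, and an ExaBGP-style
announce/withdraw text API is assumed; this is emphatically not how FB
actually does it.

    # Hypothetical per-POP conditional export of the anycast DNS prefix,
    # written as a helper process that prints ExaBGP-style commands.
    import subprocess, sys, time

    ANYCAST_DNS_PREFIX = "192.0.2.0/24"                   # example prefix
    BACKBONE_CANARIES = ["198.51.100.1", "198.51.100.2"]  # made-up backbone targets
    STALE_AFTER = 300      # seconds without backbone before we depref
    WITHDRAW_AFTER = 3600  # seconds without backbone before we stop exporting

    def backbone_reachable() -> bool:
        """Ping a couple of backbone canaries; any success counts."""
        for ip in BACKBONE_CANARIES:
            # Linux iputils ping flags: one probe, one-second timeout
            if subprocess.call(["ping", "-c", "1", "-W", "1", ip],
                               stdout=subprocess.DEVNULL) == 0:
                return True
        return False

    last_ok = time.time()
    state = None
    while True:
        if backbone_reachable():
            last_ok = time.time()
        age = time.time() - last_ok
        if age < STALE_AFTER:
            want = "announce route %s next-hop self med 100" % ANYCAST_DNS_PREFIX
        elif age < WITHDRAW_AFTER:
            # still answering (possibly stale), but only as a path of last resort
            want = "announce route %s next-hop self med 500" % ANYCAST_DNS_PREFIX
        else:
            want = "withdraw route %s next-hop self" % ANYCAST_DNS_PREFIX
        if want != state:
            sys.stdout.write(want + "\n")
            sys.stdout.flush()
            state = want
        time.sleep(5)

Whether a depref via MED (or prepending) is actually honoured far enough
into the DFZ is its own question, of course.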

> I'm sure they'll learn from this and in the future have some better things
> in place to account for such a scenario.


100%

I think we can say with some level of confidence that there is going to be
a *lot* of discussion and re-evaluation of inter-service dependencies.

-- 
Hugo Slabbert


On Wed, Oct 6, 2021 at 9:48 AM Tom Beecher  wrote:

> I mean, at the end of the day they likely designed these systems to be
> able to handle one or more datacenters being disconnected from the world,
> and considered a scenario of ALL their datacenters being disconnected from
> the world so unlikely they chose not to solve for it. Works great, until it
> doesn't.
>
> I'm sure they'll learn from this and in the future have some better
> things in place to account for such a scenario.
>
> On Wed, Oct 6, 2021 at 12:21 PM Bjørn Mork  wrote:
>
>> Tom Beecher  writes:
>>
>> >  Even if the external
>> > announcements were not withdrawn, and the edge DNS servers could provide
>> > stale answers, the IPs those answers provided wouldn't have actually
>> been
>> > reachable
>>
>> Do we actually know this wrt the tools referred to in "the total loss of
>> DNS broke many of the tools we’d normally use to investigate and resolve
>> outages like this."?  Those tools aren't necessarily located in any of
>> the remote data centers, and some of them might even refer to resources
>> outside the facebook network.
>>
>> Not to mention that keeping the DNS service up would have prevented
>> resolver overload in the rest of the world.
>>
>> Besides, the disconnected frontend servers are probably configured to
>> display a "we have a slight technical issue. will be right back" notice
>> in such situations.  This is a much better user experience than the
>> "facebook?  never heard of it" message we got on Monday.
>>
>> Yes, it makes sense to keep your domains alive even if your network
>> isn't.  That's why the best practice is name servers in more than one
>> AS.
>>
>>
>>
>>
>> Bjørn
>>
>


Re: Better description of what happened

2021-10-06 Thread Tom Beecher
I mean, at the end of the day they likely designed these systems to be able
to handle one or more datacenters being disconnected from the world, and
considered a scenario of ALL their datacenters being disconnected from the
world so unlikely they chose not to solve for it. Works great, until it
doesn't.

I'm sure they'll learn from this and in the future have some better
things in place to account for such a scenario.

On Wed, Oct 6, 2021 at 12:21 PM Bjørn Mork  wrote:

> Tom Beecher  writes:
>
> >  Even if the external
> > announcements were not withdrawn, and the edge DNS servers could provide
> > stale answers, the IPs those answers provided wouldn't have actually been
> > reachable
>
> Do we actually know this wrt the tools referred to in "the total loss of
> DNS broke many of the tools we’d normally use to investigate and resolve
> outages like this."?  Those tools aren't necessarily located in any of
> the remote data centers, and some of them might even refer to resources
> outside the facebook network.
>
> Not to mention that keeping the DNS service up would have prevented
> resolver overload in the rest of the world.
>
> Besides, the disconnected frontend servers are probably configured to
> display a "we have a slight technical issue. will be right back" notice
> in such situations.  This is a much better user experience than the
> "facebook?  never heard of it" message we got on Monday.
>
> Yes, it makes sense to keep your domains alive even if your network
> isn't.  That's why the best practice is name servers in more than one
> AS.
>
>
>
>
> Bjørn
>


Re: Better description of what happened

2021-10-06 Thread Bjørn Mork
Tom Beecher  writes:

>  Even if the external
> announcements were not withdrawn, and the edge DNS servers could provide
> stale answers, the IPs those answers provided wouldn't have actually been
> reachable

Do we actually know this wrt the tools referred to in "the total loss of
DNS broke many of the tools we’d normally use to investigate and resolve
outages like this."?  Those tools aren't necessarily located in any of
the remote data centers, and some of them might even refer to resources
outside the facebook network.

Not to mention that keeping the DNS service up would have prevented
resolver overload in the rest of the world.

Besides, the disconnected frontend servers are probably configured to
display a "we have a slight technical issue. will be right back" notice
in such situations.  This is a much better user experience than the
"facebook?  never heard of it" message we got on Monday.

Yes, it makes sense to keep your domains alive even if your network
isn't.  That's why the best practice is name servers in more than one
AS.
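
If you want to sanity-check that for a zone, something like the sketch
below works; it assumes dnspython and the Team Cymru origin.asn.cymru.com
TXT lookup format, IPv4 only, with no error handling.

    # Rough check: do a zone's nameservers originate from more than one AS?
    import dns.resolver

    def origin_asn(ipv4: str) -> str:
        """Map an IPv4 address to its origin AS via Team Cymru's TXT service."""
        rev = ".".join(reversed(ipv4.split(".")))
        answer = dns.resolver.resolve(rev + ".origin.asn.cymru.com", "TXT")
        txt = next(iter(answer)).strings[0].decode()
        return txt.split("|")[0].strip()

    def ns_asns(zone: str) -> dict:
        """Return {origin AS: set of nameserver names} for the zone."""
        asns = {}
        for ns in dns.resolver.resolve(zone, "NS"):
            name = ns.target.to_text()
            for a in dns.resolver.resolve(name, "A"):
                asns.setdefault(origin_asn(a.address), set()).add(name)
        return asns

    if __name__ == "__main__":
        result = ns_asns("example.com.")
        print(result)
        if len(result) < 2:
            print("warning: all nameservers originate from a single AS")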




Bjørn


Re: Better description of what happened

2021-10-06 Thread PJ Capelli via NANOG
I probably still have my US Robotics 14.4 in the basement, but it's been awhile 
since I've had access to a POTS line it would work on ... :)

pj capelli
pjcape...@pm.me

"Never to get lost, is not living" - Rebecca Solnit

Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐

On Wednesday, October 6th, 2021 at 10:41 AM, Curtis Maurand 
 wrote:

> On 10/5/21 5:51 AM, scott wrote:
> 

> > On 10/5/21 8:39 PM, Michael Thomas wrote:
> > 

> > > This bit posted by Randy might get lost in the other thread, but it 
> > > appears that their DNS withdraws BGP routes for prefixes that they can't 
> > > reach or are flaky it seems. Apparently that goes for the prefixes that 
> > > the name servers are on too. This caused internal outages too as it seems 
> > > they use their front facing DNS just like everybody else.
> > > 

> > > Sounds like they might consider having at least one split horizon server 
> > > internally. Lots of fodder here.
> 

> even a POTS line connected to a modem connected to a serial port on a 
> workstation in the data center so that you can talk to whatever you need to
> talk to.  I would go so far as to have other outgoing serial connections to 
> routers from that workstation. It's ugly, but it provides remote out of band 
> disaster management.  Just sayin'
> 

> > 
> > 

> > Move fast; break things? :)
> > 

> > scott
> > 

> > > >



Re: Better description of what happened

2021-10-06 Thread Tom Beecher
From what they have said publicly, the initial trigger point was that all of
their datacenters were disconnected from their internal backbone, thus
unreachable.

Once that occurs, nothing else really matters. Even if the external
announcements were not withdrawn, and the edge DNS servers could provide
stale answers, the IPs those answers provided wouldn't have actually been
reachable, and there wouldn't be 3 days of red herring conversations about
DNS design.

No DNS design exists that can help people reach resources not network
reachable. /shrug


On Tue, Oct 5, 2021 at 6:30 PM Hugo Slabbert  wrote:

> Had some chats with other folks:
> Arguably you could change the nameserver isolation check failure action to
> be "depref your exports" rather than "yank it all".  Basically, set up a
> tiered setup so the boxes passing those additional health checks and that
> should have correct entries would be your primary destination and failing
> nodes shouldn't receive query traffic since they're depref'd in your
> internal routing.  But in case all nodes fail that check simultaneously,
> those nodes failing the isolation check would attract traffic again as no
> better paths remain.  Better to serve stale data than none at all; CAP
> theorem trade-offs at work?
>
> --
> Hugo Slabbert
>
>
> On Tue, Oct 5, 2021 at 3:22 PM Michael Thomas  wrote:
>
>>
>> On 10/5/21 3:09 PM, Andy Brezinsky wrote:
>>
>> It's a few years old, but Facebook has talked a little bit about their
>> DNS infrastructure before.  Here's a little clip that talks about
>> Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073
>>
>> From their outage report, it sounds like their authoritative DNS servers
>> withdraw their anycast announcements when they're unhealthy.  The health
>> check from those servers must have relied on something upstream.  Maybe
>> they couldn't talk to Cartographer for a few minutes so they thought they
>> might be isolated from the rest of the network and they decided to withdraw
>> their routes instead of serving stale data.  Makes sense when a single node
>> does it, not so much when the entire fleet thinks that they're out on their
>> own.
>>
>> A performance issue in Cartographer (or whatever manages this fleet these
>> days) could have been the ticking time bomb that set the whole thing in
>> motion.
>>
>> Rereading it, it's said that their internal (?) backbone went down so
>> pulling the routes was arguably the right thing to do. Or at least not flat
>> out wrong. Taking out their nameserver subnets was clearly a problem
>> though, though a fix is probably tricky since you clearly want to take down
>> errant nameservers too.
>>
>>
>> Mike
>>
>>
>>
>>
>>


Re: Better description of what happened

2021-10-06 Thread Curtis Maurand



On 10/5/21 5:51 AM, scott wrote:



On 10/5/21 8:39 PM, Michael Thomas wrote:


This bit posted by Randy might get lost in the other thread, but it 
appears that their DNS withdraws BGP routes for prefixes that they 
can't reach or are flaky it seems. Apparently that goes for the 
prefixes that the name servers are on too. This caused internal 
outages too as it seems they use their front facing DNS just like 
everybody else.


Sounds like they might consider having at least one split horizon 
server internally. Lots of fodder here.




even a POTS line connected to a modem connected to a serial port on a 
workstation in the data center so that you can talk to whatever you need 
to talk to.  I would go so far as to have other outgoing serial 
connections to routers from that workstation. It's ugly, but it provides 
remote out of band disaster management.  Just sayin'
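
The workstation side of that can be as simple as the sketch below (pyserial
assumed; the device path, speed and the IOS-style command are placeholders,
adjust for the real gear).

    # Last-resort console access: no DNS, AAA or backbone involved.
    import serial  # pyserial

    CONSOLE_PORT = "/dev/ttyS1"  # serial line cabled to the router's console
    BAUD = 9600

    def grab_console(command: bytes = b"show ip interface brief\r\n") -> bytes:
        with serial.Serial(CONSOLE_PORT, BAUD, timeout=2) as console:
            console.write(b"\r\n")   # wake the console and get a prompt
            console.read(256)        # discard the banner/prompt
            console.write(command)
            return console.read(4096)

    if __name__ == "__main__":
        print(grab_console().decode(errors="replace"))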







Move fast; break things? :)


scott


Re: Better description of what happened

2021-10-05 Thread Hugo Slabbert
Had some chats with other folks:
Arguably you could change the nameserver isolation check failure action to
be "depref your exports" rather than "yank it all".  Basically, set up a
tiered setup so the boxes passing those additional health checks and that
should have correct entries would be your primary destination and failing
nodes shouldn't receive query traffic since they're depref'd in your
internal routing.  But in case all nodes fail that check simultaneously,
those nodes failing the isolation check would attract traffic again as no
better paths remain.  Better to serve stale data than none at all; CAP
theorem trade-offs at work?
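
As a toy illustration of that failure action (pure decision logic, made-up
names and health-check fields, nothing FB-specific):

    from dataclasses import dataclass

    @dataclass
    class HealthState:
        answers_queries: bool          # daemon up and answering locally
        passes_isolation_check: bool   # can see the management/control plane

    def export_action(h: HealthState) -> str:
        """Map a node's health to an anycast export action.

        Healthy nodes export at normal preference and attract queries.
        Nodes failing only the isolation check stay in the table but
        depref'd, so they attract traffic only if no healthy node is left
        anywhere: serving possibly-stale data beats serving none.
        """
        if not h.answers_queries:
            return "withdraw"          # truly broken: pull the route
        if h.passes_isolation_check:
            return "announce-normal"   # preferred path
        return "announce-depref"       # path of last resort

    # Example: a fleet where every node fails the isolation check at once.
    fleet = [HealthState(True, False)] * 4
    assert {export_action(h) for h in fleet} == {"announce-depref"}
    # Queries still land somewhere, unlike the withdraw-everything case.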

-- 
Hugo Slabbert


On Tue, Oct 5, 2021 at 3:22 PM Michael Thomas  wrote:

>
> On 10/5/21 3:09 PM, Andy Brezinsky wrote:
>
> It's a few years old, but Facebook has talked a little bit about their DNS
> infrastructure before.  Here's a little clip that talks about Cartographer:
> https://youtu.be/bxhYNfFeVF4?t=2073
>
> From their outage report, it sounds like their authoritative DNS servers
> withdraw their anycast announcements when they're unhealthy.  The health
> check from those servers must have relied on something upstream.  Maybe
> they couldn't talk to Cartographer for a few minutes so they thought they
> might be isolated from the rest of the network and they decided to withdraw
> their routes instead of serving stale data.  Makes sense when a single node
> does it, not so much when the entire fleet thinks that they're out on their
> own.
>
> A performance issue in Cartographer (or whatever manages this fleet these
> days) could have been the ticking time bomb that set the whole thing in
> motion.
>
> Rereading it, it's said that their internal (?) backbone went down so pulling
> the routes was arguably the right thing to do. Or at least not flat out
> wrong. Taking out their nameserver subnets was clearly a problem though,
> though a fix is probably tricky since you clearly want to take down errant
> nameservers too.
>
>
> Mike
>
>
>
>
>


Re: Better description of what happened

2021-10-05 Thread Michael Thomas


On 10/5/21 3:09 PM, Andy Brezinsky wrote:


It's a few years old, but Facebook has talked a little bit about their 
DNS infrastructure before.  Here's a little clip that talks about 
Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073


From their outage report, it sounds like their authoritative DNS 
servers withdraw their anycast announcements when they're unhealthy.  
The health check from those servers must have relied on something 
upstream.  Maybe they couldn't talk to Cartographer for a few minutes 
so they thought they might be isolated from the rest of the network 
and they decided to withdraw their routes instead of serving stale 
data.  Makes sense when a single node does it, not so much when the 
entire fleet thinks that they're out on their own.


A performance issue in Cartographer (or whatever manages this fleet 
these days) could have been the ticking time bomb that set the whole 
thing in motion.


Rereading it, it's said that their internal (?) backbone went down so 
pulling the routes was arguably the right thing to do. Or at least not 
flat out wrong. Taking out their nameserver subnets was clearly a 
problem though, though a fix is probably tricky since you clearly want 
to take down errant nameservers too.



Mike









Re: Better description of what happened

2021-10-05 Thread Andy Brezinsky
It's a few years old, but Facebook has talked a little bit about their 
DNS infrastructure before.  Here's a little clip that talks about 
Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073


From their outage report, it sounds like their authoritative DNS 
servers withdraw their anycast announcements when they're unhealthy.  
The health check from those servers must have relied on something 
upstream.  Maybe they couldn't talk to Cartographer for a few minutes so 
they thought they might be isolated from the rest of the network and 
they decided to withdraw their routes instead of serving stale data.  
Makes sense when a single node does it, not so much when the entire 
fleet thinks that they're out on their own.


A performance issue in Cartographer (or whatever manages this fleet 
these days) could have been the ticking time bomb that set the whole 
thing in motion.
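
A minimal sketch of that suspected failure mode (placeholder hostname, not
a real one): every node runs the same check against the same upstream
controller, so a controller or backbone problem trips the whole fleet at
once.

    import socket

    CONTROLLER = ("cartographer.invalid", 443)  # placeholder, not a real host

    def node_should_withdraw(timeout: float = 2.0) -> bool:
        """Naive per-node check: withdraw the anycast route if the controller
        is unreachable.  Reasonable for one isolated node, catastrophic when
        the controller (or the path to it) is what actually broke."""
        try:
            socket.create_connection(CONTROLLER, timeout=timeout).close()
            return False
        except OSError:
            return True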


On 10/5/21 3:39 PM, Michael Thomas wrote:


This bit posted by Randy might get lost in the other thread, but it 
appears that their DNS withdraws BGP routes for prefixes that they 
can't reach or are flaky it seems. Apparently that goes for the 
prefixes that the name servers are on too. This caused internal 
outages too as it seems they use their front facing DNS just like 
everybody else.


Sounds like they might consider having at least one split horizon 
server internally. Lots of fodder here.


Mike

On 10/5/21 11:11 AM, Randy Monroe wrote:
Updated: 
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/


On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas  wrote:



On 10/5/21 12:17 AM, Carsten Bormann wrote:
> On 5. Oct 2021, at 07:42, William Herrin <b...@herrin.us> wrote:
>> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas <m...@mtcc.com> wrote:
>>> They have a monkey patch subsystem. Lol.
>> Yes, actually, they do. They use Chef extensively to configure
>> operating systems. Chef is written in Ruby. Ruby has something called
>> Monkey Patches.
> While Ruby indeed has a chain-saw (read: powerful, dangerous,
> still the tool of choice in certain cases) in its toolkit that is
> generally called “monkey-patching”, I think Michael was actually
> thinking about the “chaos monkey”,
> https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
> https://netflix.github.io/chaosmonkey/

No, chaos monkey is a purposeful thing to induce corner case errors so
they can be fixed. The earlier outage involved a config sanitizer that
screwed up and then pushed it out. I can't get my head around why
anybody thought that was a good idea vs rejecting it and making
somebody fix the config.

Mike




--

Randy Monroe

Network Engineering

Uber 








Re: Better description of what happened

2021-10-05 Thread scott


On 10/5/21 8:39 PM, Michael Thomas wrote:


This bit posted by Randy might get lost in the other thread, but it 
appears that their DNS withdraws BGP routes for prefixes that they 
can't reach or are flaky it seems. Apparently that goes for the 
prefixes that the name servers are on too. This caused internal 
outages too as it seems they use their front facing DNS just like 
everybody else.


Sounds like they might consider having at least one split horizon 
server internally. Lots of fodder here.







Move fast; break things? :)


scott


Better description of what happened

2021-10-05 Thread Michael Thomas
This bit posted by Randy might get lost in the other thread, but it 
appears that their DNS withdraws BGP routes for prefixes that they can't 
reach or that are flaky. Apparently that goes for the prefixes that 
the name servers are on too. This caused internal outages as well, since 
it seems they use their front-facing DNS just like everybody else.


Sounds like they might consider having at least one split horizon server 
internally. Lots of fodder here.


Mike

On 10/5/21 11:11 AM, Randy Monroe wrote:
Updated: 
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ 



On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas  wrote:



On 10/5/21 12:17 AM, Carsten Bormann wrote:
> On 5. Oct 2021, at 07:42, William Herrin <b...@herrin.us> wrote:
>> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas <m...@mtcc.com> wrote:
>>> They have a monkey patch subsystem. Lol.
>> Yes, actually, they do. They use Chef extensively to configure
>> operating systems. Chef is written in Ruby. Ruby has something called
>> Monkey Patches.
> While Ruby indeed has a chain-saw (read: powerful, dangerous,
> still the tool of choice in certain cases) in its toolkit that is
> generally called “monkey-patching”, I think Michael was actually
> thinking about the “chaos monkey”,
> https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey

> https://netflix.github.io/chaosmonkey/


No, chaos monkey is a purposeful thing to induce corner case errors so
they can be fixed. The earlier outage involved a config sanitizer that
screwed up and then pushed it out. I can't get my head around why
anybody thought that was a good idea vs rejecting it and making
somebody fix the config.

Mike




--

Randy Monroe

Network Engineering

Uber