Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/6/21 06:51, Hank Nussbacher wrote:



- "During one of these routine maintenance jobs, a command was issued 
with the intention to assess the availability of global backbone 
capacity, which unintentionally took down all the connections in our 
backbone network"


Can anyone guess as to what command FB issued that would cause them to 
withdraw all those prefixes?


Hard to say, as it seems the command itself was innocent enough, perhaps 
running a batch of sub-commands to check port status, bandwidth 
utilization, MPLS-TE values, etc. However, it sounds like an unforeseen 
bug in the command ran other things, or the cascade of how the 
sub-commands were run caused unforeseen problems.


We shall guess this one forever, as I doubt Facebook will go into that 
much detail.


What I can tell you is that all the major content providers spend a lot 
of time, money and effort in automating both capacity planning, as well 
as capacity auditing. It's a bit more complex for them, because their 
variables aren't just links and utilization, but also locations, fibre 
availability, fibre pricing, capacity lease pricing, the presence of 
carrier-neutral data centres, the presence of exchange points, current 
vendor equipment models and pricing, projection of future fibre and 
capacity pricing, etc.


It's a totally different world from normal ISP-land.




- "it was not possible to access our data centers through our normal 
means because their networks were down, and second, the total loss of 
DNS broke many of the internal tools we’d normally use to investigate 
and resolve outages like this.  Our primary and out-of-band network 
access was down..."


Does this mean that FB acknowledges that the loss of DNS broke their 
OOB access?


I need to put my thinking cap on, but I'm not sure whether running DNS in 
the IGP would have been better in this instance.


We run our Anycast DNS network in our IGP, mainly to always guarantee 
latency-based routing, but also to ensure that the failure of a 
higher-level protocol like BGP does not disconnect internal access that 
is needed for troubleshooting and repair. Given that the IGP is a much 
lower-level routing protocol, it's more likely (though not guaranteed) 
that it would stay up even if BGP broke.
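
For what it's worth, the glue for that sort of setup can be very small. A 
minimal sketch, assuming a Linux resolver host whose anycast /32 sits on a 
loopback and reaches the IGP as a connected/redistributed route (the 
address, interface and intervals below are illustrative, not our actual 
tooling):

#!/usr/bin/env python3
# Minimal sketch: keep an anycast DNS service address on the loopback only
# while the local resolver answers, so the IGP (which carries the connected
# /32) stops attracting queries to a broken node. Address, interface and
# timings are illustrative assumptions, not anyone's real deployment.
import socket
import subprocess
import time

ANYCAST_ADDR = "192.0.2.53"      # hypothetical anycast service address
LOOPBACK_DEV = "lo"
CHECK_INTERVAL = 5               # seconds between health checks

def resolver_healthy() -> bool:
    # Crude liveness probe: can we open a TCP connection to the local
    # resolver on port 53? A real deployment would send an actual query
    # and validate the answer.
    try:
        with socket.create_connection(("127.0.0.1", 53), timeout=2):
            return True
    except OSError:
        return False

def set_anycast(present: bool) -> None:
    # Add or remove the anycast /32 on the loopback via iproute2.
    action = "add" if present else "del"
    subprocess.run(
        ["ip", "addr", action, f"{ANYCAST_ADDR}/32", "dev", LOOPBACK_DEV],
        check=False,  # re-adding an existing address returns non-zero; ignore
    )

if __name__ == "__main__":
    while True:
        set_anycast(resolver_healthy())
        time.sleep(CHECK_INTERVAL)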


In the past, we have, indeed, had BGP issues during which we were able to 
maintain DNS access internally because the IGP was unaffected.


The final statement from that report is interesting:

    "From here on out, our job is to strengthen our testing,
    drills, and overall resilience to make sure events like this
    happen as rarely as possible."

... which, in my rudimentary translation, means that:

    "There are no guarantees that our automation software will not
    poop cows again, but we hope that when that does happen, we
    shall be able to send our guys out to site much more quickly."

... which, to be fair, is totally understandable. These automation 
tools, especially in large networks such as BigContent's, become 
significantly more fragile the more complex they get, and the more batch 
tasks they need to perform on various parts of a network of this size 
and scope. It's a pity these automation tools are all homegrown, and 
can't be bought "pre-packaged and pre-approved to never fail" from the IT 
software store down the road. But it's the only way for networks of this 
scale to operate, and the risk they will always sit with for being that large.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Hank Nussbacher

On 05/10/2021 21:11, Randy Monroe via NANOG wrote:
Updated: 
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ 


Let's try to break down this "engineering" blog posting:

- "During one of these routine maintenance jobs, a command was issued 
with the intention to assess the availability of global backbone 
capacity, which unintentionally took down all the connections in our 
backbone network"


Can anyone guess as to what command FB issued that would cause them to 
withdraw all those prefixes?


- "it was not possible to access our data centers through our normal 
means because their networks were down, and second, the total loss of 
DNS broke many of the internal tools we’d normally use to investigate 
and resolve outages like this.  Our primary and out-of-band network 
access was down..."


Does this mean that FB acknowledges that the loss of DNS broke their OOB 
access?


-Hank


Re: Facebook post-mortems...

2021-10-05 Thread Masataka Ohta

Randy Monroe via NANOG wrote:


Updated:
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/


So, what was lost was internal connectivity between data centers.

That Facebook uses a very short expiration period for its zone
data is a separate issue.

As long as name servers with expired zone data won't serve
requests from outside of Facebook, whether the BGP routes to the
name servers are announced or not is unimportant.

Masataka Ohta


Re: Better description of what happened

2021-10-05 Thread Hugo Slabbert
Had some chats with other folks:
Arguably you could change the nameserver isolation check failure action to
be "depref your exports" rather than "yank it all".  Basically, set up a
tiered setup so the boxes passing those additional health checks and that
should have correct entries would be your primary destination and failing
nodes shouldn't receive query traffic since they're depref'd in your
internal routing.  But in case all nodes fail that check simultaneously,
those nodes failing the isolation check would attract traffic again as no
better paths remain.  Better to serve stale data than none at all; CAP
theorem trade-offs at work?
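
A toy sketch of that decision logic (the check names and preference values
are made up for illustration, not Facebook's mechanism): map health-check
results to an export preference instead of a binary announce/withdraw, so a
node failing the isolation check still exports its anycast prefix, just at
the worst preference.

# Hypothetical "depref instead of withdraw" policy for anycast DNS nodes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HealthReport:
    serving_dns: bool         # daemon answering queries at all
    data_fresh: bool          # zone/config data recent enough
    backbone_reachable: bool  # the "isolation check"

def export_preference(health: HealthReport) -> Optional[int]:
    # Return a local-pref-style value for the anycast export, or None only
    # when the node cannot answer queries at all.
    if not health.serving_dns:
        return None           # truly dead: withdrawing is the only honest option
    if health.backbone_reachable and health.data_fresh:
        return 200            # healthy: preferred destination
    if health.backbone_reachable:
        return 100            # reachable but possibly stale: usable backup
    # Fails the isolation check: keep exporting, just at the worst preference.
    # If every node ends up here at once, these paths are all that remain,
    # and serving possibly-stale answers beats serving nothing.
    return 50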

-- 
Hugo Slabbert


On Tue, Oct 5, 2021 at 3:22 PM Michael Thomas  wrote:

>
> On 10/5/21 3:09 PM, Andy Brezinsky wrote:
>
> It's a few years old, but Facebook has talked a little bit about their DNS
> infrastructure before.  Here's a little clip that talks about Cartographer:
> https://youtu.be/bxhYNfFeVF4?t=2073
>
> From their outage report, it sounds like their authoritative DNS servers
> withdraw their anycast announcements when they're unhealthy.  The health
> check from those servers must have relied on something upstream.  Maybe
> they couldn't talk to Cartographer for a few minutes so they thought they
> might be isolated from the rest of the network and they decided to withdraw
> their routes instead of serving stale data.  Makes sense when a single node
> does it, not so much when the entire fleet thinks that they're out on their
> own.
>
> A performance issue in Cartographer (or whatever manages this fleet these
> days) could have been the ticking time bomb that set the whole thing in
> motion.
>
> Rereading it is said that their internal (?) backbone went down so pulling
> the routes was arguably the right thing to do. Or at least not flat out
> wrong. Taking out their nameserver subnets was clearly a problem though,
> though a fix is probably tricky since you clearly want to take down errant
> nameservers too.
>
>
> Mike
>
>
>
>
>


Re: Better description of what happened

2021-10-05 Thread Michael Thomas


On 10/5/21 3:09 PM, Andy Brezinsky wrote:


It's a few years old, but Facebook has talked a little bit about their 
DNS infrastructure before.  Here's a little clip that talks about 
Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073


From their outage report, it sounds like their authoritative DNS 
servers withdraw their anycast announcements when they're unhealthy.  
The health check from those servers must have relied on something 
upstream.  Maybe they couldn't talk to Cartographer for a few minutes 
so they thought they might be isolated from the rest of the network 
and they decided to withdraw their routes instead of serving stale 
data.  Makes sense when a single node does it, not so much when the 
entire fleet thinks that they're out on their own.


A performance issue in Cartographer (or whatever manages this fleet 
these days) could have been the ticking time bomb that set the whole 
thing in motion.


Rereading it is said that their internal (?) backbone went down so 
pulling the routes was arguably the right thing to do. Or at least not 
flat out wrong. Taking out their nameserver subnets was clearly a 
problem though, though a fix is probably tricky since you clearly want 
to take down errant nameservers too.



Mike









Re: Better description of what happened

2021-10-05 Thread Andy Brezinsky
It's a few years old, but Facebook has talked a little bit about their 
DNS infrastructure before.  Here's a little clip that talks about 
Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073


From their outage report, it sounds like their authoritative DNS 
servers withdraw their anycast announcements when they're unhealthy.  
The health check from those servers must have relied on something 
upstream.  Maybe they couldn't talk to Cartographer for a few minutes so 
they thought they might be isolated from the rest of the network and 
they decided to withdraw their routes instead of serving stale data.  
Makes sense when a single node does it, not so much when the entire 
fleet thinks that they're out on their own.


A performance issue in Cartographer (or whatever manages this fleet 
these days) could have been the ticking time bomb that set the whole 
thing in motion.


On 10/5/21 3:39 PM, Michael Thomas wrote:


This bit posted by Randy might get lost in the other thread, but it 
appears that their DNS withdraws BGP routes for prefixes that they 
can't reach or are flaky it seems. Apparently that goes for the 
prefixes that the name servers are on too. This caused internal 
outages too as it seems they use their front facing DNS just like 
everybody else.


Sounds like they might consider having at least one split horizon 
server internally. Lots of fodder here.


Mike

On 10/5/21 11:11 AM, Randy Monroe wrote:
Updated: 
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/


On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas  wrote:



On 10/5/21 12:17 AM, Carsten Bormann wrote:
> On 5. Oct 2021, at 07:42, William Herrin  wrote:
>> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>>> They have a monkey patch subsystem. Lol.
>> Yes, actually, they do. They use Chef extensively to configure
>> operating systems. Chef is written in Ruby. Ruby has something
called
>> Monkey Patches.
> While Ruby indeed has a chain-saw (read: powerful, dangerous,
still the tool of choice in certain cases) in its toolkit that is
generally called “monkey-patching”, I think Michael was actually
thinking about the “chaos monkey”,
> https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
> https://netflix.github.io/chaosmonkey/

No, chaos monkey is a purposeful thing to induce corner case
errors so
they can be fixed. The earlier outage involved a config sanitizer
that
screwed up and then pushed it out. I can't get my head around why
anybody thought that was a good idea vs rejecting it and making
somebody
fix the config.

Mike




--

Randy Monroe

Network Engineering

Uber 








Re: Better description of what happened

2021-10-05 Thread scott


On 10/5/21 8:39 PM, Michael Thomas wrote:


This bit posted by Randy might get lost in the other thread, but it 
appears that their DNS withdraws BGP routes for prefixes that they 
can't reach or are flaky it seems. Apparently that goes for the 
prefixes that the name servers are on too. This caused internal 
outages too as it seems they use their front facing DNS just like 
everybody else.


Sounds like they might consider having at least one split horizon 
server internally. Lots of fodder here.







Move fast; break things? :)


scott




















Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas
Actually for card readers, the offline verification nature of 
certificates is probably a nice property. But client certs pose all 
sorts of other problems like their scalability, ease of making changes 
(roles, etc), and other kinds of considerations that make you want to 
fetch more information online... which completely negates the advantages 
of offline verification. Just the CRL problem would probably sink you 
since when you fire an employee you want access to be cut off immediately.


The other thing that would scare me in general with expecting offline 
verification is that the *reason* it's being used (working offline) might get 
forgotten, and back come the online dependencies while nobody is looking.


BTW: you don't need to reach the trust anchor, though you almost 
certainly need to run OCSP or something like it if you have client certs.
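
To make the offline/online split concrete, a small sketch assuming the 
pyca/cryptography package and a hypothetical badge certificate on disk: the 
validity window can be checked with no network at all, but answering "has 
this cert been revoked since we fired the employee?" means reaching whatever 
OCSP responder the cert points at, and that is exactly the dependency that 
tends to creep back in.

# Sketch only: the cert path is hypothetical, and real badge systems do more.
from cryptography import x509
from cryptography.x509.oid import ExtensionOID, AuthorityInformationAccessOID
from datetime import datetime, timezone

with open("employee-badge.pem", "rb") as f:          # hypothetical cert file
    cert = x509.load_pem_x509_certificate(f.read())

# Offline part: the expiry check needs nothing but the cert and a clock.
now = datetime.now(timezone.utc).replace(tzinfo=None)
print("within validity window:",
      cert.not_valid_before <= now <= cert.not_valid_after)

# Online part: revocation. The cert only carries a *pointer* (AIA extension)
# to an OCSP responder; actually asking it requires working DNS and routing.
try:
    aia = cert.extensions.get_extension_for_oid(
        ExtensionOID.AUTHORITY_INFORMATION_ACCESS).value
    ocsp_urls = [d.access_location.value for d in aia
                 if d.access_method == AuthorityInformationAccessOID.OCSP]
    print("revocation status lives at:", ocsp_urls or "no OCSP URL in cert")
except x509.ExtensionNotFound:
    print("no AIA extension; revocation would need a CRL fetch instead")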


Mike

On 10/5/21 1:34 PM, Matthew Petach wrote:



On Tue, Oct 5, 2021 at 8:57 AM Kain, Becki (.)  wrote:


Why would you ever have a card reader on your external-facing network,
if that was really the reason they couldn't get in to fix it?


Let's hypothesize for a moment.

Let's suppose you've decided that certificate-based
authentication is the cat's meow, and so you've got
dot1x authentication on every network port in your
corporate environment, all your users are authenticated
via certificates, all properly signed all the way up the
chain to the root trust anchor.

Life is good.

But then you have a bad network day.  Suddenly,
you can't talk to upstream registries/registrars,
you can't reach the trust anchor for your certificates,
and you discover that all the laptops plugged into
your network switches are failing to validate their
authenticity; sure, you're on the network, but you're
in a guest vlan, with no access.  Your user credentials
aren't able to be validated, so you're stuck with the
base level of access, which doesn't let you into the
OOB network.

Turns out your card readers were all counting on
dot1x authentication to get them into the right vlan
as well, and with the network buggered up, the
switches can't validate *their* certificates either,
so the door badge card readers just flash their
LEDs impotently when you wave your badge at
them.

Remember, one attribute of certificates is that they are
designated as valid for a particular domain, or set of
subdomains with a wildcard; that is, an authenticator needs
to know where the certificate is being presented to know if
it is valid within that scope or not.   You can do that scope
validation through several different mechanisms,
such as through a chain of trust to a certificate authority,
or through DNSSEC with DANE--but fundamentally,
all certificates have a scope within which they are valid,
and a means to identify in which scope they are being
used.  And whether your certificate chain of trust is
being determined by certificate authorities or DANE,
they all require that trust to be validated by something
other than the client and server alone--which generally
makes them dependent on some level of external
network connectivity being present in order to properly
function.   [yes, yes, we can have a side discussion about
having every authentication server self-sign certificates
as its own CA, and thus eliminate external network
connectivity dependencies--but that's an administrative
nightmare that I don't think any large organization would
sign up for.]

So, all of the client certificates and authorization servers
we're talking about exist on your internal network, but they
all counted on reachability to your infrastructure
servers in order to properly authenticate and grant
access to devices and people.  If your BGP update
made your infrastructure servers, such as DNS servers,
become unreachable, then suddenly you might well
find yourself locked out both physically and logically
from your own network.

Again, this is purely hypothetical, but it's one scenario
in which a routing-level "oops" could end up causing
physical-entry denial, as well as logical network access
level denial, without actually having those authentication
systems on external facing networks.

Certificate-based authentication is scalable and cool, but
it's really important to think about even generally "that'll
never happen" failure scenarios when deploying it into
critical systems.  It's always good to have the "break glass
in case of emergency" network that doesn't rely on dot1x,
that works without DNS, without NTP, without RADIUS,
or any other external system, with a binder with printouts
of the IP addresses of all your really critical servers and
routers in it which gets updated a few times a year, so that
when the SHTF, a person sitting at a laptop plugged into
that network with the binder next to them can get into the
emergency-only local account on each router to fix things.

And yes, you want every command that local emergency-only
user types into a router to be logged, because someone

Better description of what happened

2021-10-05 Thread Michael Thomas
This bit posted by Randy might get lost in the other thread, but it 
appears that their DNS withdraws BGP routes for prefixes that they can't 
reach or are flaky it seems. Apparently that goes for the prefixes that 
the name servers are on too. This caused internal outages too as it 
seems they use their front facing DNS just like everybody else.


Sounds like they might consider having at least one split horizon server 
internally. Lots of fodder here.


Mike

On 10/5/21 11:11 AM, Randy Monroe wrote:
Updated: 
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ 



On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas  wrote:



On 10/5/21 12:17 AM, Carsten Bormann wrote:
> On 5. Oct 2021, at 07:42, William Herrin  wrote:
>> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>>> They have a monkey patch subsystem. Lol.
>> Yes, actually, they do. They use Chef extensively to configure
>> operating systems. Chef is written in Ruby. Ruby has something
called
>> Monkey Patches.
> While Ruby indeed has a chain-saw (read: powerful, dangerous,
still the tool of choice in certain cases) in its toolkit that is
generally called “monkey-patching”, I think Michael was actually
thinking about the “chaos monkey”,
> https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey

> https://netflix.github.io/chaosmonkey/


No, chaos monkey is a purposeful thing to induce corner case
errors so
they can be fixed. The earlier outage involved a config sanitizer
that
screwed up and then pushed it out. I can't get my head around why
anybody thought that was a good idea vs rejecting it and making
somebody
fix the config.

Mike




--

Randy Monroe

Network Engineering

Uber 








Re: Facebook post-mortems...

2021-10-05 Thread Matthew Petach
On Tue, Oct 5, 2021 at 8:57 AM Kain, Becki (.)  wrote:

> Why would you ever have a card reader on your external-facing network, if that
> was really the reason they couldn't get in to fix it?
>

Let's hypothesize for a moment.

Let's suppose you've decided that certificate-based
authentication is the cat's meow, and so you've got
dot1x authentication on every network port in your
corporate environment, all your users are authenticated
via certificates, all properly signed all the way up the
chain to the root trust anchor.

Life is good.

But then you have a bad network day.  Suddenly,
you can't talk to upstream registries/registrars,
you can't reach the trust anchor for your certificates,
and you discover that all the laptops plugged into
your network switches are failing to validate their
authenticity; sure, you're on the network, but you're
in a guest vlan, with no access.  Your user credentials
aren't able to be validated, so you're stuck with the
base level of access, which doesn't let you into the
OOB network.

Turns out your card readers were all counting on
dot1x authentication to get them into the right vlan
as well, and with the network buggered up, the
switches can't validate *their* certificates either,
so the door badge card readers just flash their
LEDs impotently when you wave your badge at
them.

Remember, one attribute of certificates is that they are
designated as valid for a particular domain, or set of
subdomains with a wildcard; that is, an authenticator needs
to know where the certificate is being presented to know if
it is valid within that scope or not.   You can do that scope
validation through several different mechanisms,
such as through a chain of trust to a certificate authority,
or through DNSSEC with DANE--but fundamentally,
all certificates have a scope within which they are valid,
and a means to identify in which scope they are being
used.  And whether your certificate chain of trust is
being determined by certificate authorities or DANE,
they all require that trust to be validated by something
other than the client and server alone--which generally
makes them dependent on some level of external
network connectivity being present in order to properly
function.   [yes, yes, we can have a side discussion about
having every authentication server self-sign certificates
as its own CA, and thus eliminate external network
connectivity dependencies--but that's an administrative
nightmare that I don't think any large organization would
sign up for.]

So, all of the client certificates and authorization servers
we're talking about exist on your internal network, but they
all counted on reachability to your infrastructure
servers in order to properly authenticate and grant
access to devices and people.  If your BGP update
made your infrastructure servers, such as DNS servers,
become unreachable, then suddenly you might well
find yourself locked out both physically and logically
from your own network.

Again, this is purely hypothetical, but it's one scenario
in which a routing-level "oops" could end up causing
physical-entry denial, as well as logical network access
level denial, without actually having those authentication
systems on external facing networks.

Certificate-based authentication is scalable and cool, but
it's really important to think about even generally "that'll
never happen" failure scenarios when deploying it into
critical systems.  It's always good to have the "break glass
in case of emergency" network that doesn't rely on dot1x,
that works without DNS, without NTP, without RADIUS,
or any other external system, with a binder with printouts
of the IP addresses of all your really critical servers and
routers in it which gets updated a few times a year, so that
when the SHTF, a person sitting at a laptop plugged into
that network with the binder next to them can get into the
emergency-only local account on each router to fix things.

And yes, you want every command that local emergency-only
user types into a router to be logged, because someone
wanting to create mischief in your network is going to aim
for that account access if they can get it; so watch it like a
hawk, and the only time it had better be accessed and used
is when the big red panic button has already been hit, and
the executives are huddled around speakerphones wanting
to know just how fast you can get things working again.  ^_^;

I know nothing of the incident in question.  But sitting at home,
hypothesizing about ways in which things could go wrong, this
is one of the reasons why I still configure static emergency
accounts on network devices, even with centrally administered
account systems, and why there's always a set of "no dot1x"
ports that work to get into the OOB/management network even
when everything else has gone toes-up.   :)

So--that's one way in which an outage like this could have
locked people out of buildings.   ^_^;

Thanks!

Matt
[ready for the deluge of people pointing out I've 

Re: Facebook post-mortems...

2021-10-05 Thread Randy Monroe via NANOG
Updated:
https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas  wrote:

>
> On 10/5/21 12:17 AM, Carsten Bormann wrote:
> > On 5. Oct 2021, at 07:42, William Herrin  wrote:
> >> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
> >>> They have a monkey patch subsystem. Lol.
> >> Yes, actually, they do. They use Chef extensively to configure
> >> operating systems. Chef is written in Ruby. Ruby has something called
> >> Monkey Patches.
> > While Ruby indeed has a chain-saw (read: powerful, dangerous, still the
> tool of choice in certain cases) in its toolkit that is generally called
> “monkey-patching”, I think Michael was actually thinking about the “chaos
> monkey”,
> > https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
> > https://netflix.github.io/chaosmonkey/
>
> No, chaos monkey is a purposeful thing to induce corner case errors so
> they can be fixed. The earlier outage involved a config sanitizer that
> screwed up and then pushed it out. I can't get my head around why
> anybody thought that was a good idea vs rejecting it and making somebody
> fix the config.
>
> Mike
>
>
>

-- 

Randy Monroe

Network Engineering

[image: Uber] 


RE: HBO Max Contact

2021-10-05 Thread Travis Garrison
We have just run into this issue. We contacted Digital Elements and they let us 
know the issue is with the Wind Scribe VPN service. Wind Scribe will randomly 
select client IP addresses and use them as the host IP. Of course, when they do 
that, it gets our IP addresses blocked. We have never implemented any kind of 
filtering, etc. on our network; we prefer to leave it open for the customers to 
use. Does anyone have a good idea on how to prevent this from happening?

Thank you
Travis Garrison

-Original Message-
From: NANOG  On Behalf Of 
Lukas Tribus
Sent: Tuesday, September 7, 2021 12:27 PM
To: Kevin McCormick 
Cc: Nanog@nanog.org
Subject: Re: HBO Max Contact

Hello Kevin,


On Tue, 7 Sept 2021 at 16:57, Kevin McCormick  wrote:
>
> HBO did respond to contact form page on website.
>
>
> They referred us to Digital Elements.

It's IP geolocation done right, as per the white-paper [1]:

- distrusting WHOIS data
- distrusting ISP provided data
- not providing any check/demo page
- not providing any contact information for victims (end users or ISPs)
- amazing real time updates based on ... things:

> Digital Element utilizes patented web-spidering technology and 20+ 
> proprietary methods to triangulate the location, connection speed, and 
> many other characteristics associated with an IP address. By combining this 
> "inside-out" infrastructure analysis with "outside-in"
> user location feedback gleaned from a network of commercial partners 
> to improve and validate its response at a hyperlocal level 
> (city/postcode/ZIP+4), Digital Element can identify where the user 
> actually accesses the Internet down to the ISP’s end-point equipment.
> [...]
> "With such an extensive customer network performing more than 10 
> trillion IP lookups per month, the company is able to pick up IP 
> address reallocations the instant they occur, ensuring that data remains 
> highly current and accurate."


And just to reiterate one more time:

> By combining this "inside-out" infrastructure analysis with "outside-in"
> user location feedback gleaned from a network of commercial partners 
> to improve and validate its response at a hyperlocal level 
> (city/postcode/ZIP+4), Digital Element can identify where the user 
> actually accesses the Internet down to the ISP’s end-point equipment.

and again:

> the company is able to pick up IP address reallocations the instant 
> they occur


's all good, man!


[1] 
https://www.digitalelement.com/wp-content/uploads/2020/06/IPGEO-myths-facts.pdf


Re: Facebook post-mortems...

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 1:47 PM Miles Fidelman 
wrote:

> jcur...@istaff.org wrote:
>
> Fairly abstract - Facebook Engineering -
> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr
> 
>
> Also, Cloudflare’s take on the outage -
> https://blog.cloudflare.com/october-2021-facebook-outage/
>
> FYI,
> /John
>
> This may be a dumb question, but does this suggest that Facebook publishes
> rather short TTLs for their DNS records?  Otherwise, why would an internal
> failure make them unreachable so quickly?
>

Looks like 60 seconds:

$  dig +norec star-mini.c10r.facebook.com. @d.ns.c10r.facebook.com.

; <<>> DiG 9.10.6 <<>> +norec star-mini.c10r.facebook.com. @
d.ns.c10r.facebook.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25582
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;star-mini.c10r.facebook.com. IN A

;; ANSWER SECTION:
star-mini.c10r.facebook.com. 60 IN A 157.240.229.35

;; Query time: 42 msec
;; SERVER: 185.89.219.11#53(185.89.219.11)
;; WHEN: Tue Oct 05 14:01:06 EDT 2021
;; MSG SIZE  rcvd: 72



... and cue the "Bwahahhaha! If *I* ran Facebook I'd make the TTL be [2
sec|30sec|5min|1h|6h+3sec|1day|6months|maxint32]" threads

Choosing the TTL is a balancing act between stability, agility, load,
politeness, renewal latency, etc -- but I'm sure NANOG can boil it down to
"They did it wrong!..."

W


> Miles Fidelman
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   Yogi Berra
>
> Theory is when you know everything but nothing works.
> Practice is when everything works but no one knows why.
> In our lab, theory and practice are combined:
> nothing works and no one knows why.  ... unknown
>
>

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Facebook post-mortems... - Update!

2021-10-05 Thread Randy Bush
> Can someone explain to me, preferably in baby words, why so many providers
> view information like https://as37100.net/?bgp as secret/proprietary?

it shows we're important


Re: Disaster Recovery Process

2021-10-05 Thread Jamie Dahl
The NIMS/ICS system works very well for issues like this.   I utilize ICS 
regularly in my Search and Rescue world, and the last two companies I worked 
for utilize(d) it extensively during outages.  It allows folks from various 
different disciplines, roles and backgrounds to come in, provides a divide 
and conquer methodology for incidents, and can be scaled up/scaled out as 
necessary.  Phrases like "Incident Commander" and such have been around for a 
few decades and are concepts used regularly by FEMA, CalFire and other agencies 
during natural-disaster-style incidents.  But those of you who may be EMComm 
folks probably already knew that ;-). 



this was pounded out on my iPhone and i have fat fingers plus  two left thumbs 
:)

We have to remember that what we observe is not nature herself, but nature 
exposed to our method of questioning.


> On Oct 5, 2021, at 10:11, jim deleskie  wrote:
> 
> 
> World broke.  Crazy $$ per hour down time.  Doors open with a fire axe.  
> Glass breaks super easy too and is much less expensive than adding 15 min to 
> failure.
> 
> -jim
> 
>> On Tue., Oct. 5, 2021, 7:05 p.m. Jeff Shultz,  wrote:
>> 7. Make sure any access controlled rooms have physical keys that are 
>> available at need - and aren't secured by the same access control that they 
>> are to circumvent. . 
>> 8. Don't make your access control dependent on internet access - always have 
>> something on the local network  it can fall back to. 
>> 
>> That last thing, that apparently their access control failed, locking people 
>> out when either their outward facing DNS and/or BGP routes went goodbye, is 
>> perhaps the most astounding thing to me - making your access control into an 
>> IoT device without (apparently) a quick workaround for a failure in the "I" 
>> part.
>> 
>>> On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch  wrote:
>>> 
>>> 
>>> > On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
>>> > 
>>> > How come such a large operation does not have an out of bound access in 
>>> > case of emergencies ???
>>> > 
>>> > 
>>> 
>>> I mentioned to someone yesterday that most OOB systems _are_ the internet.  
>>> It doesn’t always seem like you need things like modems or dial-backup, or 
>>> access to these services, except when you do it’s critical/essential.
>>> 
>>> A few reminders for people:
>>> 
>>> 1) Program your co-workers into your cell phone
>>> 2) Print out an emergency contact sheet
>>> 3) Have a backup conference bridge/system that you test
>>>   - if zoom/webex/ms are down, where do you go?  Slack?  Google meet? Audio 
>>> bridge?
>>>   - No judgement, but do test the system!
>>> 4) Know how to access the office and who is closest.  
>>>   - What happens if they are in the hospital, sick or on vacation?
>>> 5) Complacency is dangerous
>>>   - When the tools “just work” you never imagine the tools won’t work.  I’m 
>>> sure the lessons learned will be long internally.  
>>>   - I hope they share them externally so others can learn.
>>> 6) No really, test the backup process.
>>> 
>>> 
>>> 
>>> * interlude *
>>> 
>>> Back at my time at 2914 - one reason we all had T1’s at home was largely so 
>>> we could get in to the network should something bad happen.  My home IP 
>>> space was in the router ACLs.  Much changed since those early days as this 
>>> network became more reliable.  We’ve seen large outages in the past 2 years 
>>> of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in 
>>> my memory).  
>>> 
>>> Plan for the outages and make sure you understand your playbook.  It may be 
>>> from snow day to all hands on deck.  Test it at least once, and ideally 
>>> with someone who will challenge a few assumptions (eg: that the cell 
>>> network will be up)
>>> 
>>> - Jared
>> 
>> 
>> -- 
>> Jeff Shultz
>> 
>> 
>> Like us on Social Media for News, Promotions, and other information!!
>> 
>>  
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> *** This message contains confidential information and is intended only for 
>> the individual named. If you are not the named addressee you should not 
>> disseminate, distribute or copy this e-mail. Please notify the sender 
>> immediately by e-mail if you have received this e-mail by mistake and delete 
>> this e-mail from your system. E-mail transmission cannot be guaranteed to be 
>> secure or error-free as information could be intercepted, corrupted, lost, 
>> destroyed, arrive late or incomplete, or contain viruses. The sender 
>> therefore does not accept liability for any errors or omissions in the 
>> contents of this message, which arise as a result of e-mail transmission. ***


Re: Disaster Recovery Process

2021-10-05 Thread jim deleskie
I don't see how posting in a DR process thread about thinking to use alternative
entry methods to locked doors is spreading false information.  If you do,
well, mail filters are simple.

-jim

On Tue., Oct. 5, 2021, 7:35 p.m. Niels Bakker, 
wrote:

> * deles...@gmail.com (jim deleskie) [Tue 05 Oct 2021, 19:13 CEST]:
> >World broke.  Crazy $$ per hour down time.  Doors open with a fire axe.
>
> Please stop spreading fake news.
>
> https://twitter.com/MikeIsaac/status/1445196576956162050
> |need to issue a correction: the team dispatched to the Facebook site
> |had issues getting in because of physical security but did not need to
> |use a saw/ grinder.
>
>
> -- Niels.
>


Re: Facebook post-mortems...

2021-10-05 Thread Miles Fidelman

jcur...@istaff.org wrote:
Fairly abstract - Facebook Engineering - 
https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D=%2Fnotes%2Fnote%2F&_rdr 



Also, Cloudflare’s take on the outage - 
https://blog.cloudflare.com/october-2021-facebook-outage/


FYI,
/John

This may be a dumb question, but does this suggest that Facebook 
publishes rather short TTLs for their DNS records?  Otherwise, why would 
an internal failure make them unreachable so quickly?


Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is.   Yogi Berra

Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice are combined:
nothing works and no one knows why.  ... unknown



Re: Disaster Recovery Process

2021-10-05 Thread Niels Bakker

* deles...@gmail.com (jim deleskie) [Tue 05 Oct 2021, 19:13 CEST]:

World broke.  Crazy $$ per hour down time.  Doors open with a fire axe.


Please stop spreading fake news.

https://twitter.com/MikeIsaac/status/1445196576956162050
|need to issue a correction: the team dispatched to the Facebook site
|had issues getting in because of physical security but did not need to
|use a saw/ grinder.


-- Niels.


Re: Disaster Recovery Process

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 1:07 PM Jeff Shultz  wrote:

> 7. Make sure any access controlled rooms have physical keys that are
> available at need - and aren't secured by the same access control that they
> are to circumvent. .
> 8. Don't make your access control dependent on internet access - always
> have something on the local network  it can fall back to.
>
> That last thing, that apparently their access control failed, locking
> people out when either their outward facing DNS and/or BGP routes went
> goodbye, is perhaps the most astounding thing to me - making your access
> control into an IoT device without (apparently) a quick workaround for a
> failure in the "I" part.
>

Keep in mind that the "some employees couldn't get into their offices" has
been filtered through the public press and seems to have grown into "OMG!
Lolz! No-one can fix the Facebook because no-one can reach the
turn-it-off-and-on-again-button".
Facebook has many office buildings, and needs to be able to add and revoke
employee access as people are hired and quit, etc. Just because the press
said that some random employees were unable to enter their office building
doesn't actually mean that: 1: this was a datacenter and they really needed
access or 2: no-one was able to enter or 3: this actually caused issues
with recovery.
Important buildings have security people who have controller-locked cards
and / or physical keys, offices != datacenter, etc.

I'm quite sure that this part of the story is a combination of some small
tidbit of information that a non-technical reporter was able to understand,
mixed with some "Hah. Look at those idiots, even I know to keep a spare key
under the doormat" schadenfreude.

W

>
> On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch  wrote:
>
>>
>>
>> > On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
>> >
>> > How come such a large operation does not have an out of bound access in
>> case of emergencies ???
>> >
>> >
>>
>> I mentioned to someone yesterday that most OOB systems _are_ the
>> internet.  It doesn’t always seem like you need things like modems or
>> dial-backup, or access to these services, except when you do it’s
>> critical/essential.
>>
>> A few reminders for people:
>>
>> 1) Program your co-workers into your cell phone
>> 2) Print out an emergency contact sheet
>> 3) Have a backup conference bridge/system that you test
>>   - if zoom/webex/ms are down, where do you go?  Slack?  Google meet?
>> Audio bridge?
>>   - No judgement, but do test the system!
>> 4) Know how to access the office and who is closest.
>>   - What happens if they are in the hospital, sick or on vacation?
>> 5) Complacency is dangerous
>>   - When the tools “just work” you never imagine the tools won’t work.
>> I’m sure the lessons learned will be long internally.
>>   - I hope they share them externally so others can learn.
>> 6) No really, test the backup process.
>>
>>
>>
>> * interlude *
>>
>> Back at my time at 2914 - one reason we all had T1’s at home was largely
>> so we could get in to the network should something bad happen.  My home IP
>> space was in the router ACLs.  Much changed since those early days as this
>> network became more reliable.  We’ve seen large outages in the past 2 years
>> of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in
>> my memory).
>>
>> Plan for the outages and make sure you understand your playbook.  It may
>> be from snow day to all hands on deck.  Test it at least once, and ideally
>> with someone who will challenge a few assumptions (eg: that the cell
>> network will be up)
>>
>> - Jared
>
>
>
> --
> Jeff Shultz
>
>
> Like us on Social Media for News, Promotions, and other information!!
>
> [image:
> https://www.instagram.com/sctc_sctc/]
> 
> 
> 
>
>
>
>
>
>
>
>  This message contains confidential information and is intended only
> for the individual named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. E-mail transmission cannot be
> guaranteed to be secure or error-free as information could be intercepted,
> corrupted, lost, destroyed, arrive late or incomplete, or contain viruses.
> The sender therefore does not accept liability for any errors or omissions
> in the contents of this message, which arise as a result of e-mail
> transmission. 
>


-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas



On 10/5/21 12:17 AM, Carsten Bormann wrote:

On 5. Oct 2021, at 07:42, William Herrin  wrote:

On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:

They have a monkey patch subsystem. Lol.

Yes, actually, they do. They use Chef extensively to configure
operating systems. Chef is written in Ruby. Ruby has something called
Monkey Patches.

While Ruby indeed has a chain-saw (read: powerful, dangerous, still the tool of 
choice in certain cases) in its toolkit that is generally called 
“monkey-patching”, I think Michael was actually thinking about the “chaos 
monkey”,
https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
https://netflix.github.io/chaosmonkey/


No, chaos monkey is a purposeful thing to induce corner case errors so 
they can be fixed. The earlier outage involved a config sanitizer that 
screwed up and then pushed it out. I can't get my head around why 
anybody thought that was a good idea vs rejecting it and making somebody 
fix the config.


Mike




Re: Facebook post-mortems...

2021-10-05 Thread Michael Thomas



On 10/4/21 10:42 PM, William Herrin wrote:

On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:

They have a monkey patch subsystem. Lol.

Yes, actually, they do. They use Chef extensively to configure
operating systems. Chef is written in Ruby. Ruby has something called
Monkey Patches. This is where at an arbitrary location in the code you
re-open an object defined elsewhere and change its methods.

Chef doesn't always do the right thing. You tell Chef to remove an RPM
and it does. Even if it has to remove half the operating system to
satisfy the dependencies. If you want it to do something reasonable,
say throw an error because you didn't actually tell it to remove half
the operating system, you have a choice: spin up a fork of chef with a
couple patches to the chef-rpm interaction or just monkey-patch it in
one of your chef recipes.


Just because a language allows monkey patching doesn't mean that you 
should use it. In that particular outage they said that they fix up 
errant looking config files rather than throw an error and make somebody 
fix it. That is an extremely bad practice and frankly looks like amateur 
hour to me.


Mike



BGP communities, was: Re: Facebook post-mortems... - Update!

2021-10-05 Thread Jay Hennigan

On 10/5/21 09:49, Warren Kumari wrote:

Can someone explain to me, preferably in baby words, why so many 
providers view information like https://as37100.net/?bgp 
 as secret/proprietary?
I've interacted with numerous providers who require an NDA or 
pinky-swear to get a list of their communities -- is this really just 1: 
security through obscurity, 2: an artifact of the culture of not 
sharing, 3: an attempt to seem cool by making you jump through hoops to 
prove your worthiness, 4: some weird 'mah competitors won't be able to 
figure out my secret sauce without knowing that 17 means Asia, or 5: 
something else?


Not sure of the rationale for keeping them secret, but at least one 
aggregated source of dozens of them exists and has been around for a 
long time. https://onestep.net/communities/


--
Jay Hennigan - j...@west.net
Network Engineering - CCIE #7880
503 897-8550 - WB6RDV


Re: Disaster Recovery Process

2021-10-05 Thread jim deleskie
World broke.  Crazy $$ per hour down time.  Doors open with a fire axe.
Glass breaks super easy too and is much less expensive than adding 15 min to
failure.

-jim

On Tue., Oct. 5, 2021, 7:05 p.m. Jeff Shultz, 
wrote:

> 7. Make sure any access controlled rooms have physical keys that are
> available at need - and aren't secured by the same access control that they
> are to circumvent. .
> 8. Don't make your access control dependent on internet access - always
> have something on the local network  it can fall back to.
>
> That last thing, that apparently their access control failed, locking
> people out when either their outward facing DNS and/or BGP routes went
> goodbye, is perhaps the most astounding thing to me - making your access
> control into an IoT device without (apparently) a quick workaround for a
> failure in the "I" part.
>
> On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch  wrote:
>
>>
>>
>> > On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
>> >
>> > How come such a large operation does not have an out of bound access in
>> case of emergencies ???
>> >
>> >
>>
>> I mentioned to someone yesterday that most OOB systems _are_ the
>> internet.  It doesn’t always seem like you need things like modems or
>> dial-backup, or access to these services, except when you do it’s
>> critical/essential.
>>
>> A few reminders for people:
>>
>> 1) Program your co-workers into your cell phone
>> 2) Print out an emergency contact sheet
>> 3) Have a backup conference bridge/system that you test
>>   - if zoom/webex/ms are down, where do you go?  Slack?  Google meet?
>> Audio bridge?
>>   - No judgement, but do test the system!
>> 4) Know how to access the office and who is closest.
>>   - What happens if they are in the hospital, sick or on vacation?
>> 5) Complacency is dangerous
>>   - When the tools “just work” you never imagine the tools won’t work.
>> I’m sure the lessons learned will be long internally.
>>   - I hope they share them externally so others can learn.
>> 6) No really, test the backup process.
>>
>>
>>
>> * interlude *
>>
>> Back at my time at 2914 - one reason we all had T1’s at home was largely
>> so we could get in to the network should something bad happen.  My home IP
>> space was in the router ACLs.  Much changed since those early days as this
>> network became more reliable.  We’ve seen large outages in the past 2 years
>> of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in
>> my memory).
>>
>> Plan for the outages and make sure you understand your playbook.  It may
>> be from snow day to all hands on deck.  Test it at least once, and ideally
>> with someone who will challenge a few assumptions (eg: that the cell
>> network will be up)
>>
>> - Jared
>
>
>
> --
> Jeff Shultz
>
>
> Like us on Social Media for News, Promotions, and other information!!
>
> [image:
> https://www.instagram.com/sctc_sctc/]
> 
> 
> 
>
>
>
>
>
>
>
>  This message contains confidential information and is intended only
> for the individual named. If you are not the named addressee you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately by e-mail if you have received this e-mail by mistake and
> delete this e-mail from your system. E-mail transmission cannot be
> guaranteed to be secure or error-free as information could be intercepted,
> corrupted, lost, destroyed, arrive late or incomplete, or contain viruses.
> The sender therefore does not accept liability for any errors or omissions
> in the contents of this message, which arise as a result of e-mail
> transmission. 
>


Re: Facebook post-mortems...

2021-10-05 Thread Niels Bakker
Ryan, thanks for sharing your data, it's unfortunate that it was 
seemingly misinterpreted by a few souls.



* ryan.lan...@gmail.com (Ryan Landry) [Tue 05 Oct 2021, 17:52 CEST]:
Niels, you are correct about my initial tweet, which I updated in 
later tweets to clarify with a hat tip to Will Hargrave as thanks 
for seeking more detail.




Re: Disaster Recovery Process

2021-10-05 Thread Jeff Shultz
7. Make sure any access controlled rooms have physical keys that are
available at need - and aren't secured by the same access control that they
are to circumvent. .
8. Don't make your access control dependent on internet access - always
have something on the local network  it can fall back to.

That last thing, that apparently their access control failed, locking
people out when either their outward facing DNS and/or BGP routes went
goodbye, is perhaps the most astounding thing to me - making your access
control into an IoT device without (apparently) a quick workaround for a
failure in the "I" part.

On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch  wrote:

>
>
> > On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
> >
> > How come such a large operation does not have an out of bound access in
> case of emergencies ???
> >
> >
>
> I mentioned to someone yesterday that most OOB systems _are_ the
> internet.  It doesn’t always seem like you need things like modems or
> dial-backup, or access to these services, except when you do it’s
> critical/essential.
>
> A few reminders for people:
>
> 1) Program your co-workers into your cell phone
> 2) Print out an emergency contact sheet
> 3) Have a backup conference bridge/system that you test
>   - if zoom/webex/ms are down, where do you go?  Slack?  Google meet?
> Audio bridge?
>   - No judgement, but do test the system!
> 4) Know how to access the office and who is closest.
>   - What happens if they are in the hospital, sick or on vacation?
> 5) Complacency is dangerous
>   - When the tools “just work” you never imagine the tools won’t work.
> I’m sure the lessons learned will be long internally.
>   - I hope they share them externally so others can learn.
> 6) No really, test the backup process.
>
>
>
> * interlude *
>
> Back at my time at 2914 - one reason we all had T1’s at home was largely
> so we could get in to the network should something bad happen.  My home IP
> space was in the router ACLs.  Much changed since those early days as this
> network became more reliable.  We’ve seen large outages in the past 2 years
> of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in
> my memory).
>
> Plan for the outages and make sure you understand your playbook.  It may
> be from snow day to all hands on deck.  Test it at least once, and ideally
> with someone who will challenge a few assumptions (eg: that the cell
> network will be up)
>
> - Jared



-- 
Jeff Shultz

-- 
Like us on Social Media for News, Promotions, and other information!!

   
      
      
      














_ This message 
contains confidential information and is intended only for the individual 
named. If you are not the named addressee you should not disseminate, 
distribute or copy this e-mail. Please notify the sender immediately by 
e-mail if you have received this e-mail by mistake and delete this e-mail 
from your system. E-mail transmission cannot be guaranteed to be secure or 
error-free as information could be intercepted, corrupted, lost, destroyed, 
arrive late or incomplete, or contain viruses. The sender therefore does 
not accept liability for any errors or omissions in the contents of this 
message, which arise as a result of e-mail transmission. _



Re: Facebook post-mortems...

2021-10-05 Thread Ryan Brooks


> On Oct 5, 2021, at 10:32 AM, Jean St-Laurent via NANOG  
> wrote:
> 
> If you have some DNS working, you can point it at a static “we are down and 
> we know it” page much sooner,

At the scale of Facebook that seems extremely difficult to pull off without most of 
their architecture online.  Imagine trying to terminate more than a billion sessions.

When they started to come back up and had their "We're sorry" page up, even 
their static PNG couldn't make it onto the wire.



Re: Facebook post-mortems... - Update!

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 9:56 AM Mark Tinka  wrote:

>
>
> On 10/5/21 15:40, Mark Tinka wrote:
>
> >
> > I don't disagree with you one bit. It's for that exact reason that we
> > built:
> >
> > https://as37100.net/
> >
> > ... not for us, but specifically for other random network operators
> > around the world whom we may never get to drink a crate of wine with.
>

Can someone explain to me, preferably in baby words, why so many providers
view information like https://as37100.net/?bgp as secret/proprietary?
I've interacted with numerous providers who require an NDA or pinky-swear
to get a list of their communities -- is this really just 1: security
through obscurity, 2: an artifact of the culture of not sharing, 3: an
attempt to seem cool by making you jump through hoops to prove your
worthiness, 4: some weird 'mah competitors won't be able to figure out my
secret sauce without knowing that 17 means Asia, or 5: something else?

Yes, some providers do publish these (usually on the website equivalent of
a locked filing cabinet stuck in a disused lavatory with a sign on the door
saying ‘Beware of the Leopard.”), and PeeringDB has definitely helped, but
I still don't understand many providers stance on this...

W




> >
> > I have to say that it has likely cut e-mails to our NOC as well as
> > overall pain in half, if not more.
>
> What I forgot to add, however, is that unlike Facebook, we aren't a
> major content provider. So we don't have a need to parallel our DNS
> resiliency with our service resiliency, in terms of 3rd party
> infrastructure. If our network were to melt, we'll already be getting it
> from our eyeballs.
>
> If we had content of note that was useful to, say, a handful-billion
> people around the world, we'd give some thought - however complex - to
> having critical services running on 3rd party infrastructure.
>
> Mark.
>


-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Facebook post-mortems...

2021-10-05 Thread Joe Maimon




Mark Tinka wrote:


So I'm not worried about DNS stability when split across multiple 
physical entities.


I'm talking about the actual services being hosted on a single network 
that goes bye-bye like what we saw yesterday.


All the DNS resolution means diddly, even if it tells us that DNS is 
not the issue.


Mark.


You could put up a temp page or two. Like, the internet is not down, we 
are just having a bad day. Bear with us for a bit. Go outside and enjoy 
nature for the next few hours.


But more importantly, internal infrastructure domains, containing router 
names, bootstraps, tools, utilities, physical access control, config 
repositories, network documentation, oob-network names (who remembers 
those?), oob-email, oob communications (messenger, conferences, voip), 
etc.


It doesn't even have to be globally registered: an external DNS server in the 
resolver list of all tech laptops, slaving the zone.
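
A quick sketch of that "slave the zone somewhere that survives" idea, 
assuming dnspython and a made-up internal zone and primary address; in 
practice you would run a proper secondary name server, but even a periodic 
AXFR dump onto tech laptops preserves the router names for when everything 
else is dark:

# Sketch: pull a copy of an internal infrastructure zone via AXFR so the
# names survive a total outage. Zone name and primary address are made up.
import dns.query
import dns.zone

zone = dns.zone.from_xfr(dns.query.xfr("198.51.100.10", "oob.corp.example"))
with open("oob.corp.example.zone", "w") as f:
    zone.to_file(f)                  # flat file you can grep at 3 a.m.
print(f"saved {len(zone.nodes)} names from oob.corp.example")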


Rapid response requires certain amenities, or, as we can see, you're 
talking about hours just getting started.


Also, the oob-network needs to be used regularly or it will be 
essentially unusable when actually needed, due to bit rot (the accumulation 
of unnoticed and unresolved issues) and lack of muscle memory.


It should be standard practice to deploy all new equipment from the 
oob-network servicing it. Install things how you want to be able to 
repair them.


Joe


RE: Facebook post-mortems...

2021-10-05 Thread Kain, Becki (.)
Why would you ever have a card reader on your external-facing network, if that was 
really the reason they couldn't get in to fix it?


-Original Message-
From: NANOG  On Behalf Of Patrick W. 
Gilmore
Sent: Monday, October 04, 2021 10:53 PM
To: North American Operators' Group 
Subject: Re: Facebook post-mortems...

WARNING: This message originated outside of Ford Motor Company. Use caution 
when opening attachments, clicking links, or responding.


Update about the October 4th outage

https://clicktime.symantec.com/3X9y1HrhXV7HkUEoMWnXtR67Vc?u=https%3A%2F%2Fengineering.fb.com%2F2021%2F10%2F04%2Fnetworking-traffic%2Foutage%2F

--
TTFN,
patrick

> On Oct 4, 2021, at 9:25 PM, Mel Beckman  wrote:
>
> The CF post mortem looks sensible, and a good summary of what we all saw from 
> the outside with BGP routes being withdrawn.
>
> Given the fragility of BGP, this could still end up being a malicious attack.
>
> -mel via cell
>
>> On Oct 4, 2021, at 6:19 PM, Jay Hennigan  wrote:
>>
>> On 10/4/21 17:58, jcur...@istaff.org wrote:
>>> Fairly abstract - Facebook Engineering - 
>>> https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D&path=%2Fnotes%2Fnote%2F&_rdr
>>
>> I believe that the above link refers to a previous outage. The duration of 
>> the outage doesn't match today's, the technical explanation doesn't align 
>> very well, and many of the comments reference earlier dates.
>>
>>> Also, Cloudflare’s take on the outage - 
>>> https://blog.cloudflare.com/october-2021-facebook-outage/
>>
>> This appears to indeed reference today's event.
>>
>> --
>> Jay Hennigan - j...@west.net
>> Network Engineering - CCIE #7880
>> 503 897-8550 - WB6RDV


Re: Facebook post-mortems...

2021-10-05 Thread Ryan Landry
Niels, you are correct about my initial tweet, which I updated in later
tweets to clarify with a hat tip to Will Hargrave as thanks for seeking
more detail.

Cheers,
Ryan

On Tue, Oct 5, 2021 at 08:24 Niels Bakker  wrote:

> * telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]:
> >Facebook stopped announcing the vast majority of their IP space to
> >the DFZ during this.
>
> People keep repeating this but I don't think it's true.
>
> It's probably based on this tweet:
> https://twitter.com/ryan505/status/1445118376339140618
>
> but that's an aggregate adding up prefix counts from many sessions.
> The total number of hosts covered by those announcements didn't vary
> by nearly as much, since to a significant extent it were more specifics
> (/24) of larger prefixes (e.g. /17) that disappeared, while those /17s
> stayed.
>
> (There were no covering prefixes for WhatsApp's NS addresses so those
> were completely unreachable from the DFZ.)
>
>
> -- Niels.
>


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
1. If you have some DNS working, you can point it at a static “we are down 
and we know it” page much sooner.

Good catch, and you’re right that it would have reduced the planetary impact: 
fewer calls to the help-desk and fewer device reboots. It would have given 
visibility into what’s happening.

 

It seems that, to be really resilient in today’s world, a business needs its NS 
in at least 2 different entities, like amazon.com is doing.

 

Jean

 

From: NANOG  On Behalf Of Matthew 
Kaufman
Sent: October 5, 2021 10:59 AM
To: Mark Tinka 
Cc: nanog@nanog.org
Subject: Re: Facebook post-mortems...

 

 

 

On Tue, Oct 5, 2021 at 5:44 AM Mark Tinka <mark@tinka.africa> wrote:



On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:

> Maybe withdrawing those routes to their NS could have been mitigated by 
> having NS in separate entities.

Well, doesn't really matter if you can resolve the A/AAAA/MX records, 
but you can't connect to the network that is hosting the services.

 

Disagree for two reasons:

 

1. If you have some DNS working, you can point it at a static “we are down and 
we know it” page much sooner.

 

2. If you have convinced the entire world to install tracking pixels on their 
web pages that all need your IP address, it is rude to the rest of the world’s 
DNS to not be able to always provide a prompt (and cacheable) response.



Re: Facebook post-mortems...

2021-10-05 Thread Bjørn Mork
Jean St-Laurent via NANOG  writes:

> Let's check how these big companies are spreading their NS's.
>
> $ dig +short facebook.com NS
> d.ns.facebook.com.
> b.ns.facebook.com.
> c.ns.facebook.com.
> a.ns.facebook.com.
>
> $ dig +short google.com NS
> ns1.google.com.
> ns4.google.com.
> ns2.google.com.
> ns3.google.com.
>
> $ dig +short apple.com NS
> a.ns.apple.com.
> b.ns.apple.com.
> c.ns.apple.com.
> d.ns.apple.com.
>
> $ dig +short amazon.com NS
> ns4.p31.dynect.net.
> ns3.p31.dynect.net.
> ns1.p31.dynect.net.
> ns2.p31.dynect.net.
> pdns6.ultradns.co.uk.
> pdns1.ultradns.net.
>
> $ dig +short netflix.com NS
> ns-1372.awsdns-43.org.
> ns-1984.awsdns-56.co.uk.
> ns-659.awsdns-18.net.
> ns-81.awsdns-10.com.

Just to state the obvious: Names are irrelevant. Addresses are not.

These names are just place holders for the glue in the parent zone
anyway.  If you look behind the names you'll find that Apple spread
their servers between two ASes. So they are not as vulnerable as Google
and Facebook.
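
(For anyone who wants to "look behind the names" themselves, here is a rough, 
illustrative sketch, not a polished tool: it maps a zone's NS set to origin 
ASNs by shelling out to dig and using Team Cymru's origin.asn.cymru.com TXT 
interface. IPv4-only, no error handling.)

# sketch: which origin ASNs host a zone's nameservers? (assumes `dig` in PATH)
import subprocess

def dig_short(name, rrtype):
    out = subprocess.run(["dig", "+short", name, rrtype],
                         capture_output=True, text=True).stdout
    return [l.strip().rstrip(".") for l in out.splitlines() if l.strip()]

def origin_asn(ipv4):
    # Team Cymru: TXT of 4.3.2.1.origin.asn.cymru.com looks like "ASN | prefix | ..."
    rev = ".".join(reversed(ipv4.split(".")))
    txt = dig_short(rev + ".origin.asn.cymru.com", "TXT")
    return txt[0].strip('"').split("|")[0].strip() if txt else "unknown"

for zone in ["facebook.com", "google.com", "apple.com", "amazon.com"]:
    asns = {origin_asn(ip) for ns in dig_short(zone, "NS")
                           for ip in dig_short(ns, "A")}
    print(zone, "NS origin ASNs:", sorted(asns))

A zone whose NS addresses all map back to a single ASN goes away with that 
network, which is exactly the point above.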


Bjørn


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka



On 10/5/21 16:59, Matthew Kaufman wrote:



Disagree for two reasons:

1. If you have some DNS working, you can point it at a static “we are 
down and we know it” page much sooner.


Isn't that what Twirra is for, nowadays :-)...




2. If you have convinced the entire world to install tracking pixels 
on their web pages that all need your IP address, it is rude to the 
rest of the world’s DNS to not be able to always provide a prompt (and 
cacheable) response.


Agreed, but I know many an exec that signs the capex cheques who may 
find "rude" not a noteworthy discussion point when we submit the budget.


Not saying I think being rude is cool, but there is a reason we are 
here, now, today.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 16:49, Joe Greco wrote:


Unrealistic user expectations are not the point.  Users can demand
whatever unrealistic claptrap they wish to.


The user's expectations, today, are always going to be unrealistic, 
especially when they are able to enjoy a half-decent service free-of-charge.


The bar has moved. Nothing we can do about it but adapt.




The point is that there are a lot of helpdesk staff at a lot of
organizations who are responsible for responding to these issues.
When Facebook or Microsoft or Amazon take a dump, you get a storm
of requests.  This is a storm of requests not just to one helpdesk,
but to MANY helpdesks, across a wide number of organizations, and
this means that you have thousands of people trying to investigate
what has happened.


We are in agreement.

And it's no coincidence that the Facebooks of the world rely almost 
100% on non-human contact to give their users support. So that leaves 
us, infrastructure, in the firing line to pick up the slack for a lack 
of warm-body access to BigContent.




It is very common for large companies to forget (or not care) that
their technical failures impact not just their users, but also
external support organizations.


Not just large companies, but I believe all companies... and worse, not 
at ground level where folk on lists like these tend to keep in touch, 
but higher up where the money decisions are made, which is where caring about 
your footprint on other Internet settlers whom you may never meet actually matters.


You and I can bash our heads till they come home, but if the folk that 
need to say "Yes" to $$$ needed to help external parties troubleshoot 
better don't get it, then perhaps starting a NOG or some such is our 
best bet.




I totally get your disdain and indifference towards end users in these
instances; for the average end user, yes, it indeed makes no difference
if DNS works or not.


On the contrary, I looove customers. I wasn't into them, say, 12 
years ago, but since I began to understand that users will respond to 
empathy and value, I fell in love with them. They drive my entire 
thought-process and decision-making.


This is why I keep saying, "Users don't care about how we build the 
Internet", and they shouldn't. And I support that.


BigContent get it, and for better or worse, they are the ones who've set 
the bar higher than what most network operators are happy with.


Infrastructure still doesn't get it, and we are seeing the effects of 
that play out around the world, with the recent SK Broadband/Netflix 
debacle being the latest barbershop gossip.




However, some of those end users do have a point of contact up the
chain.  This could be their ISP support, or a company helpdesk, and
most of these are tasked with taking an issue like this to some sort
of resolution.  What I'm talking about here is that it is easier to
debug and make a determination that there is an IP connectivity issue
when DNS works.  If DNS isn't working, then you get into a bunch of
stuff where you need to do things like determine if maybe it is some
sort of DNSSEC issue, or other arcane and obscure issues, which tends
to be beyond what front line helpdesk is capable of.


We are in agreement.



These issues often cost companies real time and money to figure out.
It is unlikely that Facebook is going to compensate them for this, so
this brings me back around to the point that it's preferable to have
DNS working when you have a BGP problem, because this is ultimately
easier for people to test and reach a reasonable determination that
the problem is on Facebook's side quickly and easily.


We are in agreement.

So let's see if Facebook can fix the scope of their DNS architecture, 
and whether others can learn from it. I know I have... even though we 
provide friendly secondary for a bunch of folk we are friends with, we 
haven't done the same for our own networks... all our stuff sits on just 
our network - granted in many different countries, but still, one AS.


It's been nagging at the back of my mind for yonks, but yesterday was 
the nudge I needed to get this organized; so off I go.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Matthew Kaufman
On Tue, Oct 5, 2021 at 5:44 AM Mark Tinka  wrote:

>
>
> On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:
>
> > Maybe withdrawing those routes to their NS could have been mitigated by
> having NS in separate entities.
>
> Well, doesn't really matter if you can resolve the A/AAAA/MX records,
> but you can't connect to the network that is hosting the services.


Disagree for two reasons:

1. If you have some DNS working, you can point it at a static “we are down
and we know it” page much sooner.

2. If you have convinced the entire world to install tracking pixels on
their web pages that all need your IP address, it is rude to the rest of
the world’s DNS to not be able to always provide a prompt (and cacheable)
response.

>


Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 03:40:39PM +0200, Mark Tinka wrote:
> Yes, total nightmare yesterday, but sure that 9,999 of the helpdesk 
> tickets had nothing to do with DNS. They likely all were - "Your 
> Internet is down, just fix it; we don't wanna know".

Unrealistic user expectations are not the point.  Users can demand
whatever unrealistic claptrap they wish to. 

The point is that there are a lot of helpdesk staff at a lot of
organizations who are responsible for responding to these issues.
When Facebook or Microsoft or Amazon take a dump, you get a storm
of requests.  This is a storm of requests not just to one helpdesk,
but to MANY helpdesks, across a wide number of organizations, and
this means that you have thousands of people trying to investigate
what has happened.

It is very common for large companies to forget (or not care) that
their technical failures impact not just their users, but also
external support organizations.

I totally get your disdain and indifference towards end users in these
instances; for the average end user, yes, it indeed makes no difference
if DNS works or not.

However, some of those end users do have a point of contact up the
chain.  This could be their ISP support, or a company helpdesk, and
most of these are tasked with taking an issue like this to some sort
of resolution.  What I'm talking about here is that it is easier to
debug and make a determination that there is an IP connectivity issue
when DNS works.  If DNS isn't working, then you get into a bunch of
stuff where you need to do things like determine if maybe it is some
sort of DNSSEC issue, or other arcane and obscure issues, which tends
to be beyond what front line helpdesk is capable of.

These issues often cost companies real time and money to figure out.
It is unlikely that Facebook is going to compensate them for this, so
this brings me back around to the point that it's preferable to have
DNS working when you have a BGP problem, because this is ultimately
easier for people to test and reach a reasonable determination that
the problem is on Facebook's side quickly and easily.

... JG
-- 
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"The strain of anti-intellectualism has been a constant thread winding its way
through our political and cultural life, nurtured by the false notion that
democracy means that 'my ignorance is just as good as your knowledge.'"-Asimov


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
Does anyone have info on whether this network, 69.171.240.0/20, was reachable 
during the outage?

 

Jean

 

From: NANOG  On Behalf Of Tom Beecher
Sent: October 5, 2021 10:30 AM
To: NANOG 
Subject: Re: Facebook post-mortems...

 

People keep repeating this but I don't think it's true.

 

My comment is solely sourced on my direct observations on my network, maybe 
30-45 minutes in. 

 

Everything except a few /24s disappeared from DFZ providers, but I still heard 
those prefixes from direct peerings. There was no disaggregation that I saw, 
just the big stuff gone. This was consistent over 5 continents from my 
viewpoints.

 

Others may have seen different things at different times. I do not run an 
eyeball so I had no need to continually monitor.  

 

On Tue, Oct 5, 2021 at 10:22 AM Niels Bakker <na...@bakker.net> wrote:

* telescop...@gmail.com   (Lou D) [Tue 05 Oct 
2021, 15:12 CEST]:
>Facebook stopped announcing the vast majority of their IP space to 
>the DFZ during this.

People keep repeating this but I don't think it's true.

It's probably based on this tweet: 
https://twitter.com/ryan505/status/1445118376339140618

but that's an aggregate adding up prefix counts from many sessions. 
The total number of hosts covered by those announcements didn't vary 
by nearly as much, since to a significant extent it were more specifics 
(/24) of larger prefixes (e.g. /17) that disappeared, while those /17s 
stayed.

(There were no covering prefixes for WhatsApp's NS addresses so those 
were completely unreachable from the DFZ.)


-- Niels.



Re: Disaster Recovery Process

2021-10-05 Thread Sean Donelan

On Wed, 6 Oct 2021, Karl Auer wrote:

I'd add one "soft" list item:

- in your emergency plan, have one or two people nominated who are VERY
high up in the organisation. Their lines need to be open to the
decisionmakers in the emergency team(s). Their job is to put the fear
of a vengeful god into any idiot who tries to interfere with the
recovery process by e.g. demanding status reports at ten-minute
intervals.


A good idea I learned was to designate a separate "executive" conference room 
and an "incident command" conference room.


Executives are only allowed in the executive conference room.  Executives 
are NOT allowed in any NOC/SOC/operations areas.  The executive conference 
room was well stocked with coffee, snacks, TVs, monitors, paper and 
easels.


An executive was anyone with a CxO, General Counsel, EVP, VP, etc. title. 
You know who you are :-)


One operations person (i.e. Director of Operations or designee for shift) 
would brief the executives when they wanted something, and take their 
suggestions back to the incident room.  The Incident Commander was 
God as far as the incident, with a pre-approved emergency budget 
authorization.


One compromise, we did allow one lawyer in the incident command conference 
room, but it was NOT the corporate General Counsel.


Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
>
> People keep repeating this but I don't think it's true.
>

My comment is solely sourced on my direct observations on my network, maybe
30-45 minutes in.

Everything except a few /24s disappeared from DFZ providers, but I still
heard those prefixes from direct peerings. There was no disaggregation that
I saw, just the big stuff gone. This was consistent over 5 continents from
my viewpoints.

Others may have seen different things at different times. I do not run an
eyeball so I had no need to continually monitor.

On Tue, Oct 5, 2021 at 10:22 AM Niels Bakker  wrote:

> * telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]:
> >Facebook stopped announcing the vast majority of their IP space to
> >the DFZ during this.
>
> People keep repeating this but I don't think it's true.
>
> It's probably based on this tweet:
> https://twitter.com/ryan505/status/1445118376339140618
>
> but that's an aggregate adding up prefix counts from many sessions.
> The total number of hosts covered by those announcements didn't vary
> by nearly as much, since to a significant extent it were more specifics
> (/24) of larger prefixes (e.g. /17) that disappeared, while those /17s
> stayed.
>
> (There were no covering prefixes for WhatsApp's NS addresses so those
> were completely unreachable from the DFZ.)
>
>
> -- Niels.
>


Re: Facebook post-mortems...

2021-10-05 Thread Niels Bakker

* telescop...@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]:
Facebook stopped announcing the vast majority of their IP space to 
the DFZ during this.


People keep repeating this but I don't think it's true.

It's probably based on this tweet: 
https://twitter.com/ryan505/status/1445118376339140618


but that's an aggregate adding up prefix counts from many sessions. 
The total number of hosts covered by those announcements didn't vary 
by nearly as much, since to a significant extent it was more specifics 
(/24) of larger prefixes (e.g. /17) that disappeared, while those /17s 
stayed.


(There were no covering prefixes for WhatsApp's NS addresses so those 
were completely unreachable from the DFZ.)
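
(To make the longest-prefix-match point concrete, a tiny stdlib-only Python 
sketch with a hypothetical announcement state; the prefixes are borrowed from 
this thread, but the before/after state is illustrative, not a claim about 
what was actually in anyone's table.)

# sketch: an address stays reachable via a covering /17 even after its /24
# more-specific is withdrawn; only addresses with no covering prefix vanish.
import ipaddress

still_announced = [ipaddress.ip_network("129.134.0.0/17")]        # hypothetical
withdrawn = [ipaddress.ip_network("129.134.30.0/24"),             # hypothetical
             ipaddress.ip_network("129.134.31.0/24")]

print("withdrawn more-specifics:", [str(n) for n in withdrawn])
for a in ("129.134.30.12", "157.240.196.53"):                     # made-up hosts
    addr = ipaddress.ip_address(a)
    covers = [str(n) for n in still_announced if addr in n]
    print(addr, "->", covers if covers else "no covering prefix left: unreachable from the DFZ")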



-- Niels.


Re: Disaster Recovery Process

2021-10-05 Thread Jared Mauch



> On Oct 5, 2021, at 10:05 AM, Karl Auer  wrote:
> 
> On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote:
>> A few reminders for people:
>> [excellent list snipped]
> 
> I'd add one "soft" list item:
> 
> - in your emergency plan, have one or two people nominated who are VERY
> high up in the organisation. Their lines need to be open to the
> decisionmakers in the emergency team(s). Their job is to put the fear
> of a vengeful god into any idiot who tries to interfere with the
> recovery process by e.g. demanding status reports at ten-minute
> intervals.

At $dayjob we split the technical updates on a different bridge from the 
business updates.

There is a dedicated team to coordinate the entire thing; incidents can be low 
severity (risk) or high severity (whole business impacting).

They provide the timeline to the next update and communicate what tasks are being 
done.  There’s even training on how to be an SME in the environment.  

Nothing is perfect but this runs very smooth at $dayjob

- Jared



Re: Disaster Recovery Process

2021-10-05 Thread Karl Auer
On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote:
> A few reminders for people:
> [excellent list snipped]

I'd add one "soft" list item:

- in your emergency plan, have one or two people nominated who are VERY
high up in the organisation. Their lines need to be open to the
decisionmakers in the emergency team(s). Their job is to put the fear
of a vengeful god into any idiot who tries to interfere with the
recovery process by e.g. demanding status reports at ten-minute
intervals.

Regards, K.

-- 
~~~
Karl Auer (ka...@biplane.com.au)
http://www.biplane.com.au/kauer

GPG fingerprint: 61A0 99A9 8823 3A75 871E 5D90 BADB B237 260C 9C58
Old fingerprint: 2561 E9EC D868 E73C 8AF1 49CF EE50 4B1D CCA1 5170





RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
As of now, their MX is hosted on 69.171.251.251

 

Was this network still announced yesterday in the DFZ during the outage? 

69.171.224.0/19 

69.171.240.0/20

 

Jean

 

From: Jean St-Laurent  
Sent: October 5, 2021 9:50 AM
To: 'Tom Beecher' 
Cc: 'Jeff Tantsura' ; 'William Herrin' 
; 'NANOG' 
Subject: RE: Facebook post-mortems...

 

I agree to resolve non-routable address doesn’t bring you a working service.

 

I thought a few networks were still reachable like their MX or some DRP 
networks.

 

Thanks for the update

Jean

 

From: Tom Beecher <beec...@beecher.cc>
Sent: October 5, 2021 8:33 AM
To: Jean St-Laurent <j...@ddostest.me>
Cc: Jeff Tantsura <jefftant.i...@gmail.com>; William Herrin <b...@herrin.us>; NANOG <nanog@nanog.org>
Subject: Re: Facebook post-mortems...

 

Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

 

Assuming they had such a thing in place , it would not have helped. 

 

Facebook stopped announcing the vast majority of their IP space to the DFZ 
during this. So even they did have an offnet DNS server that could have 
provided answers to clients, those same clients probably wouldn't have been 
able to connect to the IPs returned anyways. 

 

If you are running your own auths like they are, you likely view your public 
network reachability as almost bulletproof and that it will never disappear. 
Which is probably true most of the time. Until yesterday happens and the 9's in 
your reliability percentage change to 7's. 

 

On Tue, Oct 5, 2021 at 8:10 AM Jean St-Laurent via NANOG <nanog@nanog.org> wrote:

Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

Let's check how these big companies are spreading their NS's.

$ dig +short facebook.com NS
d.ns.facebook.com.
b.ns.facebook.com.
c.ns.facebook.com.
a.ns.facebook.com.

$ dig +short google.com NS
ns1.google.com.
ns4.google.com.
ns2.google.com.
ns3.google.com.

$ dig +short apple.com NS
a.ns.apple.com.
b.ns.apple.com.
c.ns.apple.com.
d.ns.apple.com.

$ dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

$ dig +short netflix.com NS
ns-1372.awsdns-43.org.
ns-1984.awsdns-56.co.uk.
ns-659.awsdns-18.net.
ns-81.awsdns-10.com.

Amazon and Netflix seem to not keep their eggs in the same basket. From a 
first look, they seem more resilient than facebook.com, google.com and apple.com.

Jean

-Original Message-
From: NANOG <ddostest...@nanog.org> On Behalf Of Jeff Tantsura
Sent: October 5, 2021 2:18 AM
To: William Herrin <b...@herrin.us>
Cc: nanog@nanog.org
Subject: Re: Facebook post-mortems...

129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering 
all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 
15:40 UTC.

Cheers,
Jeff

> On Oct 4, 2021, at 22:45, William Herrin  wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure 
> operating systems. Chef is written in Ruby. Ruby has something called 
> Monkey Patches. This is where at an arbitrary location in the code you 
> re-open an object defined elsewhere and change its methods.
> 
> Chef doesn't always do the right thing. You tell Chef to remove an RPM 
> and it does. Even if it has to remove half the operating system to 
> satisfy the dependencies. If you want it to do something reasonable, 
> say throw an error because you didn't actually tell it to remove half 
> the operating system, you have a choice: spin up a fork of chef with a 
> couple patches to the chef-rpm interaction or just monkey-patch it in 
> one of your chef recipes.
> 
> Regards,
> Bill Herrin
> 
> --
> William Herrin
> b...@herrin.us  
> https://bill.herrin.us/



Re: Facebook post-mortems... - Update!

2021-10-05 Thread Mark Tinka




On 10/5/21 15:40, Mark Tinka wrote:



I don't disagree with you one bit. It's for that exact reason that we 
built:


    https://as37100.net/

... not for us, but specifically for other random network operators 
around the world whom we may never get to drink a crate of wine with.


I have to say that it has likely cut e-mails to our NOC as well as 
overall pain in half, if not more.


What I forgot to add, however, is that unlike Facebook, we aren't a 
major content provider. So we don't have a need to parallel our DNS 
resiliency with our service resiliency, in terms of 3rd party 
infrastructure. If our network were to melt, we'll already be getting it 
from our eyeballs.


If we had content of note that was useful to, say, a handful-billion 
people around the world, we'd give some thought - however complex - to 
having critical services running on 3rd party infrastructure.


Mark.


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
I agree that resolving a non-routable address doesn’t bring you a working service.

 

I thought a few networks were still reachable like their MX or some DRP 
networks.

 

Thanks for the update

Jean

 

From: Tom Beecher  
Sent: October 5, 2021 8:33 AM
To: Jean St-Laurent 
Cc: Jeff Tantsura ; William Herrin ; 
NANOG 
Subject: Re: Facebook post-mortems...

 

Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

 

Assuming they had such a thing in place , it would not have helped. 

 

Facebook stopped announcing the vast majority of their IP space to the DFZ 
during this. So even they did have an offnet DNS server that could have 
provided answers to clients, those same clients probably wouldn't have been 
able to connect to the IPs returned anyways. 

 

If you are running your own auths like they are, you likely view your public 
network reachability as almost bulletproof and that it will never disappear. 
Which is probably true most of the time. Until yesterday happens and the 9's in 
your reliability percentage change to 7's. 

 

On Tue, Oct 5, 2021 at 8:10 AM Jean St-Laurent via NANOG <nanog@nanog.org> wrote:

Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.

Let's check how these big companies are spreading their NS's.

$ dig +short facebook.com NS
d.ns.facebook.com.
b.ns.facebook.com.
c.ns.facebook.com.
a.ns.facebook.com.

$ dig +short google.com NS
ns1.google.com.
ns4.google.com.
ns2.google.com.
ns3.google.com.

$ dig +short apple.com NS
a.ns.apple.com.
b.ns.apple.com.
c.ns.apple.com.
d.ns.apple.com.

$ dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

$ dig +short netflix.com NS
ns-1372.awsdns-43.org.
ns-1984.awsdns-56.co.uk.
ns-659.awsdns-18.net.
ns-81.awsdns-10.com.

Amazon and Netflix seem to not keep their eggs in the same basket. From a 
first look, they seem more resilient than facebook.com, google.com and apple.com.

Jean

-Original Message-
From: NANOG <ddostest...@nanog.org> On Behalf Of Jeff Tantsura
Sent: October 5, 2021 2:18 AM
To: William Herrin <b...@herrin.us>
Cc: nanog@nanog.org
Subject: Re: Facebook post-mortems...

129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering 
all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 
15:40 UTC.

Cheers,
Jeff

> On Oct 4, 2021, at 22:45, William Herrin  wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure 
> operating systems. Chef is written in Ruby. Ruby has something called 
> Monkey Patches. This is where at an arbitrary location in the code you 
> re-open an object defined elsewhere and change its methods.
> 
> Chef doesn't always do the right thing. You tell Chef to remove an RPM 
> and it does. Even if it has to remove half the operating system to 
> satisfy the dependencies. If you want it to do something reasonable, 
> say throw an error because you didn't actually tell it to remove half 
> the operating system, you have a choice: spin up a fork of chef with a 
> couple patches to the chef-rpm interaction or just monkey-patch it in 
> one of your chef recipes.
> 
> Regards,
> Bill Herrin
> 
> --
> William Herrin
> b...@herrin.us  
> https://bill.herrin.us/



Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 15:04, Joe Greco wrote:



You don't think at least 10,000 helpdesk requests about Facebook being
down were sent yesterday?


That and Jane + Thando likely re-installing all their apps and 
iOS/Android on their phones, and rebooting them 300 times in the hopes 
that Facebook and WhatsApp would work.


Yes, total nightmare yesterday, but I'm sure that 9,999 of those helpdesk 
tickets had nothing to do with DNS. They likely all were - "Your 
Internet is down, just fix it; we don't wanna know".




There's something to be said for building these things to be resilient
in a manner that isn't just convenient internally, but also externally
to those people that network operators sometimes forget also support
their network issues indirectly.


I don't disagree with you one bit. It's for that exact reason that we built:

    https://as37100.net/

... not for us, but specifically for other random network operators 
around the world whom we may never get to drink a crate of wine with.


I have to say that it has likely cut e-mails to our NOC as well as 
overall pain in half, if not more.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 14:58, Jean St-Laurent wrote:

If your NS are in 2 separate entities, you could still resolve your 
MX/A/AAAA/NS.

Look how Amazon is doing it.

dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

They use dyn DNS from Oracle and ultradns. 2 very strong network of anycast DNS 
servers.

Amazon would have not been impacted like Facebook yesterday. Unless ultradns 
and Oracle have their DNS servers hosted in Amazon infra? I doubt that Oracle 
has dns hosted in Amazon, but it's possible.

Probably the management overhead to use 2 different entities for DNS is not 
financially viable?


So I'm not worried about DNS stability when split across multiple 
physical entities.


I'm talking about the actual services being hosted on a single network 
that goes bye-bye like what we saw yesterday.


All the DNS resolution means diddly, even if it tells us that DNS is not 
the issue.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 02:57:42PM +0200, Mark Tinka wrote:
> 
> 
> On 10/5/21 14:52, Joe Greco wrote:
> 
> >That's not quite true.  It still gives much better clue as to what is
> >going on; if a host resolves to an IP but isn't pingable/traceroutable,
> >that is something that many more techy people will understand than if
> >the domain is simply unresolvable.  Not everyone has the skill set and
> >knowledge of DNS to understand how to track down what nameservers
> >Facebook is supposed to have, and how to debug names not resolving.
> >There are lots of helpdesk people who are not expert in every topic.
> >
> >Having DNS doesn't magically get you service back, of course, but it
> >leaves a better story behind than simply vanishing from the network.
> 
> That's great for you and me who believe in and like troubleshooting.
> 
> Jane and Thando who just want their Instagram timeline feed couldn't 
> care less about DNS working but network access is down. To them, it's 
> broken, despite your state-of-the-art global DNS architecture.

You don't think at least 10,000 helpdesk requests about Facebook being
down were sent yesterday?

There's something to be said for building these things to be resilient
in a manner that isn't just convenient internally, but also externally
to those people that network operators sometimes forget also support
their network issues indirectly.

... JG
-- 
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"The strain of anti-intellectualism has been a constant thread winding its way
through our political and cultural life, nurtured by the false notion that
democracy means that 'my ignorance is just as good as your knowledge.'"-Asimov


Re: Facebook post-mortems...

2021-10-05 Thread Masataka Ohta

Carsten Bormann wrote:


While Ruby indeed has a chain-saw (read: powerful, dangerous, still
the tool of choice in certain cases) in its toolkit that is generally
called “monkey-patching”, I think Michael was actually thinking about
the “chaos monkey”, 
https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey 
https://netflix.github.io/chaosmonkey/


That was a Netflix invention, but see also 
https://en.wikipedia.org/wiki/Chaos_engineering#Facebook_Storm


It seems to me that so-called chaos engineering assumes a cosmic (that is,
orderly) Internet environment, though; in the good old days, we were aware
that the Internet itself is the source of chaos.

Masataka Ohta


RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
If your NS are in 2 separate entities, you could still resolve your 
MX/A/AAAA/NS.

Look how Amazon is doing it.

dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

They use Dyn DNS from Oracle and UltraDNS: 2 very strong networks of anycast DNS 
servers.

Amazon would not have been impacted like Facebook yesterday, unless UltraDNS 
and Oracle have their DNS servers hosted on Amazon infra. I doubt that Oracle 
has DNS hosted in Amazon, but it's possible.

Probably the management overhead to use 2 different entities for DNS is not 
financially viable?

Jean

-Original Message-
From: NANOG  On Behalf Of Mark Tinka
Sent: October 5, 2021 8:22 AM
To: nanog@nanog.org
Subject: Re: Facebook post-mortems...



On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:

> Maybe withdrawing those routes to their NS could have been mitigated by 
> having NS in separate entities.

Well, doesn't really matter if you can resolve the A/AAAA/MX records, but you 
can't connect to the network that is hosting the services.

At any rate, having 3rd party DNS hosting for your domain is always a good 
thing to have. But in reality, it only hits the spot if the service is also 
available on a 3rd party network, otherwise, you keep DNS up, but get no 
service.

Mark.




Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 14:52, Joe Greco wrote:


That's not quite true.  It still gives much better clue as to what is
going on; if a host resolves to an IP but isn't pingable/traceroutable,
that is something that many more techy people will understand than if
the domain is simply unresolvable.  Not everyone has the skill set and
knowledge of DNS to understand how to track down what nameservers
Facebook is supposed to have, and how to debug names not resolving.
There are lots of helpdesk people who are not expert in every topic.

Having DNS doesn't magically get you service back, of course, but it
leaves a better story behind than simply vanishing from the network.


That's great for you and me who believe in and like troubleshooting.

Jane and Thando who just want their Instagram timeline feed couldn't 
care less about DNS working but network access is down. To them, it's 
broken, despite your state-of-the-art global DNS architecture.


I'm also yet to find any DNS operator who makes deploying 3rd party 
resiliency to give other random network operators in the wild 
troubleshooting joy their #1 priority for doing so :-).


On the real though, I'm all for as much useful redundancy as we can get 
away with. But given just how much we rely on the web for basic life 
these days, we need to do better about making actual services as 
resilient as we can (and have) the DNS.


Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Lou D
Facebook stopped announcing the vast majority of their IP space to the DFZ
during this.

This is where I would like to learn more about the outage. Direct-peering
FB connections saw a drop of a number of networks (about a dozen), and one of
the networks covered their C and D nameservers, but the block for the A and B
nameservers remained advertised, just not responsive.
I imagine the dropped blocks could have prevented internal responses, but I am
surprised all of these issues would stem from that, from the perspective I have.

On Tue, Oct 5, 2021 at 8:48 AM Tom Beecher  wrote:

> Maybe withdrawing those routes to their NS could have been mitigated by
>> having NS in separate entities.
>>
>
> Assuming they had such a thing in place , it would not have helped.
>
> Facebook stopped announcing the vast majority of their IP space to the DFZ
> during this. So even they did have an offnet DNS server that could have
> provided answers to clients, those same clients probably wouldn't have been
> able to connect to the IPs returned anyways.
>
> If you are running your own auths like they are, you likely view your
> public network reachability as almost bulletproof and that it will never
> disappear. Which is probably true most of the time. Until yesterday happens
> and the 9's in your reliability percentage change to 7's.
>
> On Tue, Oct 5, 2021 at 8:10 AM Jean St-Laurent via NANOG 
> wrote:
>
>> Maybe withdrawing those routes to their NS could have been mitigated by
>> having NS in separate entities.
>>
>> Let's check how these big companies are spreading their NS's.
>>
>> $ dig +short facebook.com NS
>> d.ns.facebook.com.
>> b.ns.facebook.com.
>> c.ns.facebook.com.
>> a.ns.facebook.com.
>>
>> $ dig +short google.com NS
>> ns1.google.com.
>> ns4.google.com.
>> ns2.google.com.
>> ns3.google.com.
>>
>> $ dig +short apple.com NS
>> a.ns.apple.com.
>> b.ns.apple.com.
>> c.ns.apple.com.
>> d.ns.apple.com.
>>
>> $ dig +short amazon.com NS
>> ns4.p31.dynect.net.
>> ns3.p31.dynect.net.
>> ns1.p31.dynect.net.
>> ns2.p31.dynect.net.
>> pdns6.ultradns.co.uk.
>> pdns1.ultradns.net.
>>
>> $ dig +short netflix.com NS
>> ns-1372.awsdns-43.org.
>> ns-1984.awsdns-56.co.uk.
>> ns-659.awsdns-18.net.
>> ns-81.awsdns-10.com.
>>
>> Amazon and Netflix seem to not keep their eggs in the same basket. From
>> a first look, they seem more resilient than facebook.com, google.com and
>> apple.com
>>
>> Jean
>>
>> -Original Message-
>> From: NANOG  On Behalf Of Jeff
>> Tantsura
>> Sent: October 5, 2021 2:18 AM
>> To: William Herrin 
>> Cc: nanog@nanog.org
>> Subject: Re: Facebook post-mortems...
>>
>> 129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes
>> covering all 4 nameservers (a-d) were withdrawn from all FB peering at
>> approximately 15:40 UTC.
>>
>> Cheers,
>> Jeff
>>
>> > On Oct 4, 2021, at 22:45, William Herrin  wrote:
>> >
>> > On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> >> They have a monkey patch subsystem. Lol.
>> >
>> > Yes, actually, they do. They use Chef extensively to configure
>> > operating systems. Chef is written in Ruby. Ruby has something called
>> > Monkey Patches. This is where at an arbitrary location in the code you
>> > re-open an object defined elsewhere and change its methods.
>> >
>> > Chef doesn't always do the right thing. You tell Chef to remove an RPM
>> > and it does. Even if it has to remove half the operating system to
>> > satisfy the dependencies. If you want it to do something reasonable,
>> > say throw an error because you didn't actually tell it to remove half
>> > the operating system, you have a choice: spin up a fork of chef with a
>> > couple patches to the chef-rpm interaction or just monkey-patch it in
>> > one of your chef recipes.
>> >
>> > Regards,
>> > Bill Herrin
>> >
>> > --
>> > William Herrin
>> > b...@herrin.us
>> > https://bill.herrin.us/
>>
>>


Re: Facebook post-mortems...

2021-10-05 Thread Hank Nussbacher

On 05/10/2021 13:17, Hauke Lampe wrote:

On 05.10.21 07:22, Hank Nussbacher wrote:


Thanks for the posting.  How come they couldn't access their routers via
their OOB access?


My speculative guess would be that OOB access to a few outbound-facing
routers per DC does not help much if a configuration error withdraws the
infrastructure prefixes down to the rack level while dedicated OOB to
each RSW would be prohibitive.

https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf



Thanks for sharing that article.  But OOB access involves exactly that - 
Out Of Band - meaning one doesn't depend on any infrastructure prefixes 
or DFZ announced prefixes.  OOB access is usually via a local ADSL or 
wireless modem connected to the BFR.  The article does not discuss OOB 
at all.


Regards,
Hank


Re: Facebook post-mortems...

2021-10-05 Thread Joe Greco
On Tue, Oct 05, 2021 at 02:22:09PM +0200, Mark Tinka wrote:
> 
> 
> On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:
> 
> >Maybe withdrawing those routes to their NS could have been mitigated by 
> >having NS in separate entities.
> 
> Well, doesn't really matter if you can resolve the A//MX records, 
> but you can't connect to the network that is hosting the services.
> 
> At any rate, having 3rd party DNS hosting for your domain is always a 
> good thing to have. But in reality, it only hits the spot if the service 
> is also available on a 3rd party network, otherwise, you keep DNS up, 
> but get no service.

That's not quite true.  It still gives much better clue as to what is
going on; if a host resolves to an IP but isn't pingable/traceroutable,
that is something that many more techy people will understand than if
the domain is simply unresolvable.  Not everyone has the skill set and
knowledge of DNS to understand how to track down what nameservers 
Facebook is supposed to have, and how to debug names not resolving.
There are lots of helpdesk people who are not expert in every topic.

Having DNS doesn't magically get you service back, of course, but it
leaves a better story behind than simply vanishing from the network.

... JG
-- 
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"The strain of anti-intellectualism has been a constant thread winding its way
through our political and cultural life, nurtured by the false notion that
democracy means that 'my ignorance is just as good as your knowledge.'"-Asimov


Disaster Recovery Process

2021-10-05 Thread Jared Mauch



> On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
> 
> How come such a large operation does not have an out of bound access in case 
> of emergencies ???
> 
> 

I mentioned to someone yesterday that most OOB systems _are_ the internet.  It 
doesn’t always seem like you need things like modems or dial-backup, or access 
to these services, except when you do it’s critical/essential.

A few reminders for people:

1) Program your co-workers into your cell phone
2) Print out an emergency contact sheet
3) Have a backup conference bridge/system that you test
  - if zoom/webex/ms are down, where do you go?  Slack?  Google meet? Audio 
bridge?
  - No judgement, but do test the system!
4) Know how to access the office and who is closest.  
  - What happens if they are in the hospital, sick or on vacation?
5) Complacency is dangerous
  - When the tools “just work” you never imagine the tools won’t work.  I’m 
sure the lessons learned will be long internally.  
  - I hope they share them externally so others can learn.
6) No really, test the backup process.



* interlude *

Back at my time at 2914 - one reason we all had T1’s at home was largely so we 
could get in to the network should something bad happen.  My home IP space was 
in the router ACLs.  Much changed since those early days as this network became 
more reliable.  We’ve seen large outages in the past 2 years of platforms, 
carriers, etc.. (the Aug 30th 2020 issue is still firmly in my memory).  

Plan for the outages and make sure you understand your playbook.  It may be 
from snow day to all hands on deck.  Test it at least once, and ideally with 
someone who will challenge a few assumptions (eg: that the cell network will be 
up)

- Jared

Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
>
> My speculative guess would be that OOB access to a few outbound-facing
> routers per DC does not help much if a configuration error withdraws the
> infrastructure prefixes down to the rack level while dedicated OOB to
> each RSW would be prohibitive.
>

If your OOB has any dependence on the inband side, it's not OOB.

It's not complicated to have a completely independent OOB infra , even at
scale.

On Tue, Oct 5, 2021 at 8:40 AM Hauke Lampe  wrote:

> On 05.10.21 07:22, Hank Nussbacher wrote:
>
> > Thanks for the posting.  How come they couldn't access their routers via
> > their OOB access?
>
> My speculative guess would be that OOB access to a few outbound-facing
> routers per DC does not help much if a configuration error withdraws the
> infrastructure prefixes down to the rack level while dedicated OOB to
> each RSW would be prohibitive.
>
>
> https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf
>


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 08:55, a...@nethead.de wrote:



Rumour is that when the FB route prefixes had been withdrawn their 
door authentication system stopped working and they could not get back 
into the building or server room :)


Assuming there is any truth to that, guess we can't cancel the hard 
lines yet :-).


#EverythingoIP

Mark.


RE: massive facebook outage presently

2021-10-05 Thread Jean St-Laurent via NANOG
I don't understand how this would have helped yesterday.

From what is public so far, they really painted themselves into a corner with no 
way out. A classic, but at epic scale.

They will learn and improve for sure, but I don't understand how "firmware 
defaulted to your own network" would have helped here.

Can you elaborate a bit please?

Jean

-Original Message-
From: NANOG  On Behalf Of Glenn Kelley
Sent: October 4, 2021 8:18 PM
To: nanog@nanog.org
Subject: Re: massive facebook outage presently

This is why you should have Routers that are Firmware Defaulted to your own 
network.  ALWAYS

Be it Calix or even a Mikrotik which you have setup with Netboot - having these 
default to your own setup is REALLY a game changer.

Without it - you are rolling trucks or at minimum taking heavy call volumes.


Glenn Kelley
Chief cook and Bottle Washing Watcher @ Connectivity.Engineer

On 10/4/2021 3:56 PM, Matt Hoppes wrote:
> Yes,
> We've seen that.
>
> On 10/4/21 4:33 PM, Eric Kuhnke wrote:
>> I am starting to see reports that in ISPs with very large numbers of 
>> residential users, customers are starting to press the factory-reset 
>> buttons on their home routers/modems/whatever, in an attempt to make 
>> Facebook work. This is resulting in much heavier than normal first 
>> tier support volumes. The longer it stays down the worse this is 
>> going to get.
>>
>>
>>
>> On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan  wrote:
>>
>> On 10/4/21 12:11, b...@theworld.com  wrote:
>>  >
>>  > Although I believe it's generally true that if a company appears
>>  > prominently in the news it's liable to be attacked I assume 
>> because
>>  > the miscreants sit around thinking "hmm, who shall we attack 
>> today oh
>>  > look at that shiny headline!" I'd hate to ascribe any altruistic
>>  > motivation w/o some evidence like even a credible twitter post 
>> (maybe
>>  > they posted that on FB? :-)
>>
>> I personally believe that the outage was caused by human error 
>> and not
>> something malicious. Time will tell.
>>
>> However, if you missed the 60 Minutes piece, it was a former 
>> employee
>> who spoke out with some rather powerful observations. I don't think
>> that
>> this type of worldwide outage was caused by an outside bad actor. 
>> It is
>> certainly within the realm of possibility that it was an inside job.
>>
>> In other news:
>>
>> https://twitter.com/disclosetv/status/1445100931947892736?s=20
>>
>> -- Jay Hennigan - j...@west.net 
>> Network Engineering - CCIE #7880
>> 503 897-8550 - WB6RDV
>>



Re: Facebook post-mortems...

2021-10-05 Thread Tom Beecher
>
> Maybe withdrawing those routes to their NS could have been mitigated by
> having NS in separate entities.
>

Assuming they had such a thing in place , it would not have helped.

Facebook stopped announcing the vast majority of their IP space to the DFZ
during this. So even they did have an offnet DNS server that could have
provided answers to clients, those same clients probably wouldn't have been
able to connect to the IPs returned anyways.

If you are running your own auths like they are, you likely view your
public network reachability as almost bulletproof and that it will never
disappear. Which is probably true most of the time. Until yesterday happens
and the 9's in your reliability percentage change to 7's.

On Tue, Oct 5, 2021 at 8:10 AM Jean St-Laurent via NANOG 
wrote:

> Maybe withdrawing those routes to their NS could have been mitigated by
> having NS in separate entities.
>
> Let's check how these big companies are spreading their NS's.
>
> $ dig +short facebook.com NS
> d.ns.facebook.com.
> b.ns.facebook.com.
> c.ns.facebook.com.
> a.ns.facebook.com.
>
> $ dig +short google.com NS
> ns1.google.com.
> ns4.google.com.
> ns2.google.com.
> ns3.google.com.
>
> $ dig +short apple.com NS
> a.ns.apple.com.
> b.ns.apple.com.
> c.ns.apple.com.
> d.ns.apple.com.
>
> $ dig +short amazon.com NS
> ns4.p31.dynect.net.
> ns3.p31.dynect.net.
> ns1.p31.dynect.net.
> ns2.p31.dynect.net.
> pdns6.ultradns.co.uk.
> pdns1.ultradns.net.
>
> $ dig +short netflix.com NS
> ns-1372.awsdns-43.org.
> ns-1984.awsdns-56.co.uk.
> ns-659.awsdns-18.net.
> ns-81.awsdns-10.com.
>
> Amazon and Netflix seem to not keep their eggs in the same basket. From a
> first look, they seem more resilient than facebook.com, google.com and
> apple.com
>
> Jean
>
> -Original Message-
> From: NANOG  On Behalf Of Jeff
> Tantsura
> Sent: October 5, 2021 2:18 AM
> To: William Herrin 
> Cc: nanog@nanog.org
> Subject: Re: Facebook post-mortems...
>
> 129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes
> covering all 4 nameservers (a-d) were withdrawn from all FB peering at
> approximately 15:40 UTC.
>
> Cheers,
> Jeff
>
> > On Oct 4, 2021, at 22:45, William Herrin  wrote:
> >
> > On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
> >> They have a monkey patch subsystem. Lol.
> >
> > Yes, actually, they do. They use Chef extensively to configure
> > operating systems. Chef is written in Ruby. Ruby has something called
> > Monkey Patches. This is where at an arbitrary location in the code you
> > re-open an object defined elsewhere and change its methods.
> >
> > Chef doesn't always do the right thing. You tell Chef to remove an RPM
> > and it does. Even if it has to remove half the operating system to
> > satisfy the dependencies. If you want it to do something reasonable,
> > say throw an error because you didn't actually tell it to remove half
> > the operating system, you have a choice: spin up a fork of chef with a
> > couple patches to the chef-rpm interaction or just monkey-patch it in
> > one of your chef recipes.
> >
> > Regards,
> > Bill Herrin
> >
> > --
> > William Herrin
> > b...@herrin.us
> > https://bill.herrin.us/
>
>


Re: Facebook post-mortems...

2021-10-05 Thread Mark Tinka




On 10/5/21 14:08, Jean St-Laurent via NANOG wrote:


Maybe withdrawing those routes to their NS could have been mitigated by having 
NS in separate entities.


Well, doesn't really matter if you can resolve the A/AAAA/MX records, 
but you can't connect to the network that is hosting the services.


At any rate, having 3rd party DNS hosting for your domain is always a 
good thing to have. But in reality, it only hits the spot if the service 
is also available on a 3rd party network, otherwise, you keep DNS up, 
but get no service.


Mark.



Re: Facebook post-mortems...

2021-10-05 Thread Hauke Lampe
On 05.10.21 07:22, Hank Nussbacher wrote:

> Thanks for the posting.  How come they couldn't access their routers via
> their OOB access?

My speculative guess would be that OOB access to a few outbound-facing
routers per DC does not help much if a configuration error withdraws the
infrastructure prefixes down to the rack level while dedicated OOB to
each RSW would be prohibitive.

https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf


Re: Facebook post-mortems...

2021-10-05 Thread av

On 10/5/21 1:22 PM, Hank Nussbacher wrote:
Thanks for the posting.  How come they couldn't access their routers via 
their OOB access?


Rumour is that when the FB route prefixes had been withdrawn their door 
authentication system stopped working and they could not get back into 
the building or server room :)





Re: Facebook post-mortems...

2021-10-05 Thread Callahan Warlick
I think that was from an outage in 2010:
https://engineering.fb.com/2010/09/23/uncategorized/more-details-on-today-s-outage/



On Mon, Oct 4, 2021 at 6:19 PM Jay Hennigan  wrote:

> On 10/4/21 17:58, jcur...@istaff.org wrote:
> > Fairly abstract - Facebook Engineering -
> >
> > https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D&path=%2Fnotes%2Fnote%2F&_rdr
>
> I believe that the above link refers to a previous outage. The duration
> of the outage doesn't match today's, the technical explanation doesn't
> align very well, and many of the comments reference earlier dates.
>
> > Also, Cloudflare’s take on the outage -
> > https://blog.cloudflare.com/october-2021-facebook-outage/
> > 
>
> This appears to indeed reference today's event.
>
> --
> Jay Hennigan - j...@west.net
> Network Engineering - CCIE #7880
> 503 897-8550 - WB6RDV
>


Re: Facebook post-mortems...

2021-10-05 Thread Justin Keller
Per other comments, the linked Facebook outage was from around 5/15/21

On Mon, Oct 4, 2021 at 9:08 PM Rubens Kuhl  wrote:
>
> The FB one seems to be from a previous event. Downtime doesn't match,
> visible flaw effects don't either.
>
>
> Rubens
>
>
> On Mon, Oct 4, 2021 at 9:59 PM  wrote:
> >
> > Fairly abstract - Facebook Engineering - 
> > https://m.facebook.com/nt/screen/?params=%7B%22note_id%22%3A10158791436142200%7D&path=%2Fnotes%2Fnote%2F&_rdr
> >
> > Also, Cloudflare’s take on the outage - 
> > https://blog.cloudflare.com/october-2021-facebook-outage/
> >
> > FYI,
> > /John
> >


Re: massive facebook outage presently

2021-10-05 Thread Glenn Kelley
This is why you should have Routers that are Firmware Defaulted to your 
own network.  ALWAYS


Be it Calix or even a Mikrotik which you have setup with Netboot - 
having these default to your own setup is REALLY a game changer.


Without it - you are rolling trucks or at minimum taking heavy call 
volumes.



Glenn Kelley
Chief cook and Bottle Washing Watcher @ Connectivity.Engineer

On 10/4/2021 3:56 PM, Matt Hoppes wrote:

Yes,
We've seen that.

On 10/4/21 4:33 PM, Eric Kuhnke wrote:
I am starting to see reports that in ISPs with very large numbers of 
residential users, customers are starting to press the factory-reset 
buttons on their home routers/modems/whatever, in an attempt to make 
Facebook work. This is resulting in much heavier than normal first 
tier support volumes. The longer it stays down the worse this is 
going to get.




On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan  wrote:


    On 10/4/21 12:11, b...@theworld.com  wrote:
 >
 > Although I believe it's generally true that if a company appears
 > prominently in the news it's liable to be attacked I assume 
because
 > the miscreants sit around thinking "hmm, who shall we attack 
today oh

 > look at that shiny headline!" I'd hate to ascribe any altruistic
 > motivation w/o some evidence like even a credible twitter post 
(maybe

 > they posted that on FB? :-)

    I personally believe that the outage was caused by human error 
and not

    something malicious. Time will tell.

    However, if you missed the 60 Minutes piece, it was a former 
employee

    who spoke out with some rather powerful observations. I don't think
    that
    this type of worldwide outage was caused by an outside bad actor. 
It is

    certainly within the realm of possibility that it was an inside job.

    In other news:

https://twitter.com/disclosetv/status/1445100931947892736?s=20

    --     Jay Hennigan - j...@west.net 
    Network Engineering - CCIE #7880
    503 897-8550 - WB6RDV



Re: massive facebook outage presently

2021-10-05 Thread David Andrzejewski
I find it hilarious and ironic that their CTO had to use a competitor’s 
platform to confirm their outage. 


- dave

> On Oct 4, 2021, at 16:45, Hank Nussbacher  wrote:
> 
> On 04/10/2021 22:05, Jason Kuehl wrote:
> 
> BGP related:
> https://twitter.com/SGgrc/status/1445116435731296256
> as also related by FB CTO:
> https://twitter.com/atoonk/status/1445121351707070468
> 
> -Hank
> 
>> https://twitter.com/disclosetv/status/1445100931947892736?s=20 
>> 
>> On Mon, Oct 4, 2021 at 3:01 PM Tony Wicks wrote:
>> Didn't write that part of the automation script and that coder left...
>> > I got a mail that Facebook was leaving NLIX. Maybe someone botched the
>> > script so they took down all BGP sessions instead of just NLIX and now
>> > they can't access the equipment to put it back... :-)
>> -- 
>> Sincerely,
>> Jason W Kuehl
>> Cell 920-419-8983
>> jason.w.ku...@gmail.com 
> 


Re: massive facebook outage presently

2021-10-05 Thread Jorge Amodio
How come such a large operation does not have out-of-band access in
case of emergencies ???

Somebody's getting fired !

-J


On Mon, Oct 4, 2021 at 3:51 PM Aaron C. de Bruyn via NANOG 
wrote:

> It looks like it might take a while according to a news reporter's tweet:
>
> "Was just on phone with someone who works for FB who described employees
> unable to enter buildings this morning to begin to evaluate extent of
> outage because their badges weren’t working to access doors."
>
> https://twitter.com/sheeraf/status/1445099150316503057?s=20
>
> -A
>
> On Mon, Oct 4, 2021 at 1:41 PM Eric Kuhnke  wrote:
>
>> I am starting to see reports that in ISPs with very large numbers of
>> residential users, customers are starting to press the factory-reset
>> buttons on their home routers/modems/whatever, in an attempt to make
>> Facebook work. This is resulting in much heavier than normal first tier
>> support volumes. The longer it stays down the worse this is going to get.
>>
>>
>>
>> On Mon, Oct 4, 2021 at 3:30 PM Jay Hennigan  wrote:
>>
>>> On 10/4/21 12:11, b...@theworld.com wrote:
>>> >
>>> > Although I believe it's generally true that if a company appears
>>> > prominently in the news it's liable to be attacked I assume because
>>> > the miscreants sit around thinking "hmm, who shall we attack today oh
>>> > look at that shiny headline!" I'd hate to ascribe any altruistic
>>> > motivation w/o some evidence like even a credible twitter post (maybe
>>> > they posted that on FB? :-)
>>>
>>> I personally believe that the outage was caused by human error and not
>>> something malicious. Time will tell.
>>>
>>> However, if you missed the 60 Minutes piece, it was a former employee
>>> who spoke out with some rather powerful observations. I don't think that
>>> this type of worldwide outage was caused by an outside bad actor. It is
>>> certainly within the realm of possibility that it was an inside job.
>>>
>>> In other news:
>>>
>>> https://twitter.com/disclosetv/status/1445100931947892736?s=20
>>>
>>> --
>>> Jay Hennigan - j...@west.net
>>> Network Engineering - CCIE #7880
>>> 503 897-8550 - WB6RDV
>>>
>>


Re: massive facebook outage presently

2021-10-05 Thread PJ Capelli via NANOG
Seems unlikely that FB internal controls would allow such a backdoor ...

"Never to get lost, is not living" - Rebecca Solnit

Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐

On Monday, October 4th, 2021 at 4:12 PM, Baldur Norddahl 
 wrote:

> On Mon, 4 Oct 2021 at 21:58, Michael Thomas  wrote:
> 

> > On 10/4/21 11:48 AM, Luke Guillory wrote:
> > 

> > > I believe the original change was 'automatic' (as in configuration done 
> > > via a web interface). However, now that connection to the outside world 
> > > is down, remote access to those tools doesn't exist anymore, so the 
> > > emergency procedure is to gain physical access to the peering routers and 
> > > do all the configuration locally.
> > 

> > Assuming that this is what actually happened, what should fb have done 
> > different (beyond the obvious of not screwing up the immediate issue)? This 
> > seems like it's a single point of failure. Should all of the BGP speakers 
> > have been dual homed or something like that? Or should they not have been 
> > mixing ops and production networks? Sorry if this sounds dumb.
> 

> Facebook is a huge network. It is doubtful that what is going on is this 
> simple. So I will make no guesses to what Facebook is or should be doing.
> 

> However, the traditional way for us small-timers is to have a backdoor using 
> someone else's network. Nowadays this could be a simple 4G/5G router with a 
> VPN to a terminal server that allows the operator to configure the equipment 
> through the console port even when the config is completely destroyed.
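
A minimal sketch of that kind of backdoor on the far end, assuming a small 
Linux terminal server sitting behind an LTE router with a WireGuard tunnel 
back to the NOC; the key paths, addresses and endpoint name are placeholders:

# Bring up a WireGuard tunnel that stays reachable even if the production
# config and BGP are gone; persistent-keepalive keeps it pinned through LTE NAT.
$ ip link add dev wg0 type wireguard
$ ip address add 10.255.0.2/24 dev wg0
$ wg set wg0 private-key /etc/wireguard/oob.key \
      peer <NOC-public-key> endpoint vpn.noc.example.net:51820 \
      allowed-ips 10.255.0.0/24 persistent-keepalive 25
$ ip link set wg0 up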
> 

> Regards,
> 

> Baldur



RE: Facebook post-mortems...

2021-10-05 Thread Jean St-Laurent via NANOG
Maybe the impact of withdrawing those routes to their NSes could have been 
mitigated by having the NSes in separate entities.

Let's check how these big companies are spreading their NS's.

$ dig +short facebook.com NS
d.ns.facebook.com.
b.ns.facebook.com.
c.ns.facebook.com.
a.ns.facebook.com.

$ dig +short google.com NS
ns1.google.com.
ns4.google.com.
ns2.google.com.
ns3.google.com.

$ dig +short apple.com NS
a.ns.apple.com.
b.ns.apple.com.
c.ns.apple.com.
d.ns.apple.com.

$ dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

$ dig +short netflix.com NS
ns-1372.awsdns-43.org.
ns-1984.awsdns-56.co.uk.
ns-659.awsdns-18.net.
ns-81.awsdns-10.com.

Amazon and Netflix seem to not keep all their eggs in the same basket. From a 
first look, they seem more resilient than facebook.com, google.com and apple.com.
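
A quick way to put numbers on that spread is to map each NS to its origin AS; 
a rough sketch, assuming dig, whois and the Team Cymru IP-to-ASN service are 
available:

$ for d in facebook.com google.com apple.com amazon.com netflix.com; do
    echo "== $d"
    for ns in $(dig +short NS $d); do
      ip=$(dig +short A $ns | head -1)
      # one "AS | IP | prefix | ... | AS name" line per nameserver
      [ -n "$ip" ] && whois -h whois.cymru.com " -v $ip" | tail -1
    done
  done

If every line for a domain shows the same origin AS, and the NSes all live 
under the very zone they serve (as with facebook.com), a single routing or 
zone problem can take them all out at once.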

Jean

-Original Message-
From: NANOG  On Behalf Of Jeff 
Tantsura
Sent: October 5, 2021 2:18 AM
To: William Herrin 
Cc: nanog@nanog.org
Subject: Re: Facebook post-mortems...

129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering 
all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 
15:40 UTC.

Cheers,
Jeff

> On Oct 4, 2021, at 22:45, William Herrin  wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure 
> operating systems. Chef is written in Ruby. Ruby has something called 
> Monkey Patches. This is where at an arbitrary location in the code you 
> re-open an object defined elsewhere and change its methods.
> 
> Chef doesn't always do the right thing. You tell Chef to remove an RPM 
> and it does. Even if it has to remove half the operating system to 
> satisfy the dependencies. If you want it to do something reasonable, 
> say throw an error because you didn't actually tell it to remove half 
> the operating system, you have a choice: spin up a fork of chef with a 
> couple patches to the chef-rpm interaction or just monkey-patch it in 
> one of your chef recipes.
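
The dependency cascade Bill describes is easy to reproduce outside Chef; a 
hedged illustration on any RPM-based host (the package name is a placeholder, 
and you would not want to confirm the second prompt):

$ rpm -e <some-library-rpm>      # bare rpm refuses and lists the dependents
$ yum remove <some-library-rpm>  # a depsolver instead offers to remove every
                                 # package that depends on it -- the "half the
                                 # operating system" case, if nothing in the
                                 # recipe treats that as an error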
> 
> Regards,
> Bill Herrin
> 
> --
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/



Re: facebook outage

2021-10-05 Thread Niels Bakker

* jllee9...@gmail.com (John Lee) [Tue 05 Oct 2021, 01:06 CEST]:
I was seeing NXDOMAIN errors, so I wonder if they had a DNS outage 
of some sort??


Were you using host(1)? Please don't; use dig(1) instead. As far as I 
know, no NXDOMAINs were being returned at any point, but because of the 
SERVFAILs, host(1) was silently appending your local search domain to 
your query, which would lead to misleading NXDOMAIN output.
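
One way to keep the search list out of the picture entirely is to query the 
absolute name with a trailing dot; a short sketch:

$ dig facebook.com. A     # absolute name; during the outage this showed the
                          # honest status, SERVFAIL
$ host facebook.com.      # the trailing dot also stops host(1) from appending
                          # your local search domain and reporting NXDOMAIN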



-- Niels.


Re: IRR for IX peers

2021-10-05 Thread Mark Tinka




On 10/5/21 09:29, Łukasz Bromirski wrote:


…like a, say, „single pane of glass”? ;)


Oh dear Lord :-)...

Mark.


Re: IRR for IX peers

2021-10-05 Thread Łukasz Bromirski


…like a, say, „single pane of glass”? ;)

-- 
./

> On 5 Oct 2021, at 06:25, Mark Tinka  wrote:
> 
> 
> 
>> On 10/4/21 21:55, Nick Hilliard wrote:
>> 
>>  Nearly 30 years on, this is still the state of the art.
> 
> Not unlike an NMS... you still can't walk into a shop and just buy one that 
> works out of the box :-).
> 
> Mark.


Re: Facebook post-mortems...

2021-10-05 Thread Carsten Bormann
On 5. Oct 2021, at 07:42, William Herrin  wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure
> operating systems. Chef is written in Ruby. Ruby has something called
> Monkey Patches. 

While Ruby indeed has a chain-saw (read: powerful, dangerous, still the tool of 
choice in certain cases) in its toolkit that is generally called 
“monkey-patching”, I think Michael was actually thinking about the “chaos 
monkey”, 
https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
https://netflix.github.io/chaosmonkey/

That was a Netflix invention, but see also
https://en.wikipedia.org/wiki/Chaos_engineering#Facebook_Storm

Grüße, Carsten



Re: Facebook post-mortems...

2021-10-05 Thread Jeff Tantsura
129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering 
all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 
15:40 UTC.
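
For anyone wanting to verify that kind of withdrawal from the outside, the 
public RouteViews collector is one option (any looking glass works similarly); 
a rough sketch:

$ telnet route-views.routeviews.org
  # follow the banner for the read-only login, then at the router prompt:
  show ip bgp 129.134.30.0/23
  # if the prefix has been withdrawn everywhere, the collector reports that
  # the network is not in its table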

Cheers,
Jeff

> On Oct 4, 2021, at 22:45, William Herrin  wrote:
> 
> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas  wrote:
>> They have a monkey patch subsystem. Lol.
> 
> Yes, actually, they do. They use Chef extensively to configure
> operating systems. Chef is written in Ruby. Ruby has something called
> Monkey Patches. This is where at an arbitrary location in the code you
> re-open an object defined elsewhere and change its methods.
> 
> Chef doesn't always do the right thing. You tell Chef to remove an RPM
> and it does. Even if it has to remove half the operating system to
> satisfy the dependencies. If you want it to do something reasonable,
> say throw an error because you didn't actually tell it to remove half
> the operating system, you have a choice: spin up a fork of chef with a
> couple patches to the chef-rpm interaction or just monkey-patch it in
> one of your chef recipes.
> 
> Regards,
> Bill Herrin
> 
> -- 
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/