> On Jun 24, 2019, at 9:39 PM, Ross Tajvar <r...@tajvar.io> wrote:
> 
> 
> On Mon, Jun 24, 2019 at 9:01 PM Jared Mauch <ja...@puck.nether.net> wrote:
> >
> > > On Jun 24, 2019, at 8:50 PM, Ross Tajvar <r...@tajvar.io> wrote:
> > >
> > > Maybe I'm in the minority here, but I have higher standards for a T1 than 
> > > any of the other players involved. Clearly several entities failed to do 
> > > what they should have done, but Verizon is not a small or inexperienced 
> > > operation. Taking 8+ hours to respond to a critical operational problem 
> > > is what stood out to me as unacceptable.
> >
> > Are you talking about a press response or a technical one?  The impacts I 
> > saw were for around 2h or so based on monitoring I’ve had up since 2007.  
> > Not great but far from the worst as Tom mentioned.  I’ve seen people cease 
> > to announce IP space we reclaimed from them for months (or years) because 
> > of stale config.  I’ve also seen routes come back from the dead because 
> > they were pinned to an interface that was down for 2 years but never fully 
> > cleaned up.  (Then the telco looped the circuit, interface came up, route 
> > in table, announced globally — bad day all around).
> >
> 
> A technical one - see below from CF's blog post:
> "It is unfortunate that while we tried both e-mail and phone calls to reach 
> out to Verizon, at the time of writing this article (over 8 hours after the 
> incident), we have not heard back from them, nor are we aware of them taking 
> action to resolve the issue."

I don’t know if CF is a customer of VZ or not; it’s likely easy enough to find 
out with a looking glass somewhere.  They were perhaps a few of the 20k 
prefixes impacted (as reported by others).

We have heard from them, but not from a lot of the other affected parties; most 
of them likely don’t do business with VZ directly.  I’m not sure VZ is going to 
contact them all, or has the capability to respond to them all (or to respond 
to non-customers except via a press release).

> > > And really - does it matter if the protection *was* there but something 
> > > broke it? I don't think it does. Ultimately, Verizon failed to implement 
> > > correct protections on their network. And then failed to respond when it 
> > > became a problem.
> >
> > I think it does matter.  As I said in my other reply, people do things like 
> > drop ACLs to debug.  Perhaps that’s unsafe, but it is something you do to 
> > debug.  Not knowing what happened, I dunno.  It is also 2019 so I hold 
> > networks to a higher standard than I did in 2009 or 1999.
> >
> 
> Dropping an ACL is fine, but then you have to clean it up when you're done. 
> Your customers don't care that you almost didn't have an outage because you 
> almost did your job right. Yeah, there's a difference between not following 
> policy and not having a policy, but neither one is acceptable behavior from a 
> T1 imo. If it's that easy to cause an outage by not following policy, then I 
> argue that the policy should be better, or something should be better - 
> monitoring, automation, sanity checks, etc. There are lots of ways to solve 
> that problem. And in 2019 I really think there's no excuse for a T1 not to be 
> doing that kind of thing.

I don’t know about the outage (other than what I observed).  I offered some 
suggestions for people to help prevent it from happening, so I’ll leave it 
there.  We all make mistakes, I’ve been part of many and I’m sure that list 
isn’t yet complete.
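
On the sanity-check point, here is a minimal sketch of the kind of check being 
described.  The peer names, ceilings, and the data-collection stub are all 
hypothetical; real counts would come from your routers (CLI/SNMP/NETCONF) or a 
route collector:

#!/usr/bin/env python3
"""Minimal prefix-count sanity check, in the "monitoring / sanity
checks" spirit discussed above.  Peer names, ceilings, and the
data-collection stub are hypothetical placeholders."""

# Hypothetical per-peer ceilings, analogous to a BGP max-prefix limit.
PREFIX_CEILINGS = {
    "peer-a": 50,        # small downstream: should never send many routes
    "peer-b": 850_000,   # transit peer sending a full table
}

# Flag any jump bigger than this fraction of the previous snapshot.
JUMP_THRESHOLD = 0.10


def get_received_prefix_counts():
    """Stub: return {peer: prefixes currently received from that peer}."""
    return {"peer-a": 73, "peer-b": 782_000}


def check(current, previous, ceilings, jump_threshold=JUMP_THRESHOLD):
    """Return human-readable alerts for anything that looks like a leak."""
    alerts = []
    for peer, count in current.items():
        ceiling = ceilings.get(peer)
        if ceiling is not None and count > ceiling:
            alerts.append("%s: %d prefixes exceeds ceiling of %d"
                          % (peer, count, ceiling))
        prior = previous.get(peer, 0)
        if prior > 0:
            jump = (count - prior) / prior
            if jump > jump_threshold:
                alerts.append("%s: prefix count jumped %.0f%% (%d -> %d)"
                              % (peer, jump * 100, prior, count))
    return alerts


if __name__ == "__main__":
    previous_snapshot = {"peer-a": 45, "peer-b": 780_000}
    for line in check(get_received_prefix_counts(), previous_snapshot,
                      PREFIX_CEILINGS):
        print("ALERT:", line)

Nothing fancy, but run periodically against the edge it would flag a sudden 
20k-prefix leak from a small peer long before the tickets start coming in.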

- Jared
