[VoiceOps] Plivo Domain Outage Post Mortem (fwd)

Peter Beckman Mon, 08 May 2017 18:05:46 -0700

Thought the list would like to see the Plivo outage post mortem.


Make sure you renew your domains people!

Beckman
---------------------------------------------------------------------------
Peter Beckman                                                  Internet Guy
[email protected]                                 http://www.angryox.com/
---------------------------------------------------------------------------

---------- Forwarded message ----------
Date: Mon, 8 May 2017 16:11:52 EDT
From: Plivo <[email protected]>
Subject: Plivo Domain Outage Post Mortem

Plivo Domain Outage Post Mortem & Analysis

Dear Valued Customer,


On April 23, 2017, we experienced an outage on our primary domain (plivo.com) 
and related subdomains. Although we had been sharing regular updates and 
various workarounds during the domain outage, we would like to communicate our 
root cause analysis, and the steps we will take to ensure this doesn’t happen 
in the future.



** What happened and what was the impact?
------------------------------------------------------------

At 10:39 UTC on April 23, 2017, our team noticed that our primary domain 
(plivo.com) and all of its related subdomains were unresolvable from most 
countries, which resulted in an outage for customers across all services.


Our on-call team immediately began taking action, and within the next 4 hours 
provided workarounds for our customers that ensured access to most of our 
services. We communicated these updates to customers via Twitter, a live 
status update document, and their respective account managers.


Over the next 18 hours, while working with our domain registrar, we isolated 
and corrected multiple configuration and provisioning errors. By 12:30 UTC on 
April 24, 2017, all of our services were back up and resolvable through most 
DNS providers globally. However, a small percentage of our Voice and SMS 
customers saw increased latency and errors over the next few hours, which were 
resolved immediately.



** Timeline
------------------------------------------------------------


April 23, 2017
* 10:39 UTC: plivo.com and its subdomains could not be resolved from various 
locations globally. The On-Call team immediately started investigating the 
issue.
* 10:42 UTC: Our domain registrar showed our domain as expired.
* 10:50 UTC: Our engineers contacted our domain registrar to understand the 
cause and resolve it, while in parallel starting to implement a contingency 
action plan.
* 13:30 UTC: A patch was deployed to all of our servers to temporarily provide 
a workaround for the unresolvable domain, and switch all of our internal tools 
and servers to an internal domain name.
* 14:00 UTC: Our internal tools and servers were resolvable.
* 14:34 UTC: We communicated a workaround to our customers with the temporary 
IPs of our services, which ensured that service was not disrupted.
* 15:00 UTC: Outbound calls to PSTN came back online.
* 15:30 UTC: We communicated new domain names to our carriers and worked to 
re-establish Inbound Calls. At this time we saw 60% of our voice traffic back 
up.
* 17:00 UTC: We released alternative links to our WebSDK on the live update 
document.
* 17:30 UTC: Our SIP registrar (phone.plivo.com) was patched to accept direct 
connections using only its IP address, which allowed our customers to register.
* 19:46 UTC: Inbound SMS was back in service.
* 20:00 UTC: plivo.io was provisioned by setting up another cluster as a 
workaround.
* 22:18 UTC: api.plivo.io came online as a backup for api.plivo.com.
* 22:26 UTC: manage.plivo.io came online as a backup for manage.plivo.com.
* 22:30 UTC: We released a new version of our WebSDK using a different domain 
to mitigate the issues experienced by customers. At this time 80% of our Voice 
traffic was back up.
* 23:12 UTC: We released new plivo.io domain names for customers using custom 
inbound carriers.



April 24, 2017
* 00:38 UTC: phone.plivo.io and app.plivo.io came online, as temporary 
replacements for phone.plivo.com and app.plivo.com.
* 03:30 UTC: We saw some plivo.com domains starting to resolve on their 
original IP addresses. We kept monitoring the propagation of our nameservers.
* 09:30 UTC: Around 50% of the main DNS servers had been updated.
* 10:30 UTC: We asked some DNS Servers to refresh their caches for plivo.com to 
expedite the propagation.
* 13:00 UTC: 90% of DNS servers had been updated correctly with Plivo’s name 
servers.
* 16:00 UTC: All Plivo services were fully operational and 99% of DNS servers 
had been updated correctly with Plivo’s name servers.
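
For readers who want to track propagation the same way, the percentages above 
can be approximated by asking a set of public resolvers for the domain's NS 
records and counting how many return the expected set. The following is a 
minimal sketch only: the resolver labels and name server names are 
hypothetical, the answers are assumed to have been collected already (for 
example with dig), and this is not necessarily how Plivo measured it.

```python
from typing import Dict, Iterable


def _normalize(names: Iterable[str]) -> frozenset:
    """Lower-case names and drop the trailing root dot so sets compare cleanly."""
    return frozenset(n.lower().rstrip(".") for n in names)


def propagation_ratio(answers: Dict[str, Iterable[str]],
                      expected: Iterable[str]) -> float:
    """Return the fraction of resolvers whose NS answer equals the expected set."""
    if not answers:
        return 0.0
    want = _normalize(expected)
    matching = sum(1 for ns in answers.values() if _normalize(ns) == want)
    return matching / len(answers)


# Hypothetical data: NS answers gathered from four resolvers.
expected_ns = ["ns1.example.net", "ns2.example.net"]
collected = {
    "resolver-a": ["ns1.example.net.", "ns2.example.net."],
    "resolver-b": ["NS1.example.net", "ns2.example.net"],
    "resolver-c": ["old-ns.example.org"],   # still serving a stale answer
    "resolver-d": [],                        # no answer at all
}

ratio = propagation_ratio(collected, expected_ns)
print(f"{ratio:.0%} of sampled resolvers return the expected name servers")
```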




** Root-Cause Analysis
------------------------------------------------------------

Plivo’s primary domain (plivo.com) was set to renew automatically on April 17 
annually. However, due to a configuration error with our registrar, the domain 
expired instead of renewing automatically. We did not see any issues with our 
domain until April 23, 2017, because the registrar applied a grace period of 5 
days after expiration. The expiry also went unnoticed until the day of the 
incident because we never received any updates, warnings or notifications 
about the upcoming expiry of the domain.


When our team investigated further with the registrar, we found that the 
registrar had sent no notifications or alerts about the expiry of the domain 
and that auto-renewal for the domain had never been triggered. Although we 
still don’t have official confirmation of this from the registrar, we suspect 
it was due to a configuration error on their end.


Immediately after this we started working with the registrar to restore the 
domain. The first reprovisioning order for the domain remained stuck in a 
queued state for almost 4 hours and was never executed. We then escalated this 
to their team and retried the restoration manually, which also failed multiple 
times.


The official response we received from the registrar was that "Since the name server 
on the domains have expired it usually takes 12-24 hours to reprovision the name servers 
on the domain and some more time for the changes to propagate globally."


To get our services back online for our customers, we set up a temporary domain 
at “plivo.io” and pointed all of our services to this new domain. Then, we 
published this workaround in a live document that we updated throughout the 
incident.
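
The switch to a backup domain described above can also be handled on the 
client side. Below is a minimal sketch, assuming a client that tries an 
ordered list of hostnames and uses the first one that resolves. The function 
names are hypothetical and are not part of any Plivo SDK; the hostnames are 
the ones named in this post mortem.

```python
import socket
from typing import Callable, Optional, Sequence


def resolves(host: str, port: int = 443) -> bool:
    """Return True if the system resolver can resolve the hostname."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False


def pick_endpoint(candidates: Sequence[str],
                  is_resolvable: Callable[[str], bool] = resolves) -> Optional[str]:
    """Return the first candidate hostname that resolves, or None if none do."""
    for host in candidates:
        if is_resolvable(host):
            return host
    return None


# Primary first, backup second. The check is injectable so it can be
# exercised without touching the network.
API_HOSTS = ["api.plivo.com", "api.plivo.io"]


def simulate_primary_down(host: str) -> bool:
    """Hypothetical stand-in resolver: pretend only .io names resolve."""
    return host.endswith(".io")


chosen = pick_endpoint(API_HOSTS, is_resolvable=simulate_primary_down)
print("would connect to:", chosen)
```

Note that a resolvable name is not proof of a healthy service, so a real 
client would likely pair this with a connection or health check.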


After almost 18 hours of working with the registrar, at 03:30 UTC on April 24, 
2017, the order was finally executed successfully and we started seeing the 
name server update reflected at some DNS providers.



** When will the workarounds expire?
------------------------------------------------------------

How long can customers continue to use plivo.io as a workaround domain?

We will ensure that the workaround domain “plivo.io” will remain operational 
until May 31, 2017. We will decommission the domain after May 31, 2017.


How long can customers continue to hardcode IPs in /etc/hosts?

We advise customers to go back to the plivo.com domain names as soon as 
possible. Because of our elastic architecture, we cannot guarantee that these 
IPs will stay the same in the near future.


It is especially critical to use api.plivo.com instead of the temporary IPs 
that we provided during the incident. We will send reminders to all of our 
customers who switched to the temporary domain to revert to the original 
plivo.com domain and IPs.
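
To find leftover hardcoded entries, a hosts file can be scanned for any line 
that pins a plivo.com name to an IP. This is a minimal sketch under stated 
assumptions: the function name is hypothetical, the sample addresses come from 
the 192.0.2.0/24 documentation range, and they are not the temporary IPs that 
were published during the incident.

```python
from typing import List, Tuple


def find_pinned_hosts(hosts_text: str, domain: str = "plivo.com") -> List[Tuple[str, str]]:
    """Return (ip, hostname) pairs in hosts-file text that pin the domain or a subdomain."""
    pinned = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            bare = name.lower().rstrip(".")
            if bare == domain or bare.endswith("." + domain):
                pinned.append((ip, name))
    return pinned


# Hypothetical hosts-file content for illustration.
SAMPLE_HOSTS = """\
127.0.0.1   localhost
192.0.2.10  api.plivo.com      # temporary workaround
# 192.0.2.11 phone.plivo.com   (already commented out)
192.0.2.12  notplivo.com
"""

for ip, name in find_pinned_hosts(SAMPLE_HOSTS):
    print(f"remove pinned entry: {name} -> {ip}")
```

On a real machine the text would be read from /etc/hosts (or the platform's 
equivalent) instead of the sample string.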



** Related Service disruptions
------------------------------------------------------------

Between April 24 and 27, 2017, 30% of our customer traffic saw irregular 
service degradation and disruptions in our SMS, Voice API and WebRTC/SIP 
services.


These incidents are related to the maintenance that was originally planned for 
April 23, 2017. When the unexpected DNS issue hit, we were near the end of our 
deployment that had the purpose of strengthening our Voice platform by making 
phone.plivo.com more redundant. However, the new deployment created performance 
issues in the form of locking and latency on a specific database table that was 
accessed for most customers. This occurred whenever traffic for these services 
started spiking.


We worked on those issues and built a new internal service to optimize the 
volume of data processed to avoid elevated database writes and latencies. We 
also deployed patches on April 27, 2017 that improved the overall stability and 
performance of our platform, while also readying it for much higher workloads.



** What are we doing about it for the future?
------------------------------------------------------------

While this incident was due to an error in configuration and provisioning by 
our domain registrar, we take complete responsibility for this outage. We are 
responsible for ensuring uptime of our services to our customers. Clearly, with 
better checks and more thorough processes, we could have avoided the whole 
situation in spite of the error from the domain registrar.


This entire outage exposed some critical flaws in our dependency on our 3rd 
party service providers. To ensure we minimize impact on Plivo’s services by 
3rd party errors or issues, we have outlined a set of steps that we will 
initiate immediately:

1. Categorize all 3rd party services into three priority levels 
(i.e., P1, P2, P3), based on potential impact on Plivo’s services. Detail 
potential workarounds in the event of experiencing downtime from these 
services. Perform monthly, quarterly, bi-annual, and annual audits and reviews 
of all P1 and P2 services for renewal and configuration settings.
2. Plan and renew all related categories of services, such as domain names and 
TLS certificates, for the longest period possible. We have already 
executed this for our domain by renewing it for the next 10 years. We will 
execute the same strategy for all of our certificates and related services.
3. Set up automated monitoring to alert and notify all stakeholders in case any 
of our domains or similar services come within a month of their expiry date. 
Stakeholders will have the authority and access to take action immediately. 
This will avoid dependency on vendor notifications.
4. When possible, update our SDKs to be able to dynamically update domain 
endpoints, so a switch is possible at the customer's end without any 
application code changes.
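
The expiry monitoring in step 3 can be sketched as a simple date comparison. 
This is an illustration under stated assumptions, not Plivo's implementation: 
the function name and the certificate entry are hypothetical, "within a month" 
is taken to mean 30 days, and the expiry dates are assumed to come from an 
inventory you maintain yourself (or from WHOIS/RDAP and certificate lookups) 
rather than from vendor notifications.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

# Assumption: "within a month" is interpreted as 30 days.
WARN_WINDOW = timedelta(days=30)


def expiring_soon(expiries: Dict[str, date],
                  today: date,
                  window: timedelta = WARN_WINDOW) -> List[Tuple[str, int]]:
    """Return (name, days_left) for items expiring within the window, soonest first.

    Items that have already expired are included with a negative days_left.
    """
    due = []
    for name, expires_on in expiries.items():
        days_left = (expires_on - today).days
        if days_left <= window.days:
            due.append((name, days_left))
    return sorted(due, key=lambda item: item[1])


# Hypothetical inventory. April 17 is the renewal date given in the
# root-cause analysis; the certificate entry is made up for illustration.
inventory = {
    "plivo.com (domain)": date(2017, 4, 17),
    "example TLS certificate": date(2017, 12, 1),
}

for name, days_left in expiring_soon(inventory, today=date(2017, 3, 20)):
    print(f"ALERT: {name} expires in {days_left} days")
```

Run daily from a scheduler, a check like this would have started alerting 
several weeks before April 17, independently of the registrar's own emails.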



Our focus has always been to provide you the best quality of service and 
uptime, and this disruption clearly fell short of expectations.


We apologize for the disruption and the inconvenience that this has caused your 
business and your customers. We will work harder to earn back your trust by 
executing all of the steps outlined above.


Sincerely,
The Plivo Team


Copyright © 2017 Plivo All rights reserved / View as Webpage 
(http://mailchi.mp/7bab18c30aec/plivo-update-all-services-back-up-697689?e=92ecfbb00b)

_______________________________________________
VoiceOps mailing list
[email protected]
https://puck.nether.net/mailman/listinfo/voiceops
