Thought the list would like to see the Plivo outage post mortem.
Make sure you renew your domains, people!
Beckman
---------------------------------------------------------------------------
Peter Beckman Internet Guy
[email protected] http://www.angryox.com/
---------------------------------------------------------------------------
---------- Forwarded message ----------
Date: Mon, 8 May 2017 16:11:52 EDT
From: Plivo <[email protected]>
Subject: Plivo Domain Outage Post Mortem
Plivo Domain Outage Post Mortem & Analysis
Dear Valued Customer,
On April 23, 2017, we experienced an outage on our primary domain (plivo.com)
and related subdomains. Although we had been sharing regular updates and
various workarounds during the domain outage, we would like to communicate our
root cause analysis, and the steps we will take to ensure this doesn’t happen
in the future.
** What happened and what was the impact?
------------------------------------------------------------
At 10:39 UTC on April 23, 2017, our team noticed that our primary domain
(plivo.com) and all of its related subdomains were unresolvable from most
countries, which resulted in an outage for customers across all services.
Our on-call team immediately began taking action, and within the next 4 hours
provided workarounds for our customers that ensured access to most of our
services. We communicated these updates to customers via Twitter, a live
status update document, and their respective account managers.
Over the next 18 hours, while working with our domain registrar, we isolated
and corrected multiple configuration and provisioning errors. By 12:30 UTC on
April 24, 2017, all of our services were back up using most DNS providers
globally. However, a small percentage of our Voice and SMS customers
experienced increased latency and errors over the next few hours, which we
resolved immediately.
** Timeline
------------------------------------------------------------
April 23, 2017
* 10:39 UTC: plivo.com and its subdomains could not be resolved from various
locations globally. The On-Call team immediately started investigating the
issue.
* 10:42 UTC: Our domain registrar showed the domain as expired.
* 10:50 UTC: Our engineers contacted our domain registrar to understand the
reason and resolve it with them, while in parallel starting to implement a
contingency action plan.
* 13:30 UTC: A patch was deployed to all of our servers to temporarily provide
a workaround for the unresolvable domain, and switch all of our internal tools
and servers to an internal domain name.
* 14:00 UTC: Our internal tools and servers were resolvable.
* 14:34 UTC: We communicated a workaround to our customers with the temporary
IPs of our services, which ensured that service was not disrupted.
* 15:00 UTC: Outbound calls to PSTN came back online.
* 15:30 UTC: We communicated new domain names to our carriers and worked to
re-establish Inbound Calls. At this time we saw 60% of our voice traffic back
up.
* 17:00 UTC: We released alternative links to our WebSDK on the live update
document.
* 17:30 UTC: Our SIP registrar (phone.plivo.com) was patched so that it accepted
direct connections that use only the IP address, which allowed our customers to
register.
* 19:46 UTC: Inbound SMS was back in service.
* 20:00 UTC: plivo.io was provisioned by setting up another cluster as a
workaround.
* 22:18 UTC: api.plivo.io came online as a backup for api.plivo.com.
* 22:26 UTC: manage.plivo.io came online as a backup for manage.plivo.com.
* 22:30 UTC: We released a new version of our WebSDK using a different domain
to mitigate the issues experienced by customers. At this time 80% of our Voice
traffic was back up.
* 23:12 UTC: We released new plivo.io domain names for customers using custom
inbound carriers.
April 24, 2017
* 00:38 UTC: phone.plivo.io and app.plivo.io came online, as temporary
replacements for phone.plivo.com and app.plivo.com.
* 03:30 UTC: We saw some plivo.com domains starting to resolve on their
original IP addresses. We kept monitoring the propagation of our nameservers.
* 09:30 UTC: Around 50% of the main DNS servers had been updated.
* 10:30 UTC: We asked the operators of some DNS servers to refresh their caches
for plivo.com to expedite the propagation.
* 13:00 UTC: 90% of DNS servers had been updated correctly with Plivo’s name
servers.
* 16:00 UTC: All Plivo services were fully operational, and 99% of DNS servers
had been updated correctly with Plivo’s name servers.
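For readers who want to track this kind of propagation themselves, percentages like the ones in the timeline can be approximated by asking a set of resolvers for the domain's NS records and counting how many return the expected name servers. The sketch below only illustrates that counting step and is not how Plivo measured it: the function name, the resolver labels, and the sample answers are all hypothetical, and the DNS queries themselves are left out so the logic stays self-contained.

```python
def propagation_percent(answers, expected_ns):
    """Return the percentage of resolvers whose NS answer matches expected_ns.

    answers: dict mapping a resolver label to the set of name servers it
             returned (an empty set means the lookup failed).
    expected_ns: set of name servers the domain should be delegated to.
    """
    if not answers:
        return 0.0
    # Normalize case and trailing dots so "NS1.example.net." == "ns1.example.net".
    expected = {ns.lower().rstrip(".") for ns in expected_ns}
    matching = sum(
        1
        for returned in answers.values()
        if {ns.lower().rstrip(".") for ns in returned} == expected
    )
    return 100.0 * matching / len(answers)


# Hypothetical sample data: two of four resolvers see the restored delegation.
expected = {"ns1.example.net", "ns2.example.net"}
sample = {
    "resolver-a": {"ns1.example.net.", "NS2.example.net"},
    "resolver-b": {"ns1.example.net", "ns2.example.net"},
    "resolver-c": set(),                          # lookup failed
    "resolver-d": {"expired.registrar.example"},  # stale answer
}
print(propagation_percent(sample, expected))
```

In practice the answers would come from querying each resolver directly, and cached records keep some resolvers stale until their TTLs run out.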
** Root-Cause Analysis
------------------------------------------------------------
Plivo’s primary domain (plivo.com) was set to renew automatically each year on
April 17. However, due to a configuration error with our registrar, the domain
expired instead of renewing automatically. We did not see any issues with our
domain until April 23, 2017, because the registrar applies a grace period of 5
days after expiration. Unfortunately, the problem also went unnoticed until the
day of the incident because we never received any updates, warnings, or
notifications regarding the possible expiry of the domain.
When our team investigated further with the registrar, we found that the
registrar had sent no notifications or alerts regarding the expiry of the
domain and that the auto-renewal for the domain was never triggered. Although
we still don’t have official confirmation of this from the registrar, we
suspect it is due to a configuration error at their end.
Immediately after this, we started working with the registrar to restore the
domain. The first reprovisioning order for the domain remained stuck in a
queued state for almost 4 hours without being executed. We escalated this to
their team and retried the restoration manually, which also failed multiple
times.
The official response we received from the registrar was that "Since the name server
on the domains have expired it usually takes 12-24 hours to reprovision the name servers
on the domain and some more time for the changes to propagate globally."
To get our services back online for our customers, we set up a temporary domain
at “plivo.io” and pointed all of our services to this new domain. Then, we
published this workaround in a live document that we updated throughout the
incident.
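As an illustration of how a client could cope with this kind of domain switch, the sketch below tries the plivo.com hostname first and falls back to the same label under plivo.io. It is a sketch under assumptions, not Plivo's SDK behavior: the function names are hypothetical, and the "same label, different suffix" rule is inferred only from the four hostnames named in the timeline (api, manage, phone, app). The resolver is passed in as a parameter so the example can run without touching the network.

```python
import socket

PRIMARY_SUFFIX = ".plivo.com"
WORKAROUND_SUFFIX = ".plivo.io"  # temporary domain named in the post-mortem


def candidate_hosts(primary_host):
    """Return the primary hostname followed by its workaround equivalent."""
    host = primary_host.lower().rstrip(".")
    if not host.endswith(PRIMARY_SUFFIX):
        return [host]  # not a plivo.com subdomain, nothing to map
    label = host[: -len(PRIMARY_SUFFIX)]
    return [host, label + WORKAROUND_SUFFIX]


def first_resolvable(hosts, resolve=socket.gethostbyname):
    """Return the first hostname that resolves, or None if none do."""
    for host in hosts:
        try:
            resolve(host)
            return host
        except OSError:  # socket.gaierror is a subclass of OSError
            continue
    return None


def outage_resolver(host):
    """Stub resolver that simulates the outage: plivo.com names fail."""
    if host.endswith(PRIMARY_SUFFIX):
        raise OSError("name does not resolve")
    return "192.0.2.10"  # documentation address, not a real Plivo IP


print(candidate_hosts("api.plivo.com"))
print(first_resolvable(candidate_hosts("api.plivo.com"), resolve=outage_resolver))
```

Trying the primary name first means clients return to plivo.com on their own once it resolves again, which matters because the workaround domain is temporary.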
After almost 18 hours of working with the registrar, at 03:30 UTC on April 24,
2017, the order was finally executed successfully and we started seeing the
name server update reflected on some DNS providers.
** When will the workarounds expire?
------------------------------------------------------------
How long can customers continue to use plivo.io as a workaround domain?
We will ensure that the workaround domain “plivo.io” will remain operational
until May 31, 2017. We will decommission the domain after May 31, 2017.
How long can customers continue to hardcode IPs in /etc/hosts?
We advise customers to go back to the plivo.com domain names as soon as
possible. Because of our elastic architecture, we cannot guarantee that these
IPs will stay the same in the near future.
It is especially critical to use api.plivo.com instead of the temporary IPs
that we provided during the incident. We will send reminders to all of our
customers who switched to the temporary domain to revert to the original
plivo.com domain and IPs.
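To help find leftover pins before those IPs change, a small script can scan hosts-file text for Plivo hostnames. This is a minimal sketch under assumptions: the function name is hypothetical, the sample file content is invented, and the addresses are documentation-range placeholders rather than the IPs Plivo actually published.

```python
def pinned_plivo_entries(hosts_text, suffix="plivo.com"):
    """Return (ip, hostname) pairs from hosts-file text that pin a Plivo name."""
    pinned = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            name = name.lower()
            if name == suffix or name.endswith("." + suffix):
                pinned.append((ip, name))
    return pinned


# Hypothetical hosts-file content with placeholder addresses.
sample = """\
127.0.0.1   localhost
# temporary workaround, remove once plivo.com resolves again
192.0.2.10  api.plivo.com
192.0.2.11  phone.plivo.com app.plivo.com  # more temporary pins
"""
for ip, name in pinned_plivo_entries(sample):
    print(f"{name} is pinned to {ip}")
```

On a real machine the text would be read from the hosts file itself, for example with `open("/etc/hosts").read()` on Linux or macOS.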
** Related Service disruptions
------------------------------------------------------------
Between April 24 and 27, 2017, 30 percent of our customer traffic saw irregular
service degradation and disruptions in our SMS, Voice API, and WebRTC/SIP services.
These incidents are related to the maintenance that was originally planned for
April 23, 2017. When the unexpected DNS issue hit, we were near the end of our
deployment that had the purpose of strengthening our Voice platform by making
phone.plivo.com more redundant. However, the new deployment created performance
issues in the form of locking and latency on a specific database table that was
accessed for most customers. This occurred whenever traffic for these services
spiked.
We worked on those issues and built a new internal service to optimize the
volume of data processed to avoid elevated database writes and latencies. We
also deployed patches on April 27, 2017 that improved the overall stability and
performance of our platform, while also readying it for much higher workloads.
** What are we doing about it for the future?
------------------------------------------------------------
While this incident was due to an error in configuration and provisioning by
our domain registrar, we take complete responsibility for this outage. We are
responsible for ensuring the uptime of our services for our customers. Clearly,
with better checks and more thorough processes, we could have avoided the whole
situation in spite of the error from the domain registrar.
This entire outage exposed some critical flaws in our dependency on our 3rd
party service providers. To ensure we minimize impact on Plivo’s services by
3rd party errors or issues, we have outlined a set of steps that we will
initiate immediately:
1. Categorize all 3rd party services into three priority levels
(i.e., P1, P2, P3), based on potential impact on Plivo’s services. Detail
potential workarounds in the event of experiencing downtime from these
services. Perform monthly, quarterly, bi-annual, and annual audits and reviews
of all P1 and P2 services for renewal and configuration settings.
2. Plan and renew all related categories of services, such as domain names and
TLS certificates, for the longest period possible. We have already
executed this for our domain by renewing it for the next 10 years. We will
execute the same strategy for all of our certificates and related services.
3. Set up automated monitoring to alert and notify all stakeholders whenever any
of our domains or similar services comes within a month of its expiry date.
Stakeholders will have the authority and access to take action immediately.
This will avoid dependency on vendor notifications.
4. When possible, update our SDKs to be able to dynamically update domain
endpoints, so a switch is possible at the customer's end without any
application code changes.
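The expiry monitoring described in step 3 could look roughly like the sketch below, which flags anything that expires within the alert window. This is an illustration rather than Plivo's actual tooling: the function name, the inventory entries, and their dates are hypothetical, and "within a month" is approximated here as 30 days.

```python
from datetime import date, timedelta

ALERT_WINDOW = timedelta(days=30)  # assumption: "within a month" taken as 30 days


def expiring_soon(expiry_dates, today, window=ALERT_WINDOW):
    """Return (name, days_left) for assets expiring within the window.

    Results are sorted soonest first. Assets that have already expired are
    included with a negative days_left.
    """
    alerts = []
    for name, expires_on in expiry_dates.items():
        days_left = (expires_on - today).days
        if days_left <= window.days:
            alerts.append((name, days_left))
    return sorted(alerts, key=lambda item: item[1])


# Hypothetical inventory. Real dates would be read from the registry or from
# the certificates themselves rather than from vendor notification emails.
inventory = {
    "example.com (domain)": date(2017, 4, 17),
    "api.example.com (TLS certificate)": date(2017, 6, 1),
    "example.io (domain)": date(2018, 4, 24),
}
for name, days_left in expiring_soon(inventory, today=date(2017, 4, 1)):
    print(f"ALERT: {name} expires in {days_left} days")
```

Running a check like this on a schedule, against dates fetched independently of the vendor, is what removes the dependency on vendor notifications.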
Our focus has always been to provide you with the best quality of service and
uptime, and this disruption clearly fell short of expectations.
We apologize for the disruption and the inconvenience that this has caused your
business and your customers. We will work harder to earn back your trust by
executing all of the steps outlined above.
Sincerely,
The Plivo Team
_______________________________________________
VoiceOps mailing list
[email protected]
https://puck.nether.net/mailman/listinfo/voiceops